summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2017-11-30Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull quota & reiserfs changes from Jan Kara: - two error checking improvements for quota - remove bogus i_version increase for reiserfs * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: quota: Check for register_shrinker() failure. quota: propagate error from __dquot_initialize reiserfs: remove unneeded i_version bump
2017-11-29Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Mergr misc fixes from Andrew Morton: "28 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (28 commits) fs/hugetlbfs/inode.c: change put_page/unlock_page order in hugetlbfs_fallocate() mm/hugetlb: fix NULL-pointer dereference on 5-level paging machine autofs: revert "autofs: fix AT_NO_AUTOMOUNT not being honored" autofs: revert "autofs: take more care to not update last_used on path walk" fs/fat/inode.c: fix sb_rdonly() change mm, memcg: fix mem_cgroup_swapout() for THPs mm: migrate: fix an incorrect call of prep_transhuge_page() kmemleak: add scheduling point to kmemleak_scan() scripts/bloat-o-meter: don't fail with division by 0 fs/mbcache.c: make count_objects() more robust Revert "mm/page-writeback.c: print a warning if the vm dirtiness settings are illogical" mm/madvise.c: fix madvise() infinite loop under special circumstances exec: avoid RLIMIT_STACK races with prlimit() IB/core: disable memory registration of filesystem-dax vmas v4l2: disable filesystem-dax mapping support mm: fail get_vaddr_frames() for filesystem-dax mappings mm: introduce get_user_pages_longterm device-dax: implement ->split() to catch invalid munmap attempts mm, hugetlbfs: introduce ->split() to vm_operations_struct scripts/faddr2line: extend usage on generic arch ...
2017-11-29fs/hugetlbfs/inode.c: change put_page/unlock_page order in hugetlbfs_fallocate()Nadav Amit
hugetlfs_fallocate() currently performs put_page() before unlock_page(). This scenario opens a small time window, from the time the page is added to the page cache, until it is unlocked, in which the page might be removed from the page-cache by another core. If the page is removed during this time windows, it might cause a memory corruption, as the wrong page will be unlocked. It is arguable whether this scenario can happen in a real system, and there are several mitigating factors. The issue was found by code inspection (actually grep), and not by actually triggering the flow. Yet, since putting the page before unlocking is incorrect it should be fixed, if only to prevent future breakage or someone copy-pasting this code. Mike said: "I am of the opinion that this does not need to be sent to stable. Although the ordering is current code is incorrect, there is no way for this to be a problem with current locking. In addition, I verified that the perhaps bigger issue with sys_fadvise64(POSIX_FADV_DONTNEED) for hugetlbfs and other filesystems is addressed in 3a77d214807c ("mm: fadvise: avoid fadvise for fs without backing device")" Link: http://lkml.kernel.org/r/20170826191124.51642-1-namit@vmware.com Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()") Signed-off-by: Nadav Amit <namit@vmware.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Eric Biggers <ebiggers3@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-29autofs: revert "autofs: fix AT_NO_AUTOMOUNT not being honored"Ian Kent
Commit 42f461482178 ("autofs: fix AT_NO_AUTOMOUNT not being honored") allowed the fstatat(2) system call to properly honor the AT_NO_AUTOMOUNT flag but introduced a semantic change. In order to honor AT_NO_AUTOMOUNT a semantic change was made to the negative dentry case for stat family system calls in follow_automount(). This changed the unconditional triggering of an automount in this case to no longer be done and an error returned instead. This has caused more problems than I expected so reverting the change is needed. In a discussion with Neil Brown it was concluded that the automount(8) daemon can implement this change without kernel modifications. So that will be done instead and the autofs module documentation updated with a description of the problem and what needs to be done by module users for this specific case. Link: http://lkml.kernel.org/r/151174730120.6162.3848002191530283984.stgit@pluto.themaw.net Fixes: 42f4614821 ("autofs: fix AT_NO_AUTOMOUNT not being honored") Signed-off-by: Ian Kent <raven@themaw.net> Cc: Neil Brown <neilb@suse.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Colin Walters <walters@redhat.com> Cc: Ondrej Holy <oholy@redhat.com> Cc: <stable@vger.kernel.org> [4.11+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-29autofs: revert "autofs: take more care to not update last_used on path walk"Ian Kent
While commit 092a53452bb7 ("autofs: take more care to not update last_used on path walk") helped (partially) resolve a problem where automounts were not expiring due to aggressive accesses from user space it has a side effect for very large environments. This change helps with the expire problem by making the expire more aggressive but, for very large environments, that means more mount requests from clients. When there are a lot of clients that can mean fairly significant server load increases. It turns out I put the last_used in this position to solve this very problem and failed to update my own thinking of the autofs expire policy. So the patch being reverted introduces a regression which should be fixed. Link: http://lkml.kernel.org/r/151174729420.6162.1832622523537052460.stgit@pluto.themaw.net Fixes: 092a53452b ("autofs: take more care to not update last_used on path walk") Signed-off-by: Ian Kent <raven@themaw.net> Reviewed-by: NeilBrown <neilb@suse.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: <stable@vger.kernel.org> [4.11+] Cc: Colin Walters <walters@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Ondrej Holy <oholy@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-29fs/fat/inode.c: fix sb_rdonly() changeOGAWA Hirofumi
Commit bc98a42c1f7d ("VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)") converted fat_remount():new_rdonly from a bool to an int. However fat_remount() depends upon the compiler's conversion of a non-zero integer into boolean `true'. Fix it by switching `new_rdonly' back into a bool. Link: http://lkml.kernel.org/r/87mv3d5x51.fsf@mail.parknet.co.jp Fixes: bc98a42c1f7d0f8 ("VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)") Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Joe Perches <joe@perches.com> Cc: David Howells <dhowells@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-29fs/mbcache.c: make count_objects() more robustJiang Biao
When running ltp stress test for 7*24 hours, vmscan occasionally emits the following warning continuously: mb_cache_scan+0x0/0x3f0 negative objects to delete nr=-9232265467809300450 ... Tracing shows the freeable(mb_cache_count returns) is -1, which causes the continuous accumulation and overflow of total_scan. This patch makes sure that mb_cache_count() cannot return a negative value, which makes the mbcache shrinker more robust. Link: http://lkml.kernel.org/r/1511753419-52328-1-git-send-email-jiang.biao2@zte.com.cn Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Minchan Kim <minchan@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: <zhong.weidong@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-29exec: avoid RLIMIT_STACK races with prlimit()Kees Cook
While the defense-in-depth RLIMIT_STACK limit on setuid processes was protected against races from other threads calling setrlimit(), I missed protecting it against races from external processes calling prlimit(). This adds locking around the change and makes sure that rlim_max is set too. Link: http://lkml.kernel.org/r/20171127193457.GA11348@beast Fixes: 64701dee4178e ("exec: Use sane stack rlimit under secureexec") Signed-off-by: Kees Cook <keescook@chromium.org> Reported-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Reported-by: Brad Spengler <spender@grsecurity.net> Acked-by: Serge Hallyn <serge@hallyn.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-29mm: replace pmd_write with pmd_access_permitted in fault + gup pathsDan Williams
The 'access_permitted' helper is used in the gup-fast path and goes beyond the simple _PAGE_RW check to also: - validate that the mapping is writable from a protection keys standpoint - validate that the pte has _PAGE_USER set since all fault paths where pmd_write is must be referencing user-memory. Link: http://lkml.kernel.org/r/151043111049.2842.15241454964150083466.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-29Merge tag 'nfsd-4.15-1' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd fixes from Bruce Fields: "I screwed up my merge window pull request; I only sent half of what I meant to. There were no new features, just bugfixes of various importance and some very minor cleanup, so I think it's all still appropriate for -rc2. Highlights: - Fixes from Trond for some races in the NFSv4 state code. - Fix from Naofumi Honda for a typo in the blocked lock notificiation code - Fixes from Vasily Averin for some problems starting and stopping lockd especially in network namespaces" * tag 'nfsd-4.15-1' of git://linux-nfs.org/~bfields/linux: (23 commits) lockd: fix "list_add double add" caused by legacy signal interface nlm_shutdown_hosts_net() cleanup race of nfsd inetaddr notifiers vs nn->nfsd_serv change race of lockd inetaddr notifiers vs nlmsvc_rqst change SUNRPC: make cache_detail structures const NFSD: make cache_detail structures const sunrpc: make the function arg as const nfsd: check for use of the closed special stateid nfsd: fix panic in posix_unblock_lock called from nfs4_laundromat lockd: lost rollback of set_grace_period() in lockd_down_net() lockd: added cleanup checks in exit_net hook grace: replace BUG_ON by WARN_ONCE in exit_net hook nfsd: fix locking validator warning on nfs4_ol_stateid->st_mutex class lockd: remove net pointer from messages nfsd: remove net pointer from debug messages nfsd: Fix races with check_stateid_generation() nfsd: Ensure we check stateid validity in the seqid operation checks nfsd: Fix race in lock stateid creation nfsd4: move find_lock_stateid nfsd: Ensure we don't recognise lock stateids after freeing them ...
2017-11-29Merge tag 'for-4.15-rc2-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "We've collected some fixes in since the pre-merge window freeze. There's technically only one regression fix for 4.15, but the rest seems important and candidates for stable. - fix missing flush bio puts in error cases (is serious, but rarely happens) - fix reporting stat::st_blocks for buffered append writes - fix space cache invalidation - fix out of bound memory access when setting zlib level - fix potential memory corruption when fsync fails in the middle - fix crash in integrity checker - incremetnal send fix, path mixup for certain unlink/rename combination - pass flags to writeback so compressed writes can be throttled properly - error handling fixes" * tag 'for-4.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: incremental send, fix wrong unlink path after renaming file btrfs: tree-checker: Fix false panic for sanity test Btrfs: fix list_add corruption and soft lockups in fsync btrfs: Fix wild memory access in compression level parser btrfs: fix deadlock when writing out space cache btrfs: clear space cache inode generation always Btrfs: fix reported number of inode blocks after buffered append writes Btrfs: move definition of the function btrfs_find_new_delalloc_bytes Btrfs: bail out gracefully rather than BUG_ON btrfs: dev_alloc_list is not protected by RCU, use normal list_del btrfs: add missing device::flush_bio puts btrfs: Fix transaction abort during failure in btrfs_rm_dev_item Btrfs: add write_flags for compression bio
2017-11-29quota: Check for register_shrinker() failure.Tetsuo Handa
register_shrinker() might return -ENOMEM error since Linux 3.12. Call panic() as with other failure checks in this function if register_shrinker() failed. Fixes: 1d3d4437eae1 ("vmscan: per-node deferred work") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Jan Kara <jack@suse.com> Cc: Michal Hocko <mhocko@suse.com> Reviewed-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Jan Kara <jack@suse.cz>
2017-11-28Btrfs: incremental send, fix wrong unlink path after renaming fileFilipe Manana
Under some circumstances, an incremental send operation can issue wrong paths for unlink commands related to files that have multiple hard links and some (or all) of those links were renamed between the parent and send snapshots. Consider the following example: Parent snapshot . (ino 256) |---- a/ (ino 257) | |---- b/ (ino 259) | | |---- c/ (ino 260) | | |---- f2 (ino 261) | | | |---- f2l1 (ino 261) | |---- d/ (ino 262) |---- f1l1_2 (ino 258) |---- f2l2 (ino 261) |---- f1_2 (ino 258) Send snapshot . (ino 256) |---- a/ (ino 257) | |---- f2l1/ (ino 263) | |---- b2/ (ino 259) | |---- c/ (ino 260) | | |---- d3 (ino 262) | | |---- f1l1_2 (ino 258) | | |---- f2l2_2 (ino 261) | | |---- f1_2 (ino 258) | | | |---- f2 (ino 261) | |---- f1l2 (ino 258) | |---- d (ino 261) When computing the incremental send stream the following steps happen: 1) When processing inode 261, a rename operation is issued that renames inode 262, which currently as a path of "d", to an orphan name of "o262-7-0". This is done because in the send snapshot, inode 261 has of its hard links with a path of "d" as well. 2) Two link operations are issued that create the new hard links for inode 261, whose names are "d" and "f2l2_2", at paths "/" and "o262-7-0/" respectively. 3) Still while processing inode 261, unlink operations are issued to remove the old hard links of inode 261, with names "f2l1" and "f2l2", at paths "a/" and "d/". However path "d/" does not correspond anymore to the directory inode 262 but corresponds instead to a hard link of inode 261 (link command issued in the previous step). This makes the receiver fail with a ENOTDIR error when attempting the unlink operation. The problem happens because before sending the unlink operation, we failed to detect that inode 262 was one of ancestors for inode 261 in the parent snapshot, and therefore we didn't recompute the path for inode 262 before issuing the unlink operation for the link named "f2l2" of inode 262. The detection failed because the function "is_ancestor()" only follows the first hard link it finds for an inode instead of all of its hard links (as it was originally created for being used with directories only, for which only one hard link exists). So fix this by making "is_ancestor()" follow all hard links of the input inode. A test case for fstests follows soon. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-28quota: propagate error from __dquot_initializeChao Yu
In commit 6184fc0b8dd7 ("quota: Propagate error from ->acquire_dquot()"), we have propagated error from __dquot_initialize to caller, but we forgot to handle such error in add_dquot_ref(), so, currently, during quota accounting information initialization flow, if we failed for some of inodes, we just ignore such error, and do account for others, which is not a good implementation. In this patch, we choose to let user be aware of such error, so after turning on quota successfully, we can make sure all inodes disk usage can be accounted, which will be more reasonable. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz>
2017-11-28btrfs: tree-checker: Fix false panic for sanity testQu Wenruo
[BUG] If we run btrfs with CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y, it will instantly cause kernel panic like: ------ ... assertion failed: 0, file: fs/btrfs/disk-io.c, line: 3853 ... Call Trace: btrfs_mark_buffer_dirty+0x187/0x1f0 [btrfs] setup_items_for_insert+0x385/0x650 [btrfs] __btrfs_drop_extents+0x129a/0x1870 [btrfs] ... ----- [Cause] Btrfs will call btrfs_check_leaf() in btrfs_mark_buffer_dirty() to check if the leaf is valid with CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y. However quite some btrfs_mark_buffer_dirty() callers(*) don't really initialize its item data but only initialize its item pointers, leaving item data uninitialized. This makes tree-checker catch uninitialized data as error, causing such panic. *: These callers include but not limited to setup_items_for_insert() btrfs_split_item() btrfs_expand_item() [Fix] Add a new parameter @check_item_data to btrfs_check_leaf(). With @check_item_data set to false, item data check will be skipped and fallback to old btrfs_check_leaf() behavior. So we can still get early warning if we screw up item pointers, and avoid false panic. Cc: Filipe Manana <fdmanana@gmail.com> Reported-by: Lakshmipathi.G <lakshmipathi.g@gmail.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-27proc: don't report kernel addresses in /proc/<pid>/stackLinus Torvalds
This just changes the file to report them as zero, although maybe even that could be removed. I checked, and at least procps doesn't actually seem to parse the 'stack' file at all. And since the file doesn't necessarily even exist (it requires CONFIG_STACKTRACE), possibly other tools don't really use it either. That said, in case somebody parses it with tools, just having that zero there should keep such tools happy. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-27lockd: fix "list_add double add" caused by legacy signal interfaceVasily Averin
restart_grace() uses hardcoded init_net. It can cause to "list_add double add" in following scenario: 1) nfsd and lockd was started in several net namespaces 2) nfsd in init_net was stopped (lockd was not stopped because it have users from another net namespaces) 3) lockd got signal, called restart_grace() -> set_grace_period() and enabled lock_manager in hardcoded init_net. 4) nfsd in init_net is started again, its lockd_up() calls set_grace_period() and tries to add lock_manager into init_net 2nd time. Jeff Layton suggest: "Make it safe to call locks_start_grace multiple times on the same lock_manager. If it's already on the global grace_list, then don't try to add it again. (But we don't intentionally add twice, so for now we WARN about that case.) With this change, we also need to ensure that the nfsd4 lock manager initializes the list before we call locks_start_grace. While we're at it, move the rest of the nfsd_net initialization into nfs4_state_create_net. I see no reason to have it spread over two functions like it is today." Suggested patch was updated to generate warning in described situation. Suggested-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27nlm_shutdown_hosts_net() cleanupVasily Averin
nlm_complain_hosts() walks through nlm_server_hosts hlist, which should be protected by nlm_host_mutex. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27race of nfsd inetaddr notifiers vs nn->nfsd_serv changeVasily Averin
nfsd_inet[6]addr_event uses nn->nfsd_serv without taking nfsd_mutex, which can be changed during execution of notifiers and crash the host. Moreover if notifiers were enabled in one net namespace they are enabled in all other net namespaces, from creation until destruction. This patch allows notifiers to access nn->nfsd_serv only after the pointer is correctly initialized and delays cleanup until notifiers are no longer in use. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Tested-by: Scott Mayhew <smayhew@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27race of lockd inetaddr notifiers vs nlmsvc_rqst changeVasily Averin
lockd_inet[6]addr_event use nlmsvc_rqst without taken nlmsvc_mutex, nlmsvc_rqst can be changed during execution of notifiers and crash the host. Patch enables access to nlmsvc_rqst only when it was correctly initialized and delays its cleanup until notifiers are no longer in use. Note that nlmsvc_rqst can be temporally set to ERR_PTR, so the "if (nlmsvc_rqst)" check in notifiers is insufficient on its own. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Tested-by: Scott Mayhew <smayhew@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27NFSD: make cache_detail structures constBhumika Goyal
Make these const as they are only getting passed to the function cache_create_net having the argument as const. Signed-off-by: Bhumika Goyal <bhumirks@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27nfsd: check for use of the closed special stateidAndrew Elble
Prevent the use of the closed (invalid) special stateid by clients. Signed-off-by: Andrew Elble <aweits@rit.edu> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27nfsd: fix panic in posix_unblock_lock called from nfs4_laundromatNaofumi Honda
From kernel 4.9, my two nfsv4 servers sometimes suffer from "panic: unable to handle kernel page request" in posix_unblock_lock() called from nfs4_laundromat(). These panics diseappear if we revert the commit "nfsd: add a LRU list for blocked locks". The cause appears to be a typo in nfs4_laundromat(), which is also present in nfs4_state_shutdown_net(). Cc: stable@vger.kernel.org Fixes: 7919d0a27f1e "nfsd: add a LRU list for blocked locks" Cc: jlayton@redhat.com Reveiwed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27lockd: lost rollback of set_grace_period() in lockd_down_net()Vasily Averin
Commit efda760fe95ea ("lockd: fix lockd shutdown race") is incorrect, it removes lockd_manager and disarm grace_period_end for init_net only. If nfsd was started from another net namespace lockd_up_net() calls set_grace_period() that adds lockd_manager into per-netns list and queues grace_period_end delayed work. These action should be reverted in lockd_down_net(). Otherwise it can lead to double list_add on after restart nfsd in netns, and to use-after-free if non-disarmed delayed work will be executed after netns destroy. Fixes: efda760fe95e ("lockd: fix lockd shutdown race") Cc: stable@vger.kernel.org Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27lockd: added cleanup checks in exit_net hookVasily Averin
Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27grace: replace BUG_ON by WARN_ONCE in exit_net hookVasily Averin
Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27nfsd: fix locking validator warning on nfs4_ol_stateid->st_mutex classAndrew Elble
The use of the st_mutex has been confusing the validator. Use the proper nested notation so as to not produce warnings. Signed-off-by: Andrew Elble <aweits@rit.edu> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27lockd: remove net pointer from messagesVasily Averin
Publishing of net pointer is not safe, use net->ns.inum as net ID in debug messages [ 171.757678] lockd_up_net: per-net data created; net=f00001e7 [ 171.767188] NFSD: starting 90-second grace period (net f00001e7) [ 300.653313] lockd: nuking all hosts in net f00001e7... [ 300.653641] lockd: host garbage collection for net f00001e7 [ 300.653968] lockd: nlmsvc_mark_resources for net f00001e7 [ 300.711483] lockd_down_net: per-net data destroyed; net=f00001e7 [ 300.711847] lockd: nuking all hosts in net 0... [ 300.711847] lockd: host garbage collection for net 0 [ 300.711848] lockd: nlmsvc_mark_resources for net 0 Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27nfsd: remove net pointer from debug messagesVasily Averin
Publishing of net pointer is not safe, replace it in debug meesages by net->ns.inum [ 119.989161] nfsd: initializing export module (net: f00001e7). [ 171.767188] NFSD: starting 90-second grace period (net f00001e7) [ 322.185240] nfsd: shutting down export module (net: f00001e7). [ 322.186062] nfsd: export shutdown complete (net: f00001e7). Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27nfsd: Fix races with check_stateid_generation()Trond Myklebust
The various functions that call check_stateid_generation() in order to compare a client-supplied stateid with the nfs4_stid state, usually need to atomically check for closed state. Those that perform the check after locking the st_mutex using nfsd4_lock_ol_stateid() should now be OK, but we do want to fix up the others. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27nfsd: Ensure we check stateid validity in the seqid operation checksTrond Myklebust
After taking the stateid st_mutex, we want to know that the stateid still represents valid state before performing any non-idempotent actions. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27nfsd: Fix race in lock stateid creationTrond Myklebust
If we're looking up a new lock state, and the creation fails, then we want to unhash it, just like we do for OPEN. However in order to do so, we need to that no other LOCK requests can grab the mutex until we have unhashed it (and marked it as closed). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27nfsd4: move find_lock_stateidTrond Myklebust
Trivial cleanup to simplify following patch. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27nfsd: Ensure we don't recognise lock stateids after freeing themTrond Myklebust
In order to deal with lookup races, nfsd4_free_lock_stateid() needs to be able to signal to other stateful functions that the lock stateid is no longer valid. Right now, nfsd_lock() will check whether or not an existing stateid is still hashed, but only in the "new lock" path. To ensure the stateid invalidation is also recognised by the "existing lock" path, and also by a second call to nfsd4_free_lock_stateid() itself, we can change the type to NFS4_CLOSED_STID under the stp->st_mutex. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27nfsd: CLOSE SHOULD return the invalid special stateid for NFSv4.x (x>0)Trond Myklebust
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27nfsd: Fix another OPEN stateid raceTrond Myklebust
If nfsd4_process_open2() is initialising a new stateid, and yet the call to nfs4_get_vfs_file() fails for some reason, then we must declare the stateid closed, and unhash it before dropping the mutex. Right now, we unhash the stateid after dropping the mutex, and without changing the stateid type, meaning that another OPEN could theoretically look it up and attempt to use it. Reported-by: Andrew W Elble <aweits@rit.edu> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27nfsd: Fix stateid races between OPEN and CLOSETrond Myklebust
Open file stateids can linger on the nfs4_file list of stateids even after they have been closed. In order to avoid reusing such a stateid, and confusing the client, we need to recheck the nfs4_stid's type after taking the mutex. Otherwise, we risk reusing an old stateid that was already closed, which will confuse clients that expect new stateids to conform to RFC7530 Sections 9.1.4.2 and 16.2.5 or RFC5661 Sections 8.2.2 and 18.2.4. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27Rename superblock flags (MS_xyz -> SB_xyz)Linus Torvalds
This is a pure automated search-and-replace of the internal kernel superblock flags. The s_flags are now called SB_*, with the names and the values for the moment mirroring the MS_* flags that they're equivalent to. Note how the MS_xyz flags are the ones passed to the mount system call, while the SB_xyz flags are what we then use in sb->s_flags. The script to do this was: # places to look in; re security/*: it generally should *not* be # touched (that stuff parses mount(2) arguments directly), but # there are two places where we really deal with superblock flags. FILES="drivers/mtd drivers/staging/lustre fs ipc mm \ include/linux/fs.h include/uapi/linux/bfs_fs.h \ security/apparmor/apparmorfs.c security/apparmor/include/lib.h" # the list of MS_... constants SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \ DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \ POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \ I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \ ACTIVE NOUSER" SED_PROG= for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done # we want files that contain at least one of MS_..., # with fs/namespace.c and fs/pnode.c excluded. L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c') for f in $L; do sed -i $f $SED_PROG; done Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-27Btrfs: fix list_add corruption and soft lockups in fsyncLiu Bo
Xfstests btrfs/146 revealed this corruption, [ 58.138831] Buffer I/O error on dev dm-0, logical block 2621424, async page read [ 58.151233] BTRFS error (device sdf): bdev /dev/mapper/error-test errs: wr 1, rd 0, flush 0, corrupt 0, gen 0 [ 58.152403] list_add corruption. prev->next should be next (ffff88005e6775d8), but was ffffc9000189be88. (prev=ffffc9000189be88). [ 58.153518] ------------[ cut here ]------------ [ 58.153892] WARNING: CPU: 1 PID: 1287 at lib/list_debug.c:31 __list_add_valid+0x169/0x1f0 ... [ 58.157379] RIP: 0010:__list_add_valid+0x169/0x1f0 ... [ 58.161956] Call Trace: [ 58.162264] btrfs_log_inode_parent+0x5bd/0xfb0 [btrfs] [ 58.163583] btrfs_log_dentry_safe+0x60/0x80 [btrfs] [ 58.164003] btrfs_sync_file+0x4c2/0x6f0 [btrfs] [ 58.164393] vfs_fsync_range+0x5f/0xd0 [ 58.164898] do_fsync+0x5a/0x90 [ 58.165170] SyS_fsync+0x10/0x20 [ 58.165395] entry_SYSCALL_64_fastpath+0x1f/0xbe ... It turns out that we could record btrfs_log_ctx:io_err in log_one_extents when IO fails, but make log_one_extents() return '0' instead of -EIO, so the IO error is not acknowledged by the callers, i.e. btrfs_log_inode_parent(), which would remove btrfs_log_ctx:list from list head 'root->log_ctxs'. Since btrfs_log_ctx is allocated from stack memory, it'd get freed with a object alive on the list. then a future list_add will throw the above warning. This returns the correct error in the above case. Jeff also reported this while testing against his fsync error patch set[1]. [1]: https://www.spinics.net/lists/linux-btrfs/msg65308.html "btrfs list corruption and soft lockups while testing writeback error handling" Fixes: 8407f553268a4611f254 ("Btrfs: fix data corruption after fast fsync and writeback error") Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-27reiserfs: remove unneeded i_version bumpJeff Layton
The i_version field in reiserfs is not initialized and is only ever updated here. Nothing ever views it, so just remove it. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz>
2017-11-27btrfs: Fix wild memory access in compression level parserQu Wenruo
[BUG] Kernel panic when mounting with "-o compress" mount option. KASAN will report like: ------ ================================================================== BUG: KASAN: wild-memory-access in strncmp+0x31/0xc0 Read of size 1 at addr d86735fce994f800 by task mount/662 ... Call Trace: dump_stack+0xe3/0x175 kasan_report+0x163/0x370 __asan_load1+0x47/0x50 strncmp+0x31/0xc0 btrfs_compress_str2level+0x20/0x70 [btrfs] btrfs_parse_options+0xff4/0x1870 [btrfs] open_ctree+0x2679/0x49f0 [btrfs] btrfs_mount+0x1b7f/0x1d30 [btrfs] mount_fs+0x49/0x190 vfs_kern_mount.part.29+0xba/0x280 vfs_kern_mount+0x13/0x20 btrfs_mount+0x31e/0x1d30 [btrfs] mount_fs+0x49/0x190 vfs_kern_mount.part.29+0xba/0x280 do_mount+0xaad/0x1a00 SyS_mount+0x98/0xe0 entry_SYSCALL_64_fastpath+0x1f/0xbe ------ [Cause] For 'compress' and 'compress_force' options, its token doesn't expect any parameter so its args[0] contains uninitialized data. Accessing args[0] will cause above wild memory access. [Fix] For Opt_compress and Opt_compress_force, set compression level to the default. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ set the default in advance ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-27btrfs: fix deadlock when writing out space cacheJosef Bacik
If we fail to prepare our pages for whatever reason (out of memory in our case) we need to make sure to drop the block_group->data_rwsem, otherwise hilarity ensues. Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add label and use existing unlocking code ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-25Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: - The final conversion of timer wheel timers to timer_setup(). A few manual conversions and a large coccinelle assisted sweep and the removal of the old initialization mechanisms and the related code. - Remove the now unused VSYSCALL update code - Fix permissions of /proc/timer_list. I still need to get rid of that file completely - Rename a misnomed clocksource function and remove a stale declaration * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits) m68k/macboing: Fix missed timer callback assignment treewide: Remove TIMER_FUNC_TYPE and TIMER_DATA_TYPE casts timer: Remove redundant __setup_timer*() macros timer: Pass function down to initialization routines timer: Remove unused data arguments from macros timer: Switch callback prototype to take struct timer_list * argument timer: Pass timer_list pointer to callbacks unconditionally Coccinelle: Remove setup_timer.cocci timer: Remove setup_*timer() interface timer: Remove init_timer() interface treewide: setup_timer() -> timer_setup() (2 field) treewide: setup_timer() -> timer_setup() treewide: init_timer() -> setup_timer() treewide: Switch DEFINE_TIMER callbacks to struct timer_list * s390: cmm: Convert timers to use timer_setup() lightnvm: Convert timers to use timer_setup() drivers/net: cris: Convert timers to use timer_setup() drm/vc4: Convert timers to use timer_setup() block/laptop_mode: Convert timers to use timer_setup() net/atm/mpc: Avoid open-coded assignment of timer callback function ...
2017-11-25Merge tag 'afs-fixes-20171124' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull AFS fixes from David Howells: - Make AFS file locking work again. - Don't write to a page that's being written out, but wait for it to complete. - Do d_drop() and d_add() in the right places. - Put keys on error paths. - Remove some redundant code. * tag 'afs-fixes-20171124' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: afs: remove redundant assignment of dvnode to itself afs: cell: Remove unnecessary code in afs_lookup_cell afs: Fix signal handling in some file ops afs: Fix some dentry handling in dir ops and missing key_puts afs: Make afs_write_begin() avoid writing to a page that's being stored afs: Fix file locking
2017-11-24afs: remove redundant assignment of dvnode to itselfColin Ian King
The assignment of dvnode to itself is redundant and can be removed. Cleans up warning detected by cppcheck: fs/afs/dir.c:975: (warning) Redundant assignment of 'dvnode' to itself. Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-24afs: cell: Remove unnecessary code in afs_lookup_cellGustavo A. R. Silva
Due to recent changes this piece of code is no longer needed. Addresses-Coverity-ID: 1462033 Link: https://lkml.kernel.org/r/4923.1510957307@warthog.procyon.org.uk Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com> Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-24afs: Fix signal handling in some file opsDavid Howells
afs_mkdir(), afs_create(), afs_link() and afs_symlink() all need to drop the target dentry if a signal causes the operation to be killed immediately before we try to contact the server. Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-24afs: Fix some dentry handling in dir ops and missing key_putsDavid Howells
Fix some of dentry handling in AFS directory ops: (1) Do d_drop() on the new_dentry before assigning a new inode to it in afs_vnode_new_inode(). It's fine to do this before calling afs_iget() because the operation has taken place on the server. (2) Replace d_instantiate()/d_rehash() with d_add(). (3) Don't d_drop() the new_dentry in afs_rename() on error. Also fix afs_link() and afs_rename() to call key_put() on all error paths where the key is taken. Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-24afs: Make afs_write_begin() avoid writing to a page that's being storedDavid Howells
Make afs_write_begin() wait for a page that's marked PG_writeback because: (1) We need to avoid interference with the data being stored so that the data on the server ends up in a defined state. (2) page->private is used to track the window of dirty data within a page, but it's also used by the storage code to track what's being written, being cleared by the completion notification. Ownership can't be relinquished by the storage code until completion because it a store fails, the data must be remarked dirty. Tracing shows something like the following (edited): x86_64-linux-gn-15940 [1] afs_page_dirty: vn=ffff8800bef33800 9c75 begin 0-125 kworker/u8:3-114 [2] afs_page_dirty: vn=ffff8800bef33800 9c75 store+ 0-125 x86_64-linux-gn-15940 [1] afs_page_dirty: vn=ffff8800bef33800 9c75 begin 0-2052 kworker/u8:3-114 [2] afs_page_dirty: vn=ffff8800bef33800 9c75 clear 0-2052 kworker/u8:3-114 [2] afs_page_dirty: vn=ffff8800bef33800 9c75 store 0-0 kworker/u8:3-114 [2] afs_page_dirty: vn=ffff8800bef33800 9c75 WARN 0-0 The clear (completion) corresponding to the store+ (store continuation from a previous page) happens between the second begin (afs_write_begin) and the store corresponding to that. This results in the second store not seeing any data to write back, leading to the following warning: WARNING: CPU: 2 PID: 114 at ../fs/afs/write.c:403 afs_write_back_from_locked_page+0x19d/0x76c [kafs] Modules linked in: kafs(E) CPU: 2 PID: 114 Comm: kworker/u8:3 Tainted: G E 4.14.0-fscache+ #242 Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014 Workqueue: writeback wb_workfn (flush-afs-2) task: ffff8800cad72600 task.stack: ffff8800cad44000 RIP: 0010:afs_write_back_from_locked_page+0x19d/0x76c [kafs] RSP: 0018:ffff8800cad47aa0 EFLAGS: 00010246 RAX: 0000000000000001 RBX: ffff8800bef33a20 RCX: 0000000000000000 RDX: 000000000000000f RSI: ffffffff81c5d0e0 RDI: ffff8800cad72e78 RBP: ffff8800d31ea1e8 R08: ffff8800c1358000 R09: ffff8800ca00e400 R10: ffff8800cad47a38 R11: ffff8800c5d9e400 R12: 0000000000000000 R13: ffffea0002d9df00 R14: ffffffffa0023c1c R15: 0000000000007fdf FS: 0000000000000000(0000) GS:ffff8800ca700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f85ac6c4000 CR3: 0000000001c10001 CR4: 00000000001606e0 Call Trace: ? clear_page_dirty_for_io+0x23a/0x267 afs_writepages_region+0x1be/0x286 [kafs] afs_writepages+0x60/0x127 [kafs] do_writepages+0x36/0x70 __writeback_single_inode+0x12f/0x635 writeback_sb_inodes+0x2cc/0x452 __writeback_inodes_wb+0x68/0x9f wb_writeback+0x208/0x470 ? wb_workfn+0x22b/0x565 wb_workfn+0x22b/0x565 ? worker_thread+0x230/0x2ac process_one_work+0x2cc/0x517 ? worker_thread+0x230/0x2ac worker_thread+0x1d4/0x2ac ? rescuer_thread+0x29b/0x29b kthread+0x15d/0x165 ? kthread_create_on_node+0x3f/0x3f ? call_usermodehelper_exec_async+0x118/0x11f ret_from_fork+0x24/0x30 Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-22Merge tag 'xfs-4.15-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs fixes from Darrick Wong: - Fix a memory leak in the new in-core extent map - Refactor the xfs_dev_t conversions for easier xfsprogs porting * tag 'xfs-4.15-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: abstract out dev_t conversions xfs: fix memory leak in xfs_iext_free_last_leaf