summaryrefslogtreecommitdiff
path: root/fs/fuse/dir.c
AgeCommit message (Collapse)Author
2022-12-12Merge tag 'fuse-update-6.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse update from Miklos Szeredi: - Allow some write requests to proceed in parallel - Fix a performance problem with allow_sys_admin_access - Add a special kind of invalidation that doesn't immediately purge submounts - On revalidation treat the target of rename(RENAME_NOREPLACE) the same as open(O_EXCL) - Use type safe helpers for some mnt_userns transformations * tag 'fuse-update-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: Rearrange fuse_allow_current_process checks fuse: allow non-extending parallel direct writes on the same file fuse: remove the unneeded result variable fuse: port to vfs{g,u}id_t and associated helpers fuse: Remove user_ns check for FUSE_DEV_IOC_CLONE fuse: always revalidate rename target dentry fuse: add "expire only" mode to FUSE_NOTIFY_INVAL_ENTRY fs/fuse: Replace kmap() with kmap_local_page()
2022-11-23fuse: Rearrange fuse_allow_current_process checksDave Marchevsky
This is a followup to a previous commit of mine [0], which added the allow_sys_admin_access && capable(CAP_SYS_ADMIN) check. This patch rearranges the order of checks in fuse_allow_current_process without changing functionality. Commit 9ccf47b26b73 ("fuse: Add module param for CAP_SYS_ADMIN access bypassing allow_other") added allow_sys_admin_access && capable(CAP_SYS_ADMIN) check to the beginning of the function, with the reasoning that allow_sys_admin_access should be an 'escape hatch' for users with CAP_SYS_ADMIN, allowing them to skip any subsequent checks. However, placing this new check first results in many capable() calls when allow_sys_admin_access is set, where another check would've also returned 1. This can be problematic when a BPF program is tracing capable() calls. At Meta we ran into such a scenario recently. On a host where allow_sys_admin_access is set but most of the FUSE access is from processes which would pass other checks - i.e. they don't need CAP_SYS_ADMIN 'escape hatch' - this results in an unnecessary capable() call for each fs op. We also have a daemon tracing capable() with BPF and doing some data collection, so tracing these extraneous capable() calls has the potential to regress performance for an application doing many FUSE ops. So rearrange the order of these checks such that CAP_SYS_ADMIN 'escape hatch' is checked last. Add a small helper, fuse_permissible_uidgid, to make the logic easier to understand. Previously, if allow_other is set on the fuse_conn, uid/git checking doesn't happen as current_in_userns result is returned. These semantics are maintained here: fuse_permissible_uidgid check only happens if allow_other is not set. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Suggested-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-11-23fuse: always revalidate rename target dentryJiachen Zhang
The previous commit df8629af2934 ("fuse: always revalidate if exclusive create") ensures that the dentries are revalidated on O_EXCL creates. This commit complements it by also performing revalidation for rename target dentries. Otherwise, a rename target file that only exists in kernel dentry cache but not in the filesystem will result in EEXIST if RENAME_NOREPLACE flag is used. Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com> Signed-off-by: Zhang Tianci <zhangtianci.1997@bytedance.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-11-23fuse: add "expire only" mode to FUSE_NOTIFY_INVAL_ENTRYMiklos Szeredi
Add a flag to entry expiration that lets the filesystem expire a dentry without kicking it out from the cache immediately. This makes a difference for overmounted dentries, where plain invalidation would detach all submounts before dropping the dentry from the cache. If only expiry is set on the dentry, then any overmounts are left alone and until ->d_revalidate() is called. Note: ->d_revalidate() is not called for the case of following a submount, so invalidation will only be triggered for the non-overmounted case. The dentry could also be mounted in a different mount instance, in which case any submounts will still be detached. Suggested-by: Jakob Blomer <jblomer@cern.ch> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-10-20fs: rename current get acl methodChristian Brauner
The current way of setting and getting posix acls through the generic xattr interface is error prone and type unsafe. The vfs needs to interpret and fixup posix acls before storing or reporting it to userspace. Various hacks exist to make this work. The code is hard to understand and difficult to maintain in it's current form. Instead of making this work by hacking posix acls through xattr handlers we are building a dedicated posix acl api around the get and set inode operations. This removes a lot of hackiness and makes the codepaths easier to maintain. A lot of background can be found in [1]. The current inode operation for getting posix acls takes an inode argument but various filesystems (e.g., 9p, cifs, overlayfs) need access to the dentry. In contrast to the ->set_acl() inode operation we cannot simply extend ->get_acl() to take a dentry argument. The ->get_acl() inode operation is called from: acl_permission_check() -> check_acl() -> get_acl() which is part of generic_permission() which in turn is part of inode_permission(). Both generic_permission() and inode_permission() are called in the ->permission() handler of various filesystems (e.g., overlayfs). So simply passing a dentry argument to ->get_acl() would amount to also having to pass a dentry argument to ->permission(). We should avoid this unnecessary change. So instead of extending the existing inode operation rename it from ->get_acl() to ->get_inode_acl() and add a ->get_acl() method later that passes a dentry argument and which filesystems that need access to the dentry can implement instead of ->get_inode_acl(). Filesystems like cifs which allow setting and getting posix acls but not using them for permission checking during lookup can simply not implement ->get_inode_acl(). This is intended to be a non-functional change. Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1] Suggested-by/Inspired-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-09-24fuse: implement ->tmpfile()Miklos Szeredi
This is basically equivalent to the FUSE_CREATE operation which creates and opens a regular file. Add a new FUSE_TMPFILE operation, otherwise just reuse the protocol and the code for FUSE_CREATE. Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-21fuse: Add module param for CAP_SYS_ADMIN access bypassing allow_otherDave Marchevsky
Since commit 73f03c2b4b52 ("fuse: Restrict allow_other to the superblock's namespace or a descendant"), access to allow_other FUSE filesystems has been limited to users in the mounting user namespace or descendants. This prevents a process that is privileged in its userns - but not its parent namespaces - from mounting a FUSE fs w/ allow_other that is accessible to processes in parent namespaces. While this restriction makes sense overall it breaks a legitimate usecase: I have a tracing daemon which needs to peek into process' open files in order to symbolicate - similar to 'perf'. The daemon is a privileged process in the root userns, but is unable to peek into FUSE filesystems mounted by processes in child namespaces. This patch adds a module param, allow_sys_admin_access, to act as an escape hatch for this descendant userns logic and for the allow_other mount option in general. Setting allow_sys_admin_access allows processes with CAP_SYS_ADMIN in the initial userns to access FUSE filesystems irrespective of the mounting userns or whether allow_other was set. A sysadmin setting this param must trust FUSEs on the host to not DoS processes as described in 73f03c2b4b52. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-21fuse: fix deadlock between atomic O_TRUNC and page invalidationMiklos Szeredi
fuse_finish_open() will be called with FUSE_NOWRITE set in case of atomic O_TRUNC open(), so commit 76224355db75 ("fuse: truncate pagecache on atomic_o_trunc") replaced invalidate_inode_pages2() by truncate_pagecache() in such a case to avoid the A-A deadlock. However, we found another A-B-B-A deadlock related to the case above, which will cause the xfstests generic/464 testcase hung in our virtio-fs test environment. For example, consider two processes concurrently open one same file, one with O_TRUNC and another without O_TRUNC. The deadlock case is described below, if open(O_TRUNC) is already set_nowrite(acquired A), and is trying to lock a page (acquiring B), open() could have held the page lock (acquired B), and waiting on the page writeback (acquiring A). This would lead to deadlocks. open(O_TRUNC) ---------------------------------------------------------------- fuse_open_common inode_lock [C acquire] fuse_set_nowrite [A acquire] fuse_finish_open truncate_pagecache lock_page [B acquire] truncate_inode_page unlock_page [B release] fuse_release_nowrite [A release] inode_unlock [C release] ---------------------------------------------------------------- open() ---------------------------------------------------------------- fuse_open_common fuse_finish_open invalidate_inode_pages2 lock_page [B acquire] fuse_launder_page fuse_wait_on_page_writeback [A acquire & release] unlock_page [B release] ---------------------------------------------------------------- Besides this case, all calls of invalidate_inode_pages2() and invalidate_inode_pages2_range() in fuse code also can deadlock with open(O_TRUNC). Fix by moving the truncate_pagecache() call outside the nowrite protected region. The nowrite protection is only for delayed writeback (writeback_cache) case, where inode lock does not protect against truncation racing with writes on the server. Write syscalls racing with page cache truncation still get the inode lock protection. This patch also changes the order of filemap_invalidate_lock() vs. fuse_set_nowrite() in fuse_open_common(). This new order matches the order found in fuse_file_fallocate() and fuse_do_setattr(). Reported-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com> Tested-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com> Fixes: e4648309b85a ("fuse: truncate pending writes on O_TRUNC") Cc: <stable@vger.kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-05-09fuse: Convert fuse to read_folioMatthew Wilcox (Oracle)
This is a "weak" conversion which converts straight back to using pages. A full conversion should be performed at some point, hopefully by someone familiar with the filesystem. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-15fuse: Convert from launder_page to launder_folioMatthew Wilcox (Oracle)
Straightforward conversion although the helper functions still assume a single page. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
2021-11-25fuse: send security context of inode on fileVivek Goyal
When a new inode is created, send its security context to server along with creation request (FUSE_CREAT, FUSE_MKNOD, FUSE_MKDIR and FUSE_SYMLINK). This gives server an opportunity to create new file and set security context (possibly atomically). In all the configurations it might not be possible to set context atomically. Like nfs and ceph, use security_dentry_init_security() to dermine security context of inode and send it with create, mkdir, mknod, and symlink requests. Following is the information sent to server. fuse_sectx_header, fuse_secctx, xattr_name, security_context - struct fuse_secctx_header This contains total number of security contexts being sent and total size of all the security contexts (including size of fuse_secctx_header). - struct fuse_secctx This contains size of security context which follows this structure. There is one fuse_secctx instance per security context. - xattr name string This string represents name of xattr which should be used while setting security context. - security context This is the actual security context whose size is specified in fuse_secctx struct. Also add the FUSE_SECURITY_CTX flag for the `flags` field of the fuse_init_out struct. When this flag is set the kernel will append the security context for a newly created inode to the request (create, mkdir, mknod, and symlink). The server is responsible for ensuring that the inode appears atomically (preferrably) with the requested security context. For example, If the server is using SELinux and backed by a "real" linux file system that supports extended attributes it can write the security context value to /proc/thread-self/attr/fscreate before making the syscall to create the inode. This patch is based on patch from Chirantan Ekbote <chirantan@chromium.org> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-28fuse: only update necessary attributesMiklos Szeredi
fuse_update_attributes() refreshes metadata for internal use. Each use needs a particular set of attributes to be refreshed, but currently that cannot be expressed and all but atime are refreshed. Add a mask argument, which lets fuse_update_get_attr() to decide based on the cache_mask and the inval_mask whether a GETATTR call is needed or not. Reported-by: Yongji Xie <xieyongji@bytedance.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-28fuse: take cache_mask into account in getattrMiklos Szeredi
When deciding to send a GETATTR request take into account the cache mask (which attributes are always valid). The cache mask takes precedence over the invalid mask. This results in the GETATTR request not being sent unnecessarily. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-28fuse: add cache_maskMiklos Szeredi
If writeback_cache is enabled, then the size, mtime and ctime attributes of regular files are always valid in the kernel's cache. They are retrieved from userspace only when the inode is freshly looked up. Add a more generic "cache_mask", that indicates which attributes are currently valid in cache. This patch doesn't change behavior. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-28fuse: move reverting attributes to fuse_change_attributes()Miklos Szeredi
In case of writeback_cache fuse_fillattr() would revert the queried attributes to the cached version. Move this to fuse_change_attributes() in order to manage the writeback logic in a central helper. This will be necessary for patches that follow. Only fuse_do_getattr() -> fuse_fillattr() uses the attributes after calling fuse_change_attributes(), so this should not change behavior. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-28fuse: simplify local variables holding writeback cache stateMiklos Szeredi
There are two instances of "bool is_wb = fc->writeback_cache" where the actual use mostly involves checking "is_wb && S_ISREG(inode->i_mode)". Clean up these cases by storing the second condition in the local variable. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-28fuse: selective attribute invalidationMiklos Szeredi
Only invalidate attributes that the operation might have changed. Introduce two constants for common combinations of changed attributes: FUSE_STATX_MODIFY: file contents are modified but not size FUSE_STATX_MODSIZE: size and/or file contents modified Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-28fuse: don't increment nlink in link()Miklos Szeredi
The fuse_iget() call in create_new_entry() already updated the inode with all the new attributes and incremented the attribute version. Incrementing the nlink will result in the wrong count. This wasn't noticed because the attributes were invalidated right after this. Updating ctime is still needed for the writeback case when the ctime is not refreshed. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-22fuse: decrement nlink on overwriting renameMiklos Szeredi
Rename didn't decrement/clear nlink on overwritten target inode. Create a common helper fuse_entry_unlinked() that handles this for unlink, rmdir and rename. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-22fuse: move fuse_invalidate_attr() into fuse_update_ctime()Miklos Szeredi
Logically it belongs there since attributes are invalidated due to the updated ctime. This is a cleanup and should not change behavior. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-22fuse: annotate lock in fuse_reverse_inval_entry()Miklos Szeredi
Add missing inode lock annotatation; found by syzbot. Reported-and-tested-by: syzbot+9f747458f5990eaa8d43@syzkaller.appspotmail.com Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-22fuse: make sure reclaim doesn't write the inodeMiklos Szeredi
In writeback cache mode mtime/ctime updates are cached, and flushed to the server using the ->write_inode() callback. Closing the file will result in a dirty inode being immediately written, but in other cases the inode can remain dirty after all references are dropped. This result in the inode being written back from reclaim, which can deadlock on a regular allocation while the request is being served. The usual mechanisms (GFP_NOFS/PF_MEMALLOC*) don't work for FUSE, because serving a request involves unrelated userspace process(es). Instead do the same as for dirty pages: make sure the inode is written before the last reference is gone. - fallocate(2)/copy_file_range(2): these call file_update_time() or file_modified(), so flush the inode before returning from the call - unlink(2), link(2) and rename(2): these call fuse_update_ctime(), so flush the ctime directly from this helper Reported-by: chenguanyou <chenguanyou@xiaomi.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-07-13fuse: Convert to using invalidate_lockJan Kara
Use invalidate_lock instead of fuse's private i_mmap_sem. The intended purpose is exactly the same. By this conversion we fix a long standing race between hole punching and read(2) / readahead(2) paths that can lead to stale page cache contents. CC: Miklos Szeredi <miklos@szeredi.hu> Reviewed-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz>
2021-06-22fuse: fix illegal access to inode with reused nodeidAmir Goldstein
Server responds to LOOKUP and other ops (READDIRPLUS/CREATE/MKNOD/...) with ourarg containing nodeid and generation. If a fuse inode is found in inode cache with the same nodeid but different generation, the existing fuse inode should be unhashed and marked "bad" and a new inode with the new generation should be hashed instead. This can happen, for example, with passhrough fuse filesystem that returns the real filesystem ino/generation on lookup and where real inode numbers can get recycled due to real files being unlinked not via the fuse passthrough filesystem. With current code, this situation will not be detected and an old fuse dentry that used to point to an older generation real inode, can be used to access a completely new inode, which should be accessed only via the new dentry. Note that because the FORGET message carries the nodeid w/o generation, the server should wait to get FORGET counts for the nlookup counts of the old and reused inodes combined, before it can free the resources associated to that nodeid. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-06-22fuse: Switch to fc_mount() for submountsGreg Kurz
fc_mount() already handles the vfs_get_tree(), sb->s_umount unlocking and vfs_create_mount() sequence. Using it greatly simplifies fuse_dentry_automount(). Signed-off-by: Greg Kurz <groug@kaod.org> Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-06-22fuse: Call vfs_get_tree() for submountsGreg Kurz
We recently fixed an infinite loop by setting the SB_BORN flag on submounts along with the write barrier needed by super_cache_count(). This is the job of vfs_get_tree() and FUSE shouldn't have to care about the barrier at all. Split out some code from fuse_dentry_automount() to the dedicated fuse_get_tree_submount() handler for submounts and call vfs_get_tree(). Signed-off-by: Greg Kurz <groug@kaod.org> Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-06-09fuse: Fix infinite loop in sget_fc()Greg Kurz
We don't set the SB_BORN flag on submounts. This is wrong as these superblocks are then considered as partially constructed or dying in the rest of the code and can break some assumptions. One such case is when you have a virtiofs filesystem with submounts and you try to mount it again : virtio_fs_get_tree() tries to obtain a superblock with sget_fc(). The logic in sget_fc() is to loop until it has either found an existing matching superblock with SB_BORN set or to create a brand new one. It is assumed that a superblock without SB_BORN is transient and the loop is restarted. Forgetting to set SB_BORN on submounts hence causes sget_fc() to retry forever. Setting SB_BORN requires special care, i.e. a write barrier for super_cache_count() which can check SB_BORN without taking any lock. We should call vfs_get_tree() to deal with that but this requires to have a proper ->get_tree() implementation for submounts, which is a bigger piece of work. Go for a simple bug fix in the meatime. Fixes: bf109c64040f ("fuse: implement crossmounts") Cc: stable@vger.kernel.org # v5.10+ Signed-off-by: Greg Kurz <groug@kaod.org> Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-06-09fuse: Fix crash if superblock of submount gets killed earlyGreg Kurz
As soon as fuse_dentry_automount() does up_write(&sb->s_umount), the superblock can theoretically be killed. If this happens before the submount was added to the &fc->mounts list, fuse_mount_remove() later crashes in list_del_init() because it assumes the submount to be already there. Add the submount before dropping sb->s_umount to fix the inconsistency. It is okay to nest fc->killsb under sb->s_umount, we already do this on the ->kill_sb() path. Signed-off-by: Greg Kurz <groug@kaod.org> Fixes: bf109c64040f ("fuse: implement crossmounts") Cc: stable@vger.kernel.org # v5.10+ Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-06-09fuse: Fix crash in fuse_dentry_automount() error pathGreg Kurz
If fuse_fill_super_submount() returns an error, the error path triggers a crash: [ 26.206673] BUG: kernel NULL pointer dereference, address: 0000000000000000 [...] [ 26.226362] RIP: 0010:__list_del_entry_valid+0x25/0x90 [...] [ 26.247938] Call Trace: [ 26.248300] fuse_mount_remove+0x2c/0x70 [fuse] [ 26.248892] virtio_kill_sb+0x22/0x160 [virtiofs] [ 26.249487] deactivate_locked_super+0x36/0xa0 [ 26.250077] fuse_dentry_automount+0x178/0x1a0 [fuse] The crash happens because fuse_mount_remove() assumes that the FUSE mount was already added to list under the FUSE connection, but this only done after fuse_fill_super_submount() has returned success. This means that until fuse_fill_super_submount() has returned success, the FUSE mount isn't actually owned by the superblock. We should thus reclaim ownership by clearing sb->s_fs_info, which will skip the call to fuse_mount_remove(), and perform rollback, like virtio_fs_get_tree() already does for the root sb. Fixes: bf109c64040f ("fuse: implement crossmounts") Cc: stable@vger.kernel.org # v5.10+ Signed-off-by: Greg Kurz <groug@kaod.org> Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-04-27Merge branch 'miklos.fileattr' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull fileattr conversion updates from Miklos Szeredi via Al Viro: "This splits the handling of FS_IOC_[GS]ETFLAGS from ->ioctl() into a separate method. The interface is reasonably uniform across the filesystems that support it and gives nice boilerplate removal" * 'miklos.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (23 commits) ovl: remove unneeded ioctls fuse: convert to fileattr fuse: add internal open/release helpers fuse: unsigned open flags fuse: move ioctl to separate source file vfs: remove unused ioctl helpers ubifs: convert to fileattr reiserfs: convert to fileattr ocfs2: convert to fileattr nilfs2: convert to fileattr jfs: convert to fileattr hfsplus: convert to fileattr efivars: convert to fileattr xfs: convert to fileattr orangefs: convert to fileattr gfs2: convert to fileattr f2fs: convert to fileattr ext4: convert to fileattr ext2: convert to fileattr btrfs: convert to fileattr ...
2021-04-12fuse: convert to fileattrMiklos Szeredi
Since fuse just passes ioctl args through to/from server, converting to the fileattr API is more involved, than most other filesystems. Both .fileattr_set() and .fileattr_get() need to obtain an open file to operate on. The simplest way is with the following sequence: FUSE_OPEN FUSE_IOCTL FUSE_RELEASE If this turns out to be a performance problem, it could be optimized for the case when there's already a file (any file) open for the inode. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-04-12fuse: unsigned open flagsMiklos Szeredi
Release helpers used signed int. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-03-08new helper: inode_wrong_type()Al Viro
inode_wrong_type(inode, mode) returns true if setting inode->i_mode to given value would've changed the inode type. We have enough of those checks open-coded to make a helper worthwhile. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-01-24fs: make helpers idmap mount awareChristian Brauner
Extend some inode methods with an additional user namespace argument. A filesystem that is aware of idmapped mounts will receive the user namespace the mount has been marked with. This can be used for additional permission checking and also to enable filesystems to translate between uids and gids if they need to. We have implemented all relevant helpers in earlier patches. As requested we simply extend the exisiting inode method instead of introducing new ones. This is a little more code churn but it's mostly mechanical and doesnt't leave us with additional inode methods. Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24stat: handle idmapped mountsChristian Brauner
The generic_fillattr() helper fills in the basic attributes associated with an inode. Enable it to handle idmapped mounts. If the inode is accessed through an idmapped mount map it into the mount's user namespace before we store the uid and gid. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-12-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24attr: handle idmapped mountsChristian Brauner
When file attributes are changed most filesystems rely on the setattr_prepare(), setattr_copy(), and notify_change() helpers for initialization and permission checking. Let them handle idmapped mounts. If the inode is accessed through an idmapped mount map it into the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Helpers that perform checks on the ia_uid and ia_gid fields in struct iattr assume that ia_uid and ia_gid are intended values and have already been mapped correctly at the userspace-kernelspace boundary as we already do today. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24namei: make permission helpers idmapped mount awareChristian Brauner
The two helpers inode_permission() and generic_permission() are used by the vfs to perform basic permission checking by verifying that the caller is privileged over an inode. In order to handle idmapped mounts we extend the two helpers with an additional user namespace argument. On idmapped mounts the two helpers will make sure to map the inode according to the mount's user namespace and then peform identical permission checks to inode_permission() and generic_permission(). If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-12-10fuse: fix bad inodeMiklos Szeredi
Jan Kara's analysis of the syzbot report (edited): The reproducer opens a directory on FUSE filesystem, it then attaches dnotify mark to the open directory. After that a fuse_do_getattr() call finds that attributes returned by the server are inconsistent, and calls make_bad_inode() which, among other things does: inode->i_mode = S_IFREG; This then confuses dnotify which doesn't tear down its structures properly and eventually crashes. Avoid calling make_bad_inode() on a live inode: switch to a private flag on the fuse inode. Also add the test to ops which the bad_inode_ops would have caught. This bug goes back to the initial merge of fuse in 2.6.14... Reported-by: syzbot+f427adf9324b92652ccc@syzkaller.appspotmail.com Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Tested-by: Jan Kara <jack@suse.cz> Cc: <stable@vger.kernel.org>
2020-11-11fuse: add a flag FUSE_OPEN_KILL_SUIDGID for open() requestVivek Goyal
With FUSE_HANDLE_KILLPRIV_V2 support, server will need to kill suid/sgid/ security.capability on open(O_TRUNC), if server supports FUSE_ATOMIC_O_TRUNC. But server needs to kill suid/sgid only if caller does not have CAP_FSETID. Given server does not have this information, client needs to send this info to server. So add a flag FUSE_OPEN_KILL_SUIDGID to fuse_open_in request which tells server to kill suid/sgid (only if group execute is set). This flag is added to the FUSE_OPEN request, as well as the FUSE_CREATE request if the create was non-exclusive, since that might result in an existing file being opened/truncated. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-11-11fuse: don't send ATTR_MODE to kill suid/sgid for handle_killpriv_v2Vivek Goyal
If client does a write() on a suid/sgid file, VFS will first call fuse_setattr() with ATTR_KILL_S[UG]ID set. This requires sending setattr to file server with ATTR_MODE set to kill suid/sgid. But to do that client needs to know latest mode otherwise it is racy. To reduce the race window, current code first call fuse_do_getattr() to get latest ->i_mode and then resets suid/sgid bits and sends rest to server with setattr(ATTR_MODE). This does not reduce the race completely but narrows race window significantly. With fc->handle_killpriv_v2 enabled, it should be possible to remove this race completely. Do not kill suid/sgid with ATTR_MODE at all. It will be killed by server when WRITE request is sent to server soon. This is similar to fc->handle_killpriv logic. V2 is just more refined version of protocol. Hence this patch does not send ATTR_MODE to kill suid/sgid if fc->handle_killpriv_v2 is enabled. This creates an issue if fc->writeback_cache is enabled. In that case WRITE can be cached in guest and server might not see WRITE request and hence will not kill suid/sgid. Miklos suggested that in such cases, we should fallback to a writethrough WRITE instead and that will generate WRITE request and kill suid/sgid. This patch implements that too. But this relies on client seeing the suid/sgid set. If another client sets suid/sgid and this client does not see it immideately, then we will not fallback to writethrough WRITE. So this is one limitation with both fc->handle_killpriv_v2 and fc->writeback_cache enabled. Both the options are not fully compatible. But might be good enough for many use cases. Note: This patch is not checking whether security.capability is set or not when falling back to writethrough path. If suid/sgid is not set and only security.capability is set, that will be taken care of by file_remove_privs() call in ->writeback_cache path. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-11-11fuse: setattr should set FATTR_KILL_SUIDGIDVivek Goyal
If fc->handle_killpriv_v2 is enabled, we expect file server to clear suid/sgid/security.capbility upon chown/truncate/write as appropriate. Upon truncate (ATTR_SIZE), suid/sgid are cleared only if caller does not have CAP_FSETID. File server does not know whether caller has CAP_FSETID or not. Hence set FATTR_KILL_SUIDGID upon truncate to let file server know that caller does not have CAP_FSETID and it should kill suid/sgid as appropriate. On chown (ATTR_UID/ATTR_GID) suid/sgid need to be cleared irrespective of capabilities of calling process, so set FATTR_KILL_SUIDGID unconditionally in that case. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-11-11fuse: always revalidate if exclusive createMiklos Szeredi
Failure to do so may result in EEXIST even if the file only exists in the cache and not in the filesystem. The atomic nature of O_EXCL mandates that the cached state should be ignored and existence verified anew. Reported-by: Ken Schalk <kschalk@nvidia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-11-11fuse: get rid of fuse_mount refcountMiklos Szeredi
Fuse mount now only ever has a refcount of one (before being freed) so the count field is unnecessary. Remove the refcounting and fold fuse_mount_put() into callers. The only caller of fuse_mount_put() where fm->fc was NULL is fuse_dentry_automount() and here the fuse_conn_put() can simply be omitted. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-10-09fuse: implement crossmountsMax Reitz
FUSE servers can indicate crossmount points by setting FUSE_ATTR_SUBMOUNT in fuse_attr.flags. The inode will then be marked as S_AUTOMOUNT, and the .d_automount implementation creates a new submount at that location, so that the submount gets a distinct st_dev value. Note that all submounts get a distinct superblock and a distinct st_dev value, so for virtio-fs, even if the same filesystem is mounted more than once on the host, none of its mount points will have the same st_dev. We need distinct superblocks because the superblock points to the root node, but the different host mounts may show different trees (e.g. due to submounts in some of them, but not in others). Right now, this behavior is only enabled when fuse_conn.auto_submounts is set, which is the case only for virtio-fs. Signed-off-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-18fuse: split fuse_mount off of fuse_connMax Reitz
We want to allow submounts for the same fuse_conn, but with different superblocks so that each of the submounts has its own device ID. To do so, we need to split all mount-specific information off of fuse_conn into a new fuse_mount structure, so that multiple mounts can share a single fuse_conn. We need to take care only to perform connection-level actions once (i.e. when the fuse_conn and thus the first fuse_mount are established, or when the last fuse_mount and thus the fuse_conn are destroyed). For example, fuse_sb_destroy() must invoke fuse_send_destroy() until the last superblock is released. To do so, we keep track of which fuse_mount is the root mount and perform all fuse_conn-level actions only when this fuse_mount is involved. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-10virtiofs: serialize truncate/punch_hole and dax fault pathVivek Goyal
Currently in fuse we don't seem have any lock which can serialize fault path with truncate/punch_hole path. With dax support I need one for following reasons. 1. Dax requirement DAX fault code relies on inode size being stable for the duration of fault and want to serialize with truncate/punch_hole and they explicitly mention it. static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp, const struct iomap_ops *ops) /* * Check whether offset isn't beyond end of file now. Caller is * supposed to hold locks serializing us with truncate / punch hole so * this is a reliable test. */ max_pgoff = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); 2. Make sure there are no users of pages being truncated/punch_hole get_user_pages() might take references to page and then do some DMA to said pages. Filesystem might truncate those pages without knowing that a DMA is in progress or some I/O is in progress. So use dax_layout_busy_page() to make sure there are no such references and I/O is not in progress on said pages before moving ahead with truncation. 3. Limitation of kvm page fault error reporting If we are truncating file on host first and then removing mappings in guest lateter (truncate page cache etc), then this could lead to a problem with KVM. Say a mapping is in place in guest and truncation happens on host. Now if guest accesses that mapping, then host will take a fault and kvm will either exit to qemu or spin infinitely. IOW, before we do truncation on host, we need to make sure that guest inode does not have any mapping in that region or whole file. 4. virtiofs memory range reclaim Soon I will introduce the notion of being able to reclaim dax memory ranges from a fuse dax inode. There also I need to make sure that no I/O or fault is going on in the reclaimed range and nobody is using it so that range can be reclaimed without issues. Currently if we take inode lock, that serializes read/write. But it does not do anything for faults. So I add another semaphore fuse_inode->i_mmap_sem for this purpose. It can be used to serialize with faults. As of now, I am adding taking this semaphore only in dax fault path and not regular fault path because existing code does not have one. May be existing code can benefit from it as well to take care of some races, but that we can fix later if need be. For now, I am just focussing only on DAX path which is new path. Also added logic to take fuse_inode->i_mmap_sem in truncate/punch_hole/open(O_TRUNC) path to make sure file truncation and fuse dax fault are mutually exlusive and avoid all the above problems. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-05-19fuse: always allow query of st_devMiklos Szeredi
Fuse mounts without "allow_other" are off-limits to all non-owners. Yet it makes sense to allow querying st_dev on the root, since this value is provided by the kernel, not the userspace filesystem. Allow statx(2) with a zero request mask to succeed on a fuse mounts for all users. Reported-by: Nikolaus Rath <Nikolaus@rath.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-02-06fuse: Support RENAME_WHITEOUT flagVivek Goyal
Allow fuse to pass RENAME_WHITEOUT to fuse server. Overlayfs on top of virtiofs uses RENAME_WHITEOUT. Without this patch renaming a directory in overlayfs (dir is on lower) fails with -EINVAL. With this patch it works. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-11-12fuse: verify nlinkMiklos Szeredi
When adding a new hard link, make sure that i_nlink doesn't overflow. Fixes: ac45d61357e8 ("fuse: fix nlink after unlink") Cc: <stable@vger.kernel.org> # v3.4 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-11-12fuse: verify attributesMiklos Szeredi
If a filesystem returns negative inode sizes, future reads on the file were causing the cpu to spin on truncate_pagecache. Create a helper to validate the attributes. This now does two things: - check the file mode - check if the file size fits in i_size without overflowing Reported-by: Arijit Banerjee <arijit@rubrik.com> Fixes: d8a5ba45457e ("[PATCH] FUSE - core") Cc: <stable@vger.kernel.org> # v2.6.14 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>