summaryrefslogtreecommitdiff
path: root/fs/internal.h
AgeCommit message (Collapse)Author
2021-02-23Merge tag 'idmapped-mounts-v5.12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull idmapped mounts from Christian Brauner: "This introduces idmapped mounts which has been in the making for some time. Simply put, different mounts can expose the same file or directory with different ownership. This initial implementation comes with ports for fat, ext4 and with Christoph's port for xfs with more filesystems being actively worked on by independent people and maintainers. Idmapping mounts handle a wide range of long standing use-cases. Here are just a few: - Idmapped mounts make it possible to easily share files between multiple users or multiple machines especially in complex scenarios. For example, idmapped mounts will be used in the implementation of portable home directories in systemd-homed.service(8) where they allow users to move their home directory to an external storage device and use it on multiple computers where they are assigned different uids and gids. This effectively makes it possible to assign random uids and gids at login time. - It is possible to share files from the host with unprivileged containers without having to change ownership permanently through chown(2). - It is possible to idmap a container's rootfs and without having to mangle every file. For example, Chromebooks use it to share the user's Download folder with their unprivileged containers in their Linux subsystem. - It is possible to share files between containers with non-overlapping idmappings. - Filesystem that lack a proper concept of ownership such as fat can use idmapped mounts to implement discretionary access (DAC) permission checking. - They allow users to efficiently changing ownership on a per-mount basis without having to (recursively) chown(2) all files. In contrast to chown (2) changing ownership of large sets of files is instantenous with idmapped mounts. This is especially useful when ownership of a whole root filesystem of a virtual machine or container is changed. With idmapped mounts a single syscall mount_setattr syscall will be sufficient to change the ownership of all files. - Idmapped mounts always take the current ownership into account as idmappings specify what a given uid or gid is supposed to be mapped to. This contrasts with the chown(2) syscall which cannot by itself take the current ownership of the files it changes into account. It simply changes the ownership to the specified uid and gid. This is especially problematic when recursively chown(2)ing a large set of files which is commong with the aforementioned portable home directory and container and vm scenario. - Idmapped mounts allow to change ownership locally, restricting it to specific mounts, and temporarily as the ownership changes only apply as long as the mount exists. Several userspace projects have either already put up patches and pull-requests for this feature or will do so should you decide to pull this: - systemd: In a wide variety of scenarios but especially right away in their implementation of portable home directories. https://systemd.io/HOME_DIRECTORY/ - container runtimes: containerd, runC, LXD:To share data between host and unprivileged containers, unprivileged and privileged containers, etc. The pull request for idmapped mounts support in containerd, the default Kubernetes runtime is already up for quite a while now: https://github.com/containerd/containerd/pull/4734 - The virtio-fs developers and several users have expressed interest in using this feature with virtual machines once virtio-fs is ported. - ChromeOS: Sharing host-directories with unprivileged containers. I've tightly synced with all those projects and all of those listed here have also expressed their need/desire for this feature on the mailing list. For more info on how people use this there's a bunch of talks about this too. Here's just two recent ones: https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf https://fosdem.org/2021/schedule/event/containers_idmap/ This comes with an extensive xfstests suite covering both ext4 and xfs: https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts It covers truncation, creation, opening, xattrs, vfscaps, setid execution, setgid inheritance and more both with idmapped and non-idmapped mounts. It already helped to discover an unrelated xfs setgid inheritance bug which has since been fixed in mainline. It will be sent for inclusion with the xfstests project should you decide to merge this. In order to support per-mount idmappings vfsmounts are marked with user namespaces. The idmapping of the user namespace will be used to map the ids of vfs objects when they are accessed through that mount. By default all vfsmounts are marked with the initial user namespace. The initial user namespace is used to indicate that a mount is not idmapped. All operations behave as before and this is verified in the testsuite. Based on prior discussions we want to attach the whole user namespace and not just a dedicated idmapping struct. This allows us to reuse all the helpers that already exist for dealing with idmappings instead of introducing a whole new range of helpers. In addition, if we decide in the future that we are confident enough to enable unprivileged users to setup idmapped mounts the permission checking can take into account whether the caller is privileged in the user namespace the mount is currently marked with. The user namespace the mount will be marked with can be specified by passing a file descriptor refering to the user namespace as an argument to the new mount_setattr() syscall together with the new MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern of extensibility. The following conditions must be met in order to create an idmapped mount: - The caller must currently have the CAP_SYS_ADMIN capability in the user namespace the underlying filesystem has been mounted in. - The underlying filesystem must support idmapped mounts. - The mount must not already be idmapped. This also implies that the idmapping of a mount cannot be altered once it has been idmapped. - The mount must be a detached/anonymous mount, i.e. it must have been created by calling open_tree() with the OPEN_TREE_CLONE flag and it must not already have been visible in the filesystem. The last two points guarantee easier semantics for userspace and the kernel and make the implementation significantly simpler. By default vfsmounts are marked with the initial user namespace and no behavioral or performance changes are observed. The manpage with a detailed description can be found here: https://git.kernel.org/brauner/man-pages/c/1d7b902e2875a1ff342e036a9f866a995640aea8 In order to support idmapped mounts, filesystems need to be changed and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The patches to convert individual filesystem are not very large or complicated overall as can be seen from the included fat, ext4, and xfs ports. Patches for other filesystems are actively worked on and will be sent out separately. The xfstestsuite can be used to verify that port has been done correctly. The mount_setattr() syscall is motivated independent of the idmapped mounts patches and it's been around since July 2019. One of the most valuable features of the new mount api is the ability to perform mounts based on file descriptors only. Together with the lookup restrictions available in the openat2() RESOLVE_* flag namespace which we added in v5.6 this is the first time we are close to hardened and race-free (e.g. symlinks) mounting and path resolution. While userspace has started porting to the new mount api to mount proper filesystems and create new bind-mounts it is currently not possible to change mount options of an already existing bind mount in the new mount api since the mount_setattr() syscall is missing. With the addition of the mount_setattr() syscall we remove this last restriction and userspace can now fully port to the new mount api, covering every use-case the old mount api could. We also add the crucial ability to recursively change mount options for a whole mount tree, both removing and adding mount options at the same time. This syscall has been requested multiple times by various people and projects. There is a simple tool available at https://github.com/brauner/mount-idmapped that allows to create idmapped mounts so people can play with this patch series. I'll add support for the regular mount binary should you decide to pull this in the following weeks: Here's an example to a simple idmapped mount of another user's home directory: u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt u1001@f2-vm:/$ ls -al /home/ubuntu/ total 28 drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 . drwxr-xr-x 4 root root 4096 Oct 28 04:00 .. -rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history -rw-r--r-- 1 ubuntu ubuntu 220 Feb 25 2020 .bash_logout -rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25 2020 .bashrc -rw-r--r-- 1 ubuntu ubuntu 807 Feb 25 2020 .profile -rw-r--r-- 1 ubuntu ubuntu 0 Oct 16 16:11 .sudo_as_admin_successful -rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo u1001@f2-vm:/$ ls -al /mnt/ total 28 drwxr-xr-x 2 u1001 u1001 4096 Oct 28 22:07 . drwxr-xr-x 29 root root 4096 Oct 28 22:01 .. -rw------- 1 u1001 u1001 3154 Oct 28 22:12 .bash_history -rw-r--r-- 1 u1001 u1001 220 Feb 25 2020 .bash_logout -rw-r--r-- 1 u1001 u1001 3771 Feb 25 2020 .bashrc -rw-r--r-- 1 u1001 u1001 807 Feb 25 2020 .profile -rw-r--r-- 1 u1001 u1001 0 Oct 16 16:11 .sudo_as_admin_successful -rw------- 1 u1001 u1001 1144 Oct 28 00:43 .viminfo u1001@f2-vm:/$ touch /mnt/my-file u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file u1001@f2-vm:/$ ls -al /mnt/my-file -rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file u1001@f2-vm:/$ ls -al /home/ubuntu/my-file -rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file u1001@f2-vm:/$ getfacl /mnt/my-file getfacl: Removing leading '/' from absolute path names # file: mnt/my-file # owner: u1001 # group: u1001 user::rw- user:u1001:rwx group::rw- mask::rwx other::r-- u1001@f2-vm:/$ getfacl /home/ubuntu/my-file getfacl: Removing leading '/' from absolute path names # file: home/ubuntu/my-file # owner: ubuntu # group: ubuntu user::rw- user:ubuntu:rwx group::rw- mask::rwx other::r--" * tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits) xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl xfs: support idmapped mounts ext4: support idmapped mounts fat: handle idmapped mounts tests: add mount_setattr() selftests fs: introduce MOUNT_ATTR_IDMAP fs: add mount_setattr() fs: add attr_flags_to_mnt_flags helper fs: split out functions to hold writers namespace: only take read lock in do_reconfigure_mnt() mount: make {lock,unlock}_mount_hash() static namespace: take lock_mount_hash() directly when changing flags nfs: do not export idmapped mounts overlayfs: do not mount on top of idmapped mounts ecryptfs: do not mount on top of idmapped mounts ima: handle idmapped mounts apparmor: handle idmapped mounts fs: make helpers idmap mount aware exec: handle idmapped mounts would_dump: handle idmapped mounts ...
2021-02-21Merge tag 'for-5.12/io_uring-2021-02-17' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring updates from Jens Axboe: "Highlights from this cycles are things like request recycling and task_work optimizations, which net us anywhere from 10-20% of speedups on workloads that mostly are inline. This work was originally done to put io_uring under memcg, which adds considerable overhead. But it's a really nice win as well. Also worth highlighting is the LOOKUP_CACHED work in the VFS, and using it in io_uring. Greatly speeds up the fast path for file opens. Summary: - Put io_uring under memcg protection. We accounted just the rings themselves under rlimit memlock before, now we account everything. - Request cache recycling, persistent across invocations (Pavel, me) - First part of a cleanup/improvement to buffer registration (Bijan) - SQPOLL fixes (Hao) - File registration NULL pointer fixup (Dan) - LOOKUP_CACHED support for io_uring - Disable /proc/thread-self/ for io_uring, like we do for /proc/self - Add Pavel to the io_uring MAINTAINERS entry - Tons of code cleanups and optimizations (Pavel) - Support for skip entries in file registration (Noah)" * tag 'for-5.12/io_uring-2021-02-17' of git://git.kernel.dk/linux-block: (103 commits) io_uring: tctx->task_lock should be IRQ safe proc: don't allow async path resolution of /proc/thread-self components io_uring: kill cached requests from exiting task closing the ring io_uring: add helper to free all request caches io_uring: allow task match to be passed to io_req_cache_free() io-wq: clear out worker ->fs and ->files io_uring: optimise io_init_req() flags setting io_uring: clean io_req_find_next() fast check io_uring: don't check PF_EXITING from syscall io_uring: don't split out consume out of SQE get io_uring: save ctx put/get for task_work submit io_uring: don't duplicate io_req_task_queue() io_uring: optimise SQPOLL mm/files grabbing io_uring: optimise out unlikely link queue io_uring: take compl state from submit state io_uring: inline io_complete_rw_common() io_uring: move res check out of io_rw_reissue() io_uring: simplify iopoll reissuing io_uring: clean up io_req_free_batch_finish() io_uring: move submit side state closer in the ring ...
2021-02-01fs: provide locked helper variant of close_fd_get_file()Jens Axboe
Assumes current->files->file_lock is already held on invocation. Helps the caller check the file before removing the fd, if it needs to. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25teach sendfile(2) to handle send-to-pipe directlyAl Viro
no point going through the intermediate pipe Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-01-24namei: handle idmapped mounts in may_*() helpersChristian Brauner
The may_follow_link(), may_linkat(), may_lookup(), may_open(), may_o_create(), may_create_in_sticky(), may_delete(), and may_create() helpers determine whether the caller is privileged enough to perform the associated operations. Let them handle idmapped mounts by mapping the inode or fsids according to the mount's user namespace. Afterwards the checks are identical to non-idmapped inodes. The patch takes care to retrieve the mount's user namespace right before performing permission checks and passing it down into the fileystem so the user namespace can't change in between by someone idmapping a mount that is currently not idmapped. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-13-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-12-16Merge tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: "Another series of killing more code than what is being added, again thanks to Christoph's relentless cleanups and tech debt tackling. This contains: - blk-iocost improvements (Baolin Wang) - part0 iostat fix (Jeffle Xu) - Disable iopoll for split bios (Jeffle Xu) - block tracepoint cleanups (Christoph Hellwig) - Merging of struct block_device and hd_struct (Christoph Hellwig) - Rework/cleanup of how block device sizes are updated (Christoph Hellwig) - Simplification of gendisk lookup and removal of block device aliasing (Christoph Hellwig) - Block device ioctl cleanups (Christoph Hellwig) - Removal of bdget()/blkdev_get() as exported API (Christoph Hellwig) - Disk change rework, avoid ->revalidate_disk() (Christoph Hellwig) - sbitmap improvements (Pavel Begunkov) - Hybrid polling fix (Pavel Begunkov) - bvec iteration improvements (Pavel Begunkov) - Zone revalidation fixes (Damien Le Moal) - blk-throttle limit fix (Yu Kuai) - Various little fixes" * tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block: (126 commits) blk-mq: fix msec comment from micro to milli seconds blk-mq: update arg in comment of blk_mq_map_queue blk-mq: add helper allocating tagset->tags Revert "block: Fix a lockdep complaint triggered by request queue flushing" nvme-loop: use blk_mq_hctx_set_fq_lock_class to set loop's lock class blk-mq: add new API of blk_mq_hctx_set_fq_lock_class block: disable iopoll for split bio block: Improve blk_revalidate_disk_zones() checks sbitmap: simplify wrap check sbitmap: replace CAS with atomic and sbitmap: remove swap_lock sbitmap: optimise sbitmap_deferred_clear() blk-mq: skip hybrid polling if iopoll doesn't spin blk-iocost: Factor out the base vrate change into a separate function blk-iocost: Factor out the active iocgs' state check into a separate function blk-iocost: Move the usage ratio calculation to the correct place blk-iocost: Remove unnecessary advance declaration blk-iocost: Fix some typos in comments blktrace: fix up a kerneldoc comment block: remove the request_queue to argument request based tracepoints ...
2020-12-09fs: make do_renameat2() take struct filenameJens Axboe
Pass in the struct filename pointers instead of the user string, and update the three callers to do the same. This behaves like do_unlinkat(), which also takes a filename struct and puts it when it is done. Converting callers is then trivial. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01block: remove i_bdevChristoph Hellwig
Switch the block device lookup interfaces to directly work with a dev_t so that struct block_device references are only acquired by the blkdev_get variants (and the blk-cgroup special case). This means that we now don't need an extra reference in the inode and can generally simplify handling of struct block_device to keep the lookups contained in the core block layer code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Coly Li <colyli@suse.de> [bcache] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01fs: remove get_super_thawed and get_super_exclusive_thawedChristoph Hellwig
Just open code the wait in the only caller of both functions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-22fs: remove compat_sys_mountChristoph Hellwig
compat_sys_mount is identical to the regular sys_mount now, so remove it and use the native version everywhere. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-08-07Merge branch 'hch.init_path' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull init and set_fs() cleanups from Al Viro: "Christoph's 'getting rid of ksys_...() uses under KERNEL_DS' series" * 'hch.init_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (50 commits) init: add an init_dup helper init: add an init_utimes helper init: add an init_stat helper init: add an init_mknod helper init: add an init_mkdir helper init: add an init_symlink helper init: add an init_link helper init: add an init_eaccess helper init: add an init_chmod helper init: add an init_chown helper init: add an init_chroot helper init: add an init_chdir helper init: add an init_rmdir helper init: add an init_unlink helper init: add an init_umount helper init: add an init_mount helper init: mark create_dev as __init init: mark console_on_rootfs as __init init: initialize ramdisk_execute_command at compile time devtmpfs: refactor devtmpfsd() ...
2020-07-31init: add an init_mknod helperChristoph Hellwig
Add a simple helper to mknod with a kernel space file name and switch the early init code over to it. Remove the now unused ksys_mknod. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31init: add an init_mkdir helperChristoph Hellwig
Add a simple helper to mkdir with a kernel space file name and switch the early init code over to it. Remove the now unused ksys_mkdir. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31init: add an init_symlink helperChristoph Hellwig
Add a simple helper to symlink with a kernel space file name and switch the early init code over to it. Remove the now unused ksys_symlink. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31init: add an init_link helperChristoph Hellwig
Add a simple helper to link with a kernel space file name and switch the early init code over to it. Remove the now unused ksys_link. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31init: add an init_chmod helperChristoph Hellwig
Add a simple helper to chmod with a kernel space file name and switch the early init code over to it. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31init: add an init_chown helperChristoph Hellwig
Add a simple helper to chown with a kernel space file name and switch the early init code over to it. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31init: add an init_umount helperChristoph Hellwig
Like ksys_umount, but takes a kernel pointer for the destination path. Switch over the umount in the init code, which just happen to work due to the implicit set_fs(KERNEL_DS) during early init right now. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31init: add an init_mount helperChristoph Hellwig
Like do_mount, but takes a kernel pointer for the destination path. Switch over the mounts in the init code and devtmpfs to it, which just happen to work due to the implicit set_fs(KERNEL_DS) during early init right now. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31fs: push the getname from do_rmdir into the callersChristoph Hellwig
This mirrors do_unlinkat and will make life a little easier for the init code to reuse the whole function with a kernel filename. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-06-24block: move block-related definitions out of fs.hChristoph Hellwig
Move most of the block related definition out of fs.h into more suitable headers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-05Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "A lot of bug fixes and cleanups for ext4, including: - Fix performance problems found in dioread_nolock now that it is the default, caused by transaction leaks. - Clean up fiemap handling in ext4 - Clean up and refactor multiple block allocator (mballoc) code - Fix a problem with mballoc with a smaller file systems running out of blocks because they couldn't properly use blocks that had been reserved by inode preallocation. - Fixed a race in ext4_sync_parent() versus rename() - Simplify the error handling in the extent manipulation code - Make sure all metadata I/O errors are felected to ext4_ext_dirty()'s and ext4_make_inode_dirty()'s callers. - Avoid passing an error pointer to brelse in ext4_xattr_set() - Fix race which could result to freeing an inode on the dirty last in data=journal mode. - Fix refcount handling if ext4_iget() fails - Fix a crash in generic/019 caused by a corrupted extent node" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (58 commits) ext4: avoid unnecessary transaction starts during writeback ext4: don't block for O_DIRECT if IOCB_NOWAIT is set ext4: remove the access_ok() check in ext4_ioctl_get_es_cache fs: remove the access_ok() check in ioctl_fiemap fs: handle FIEMAP_FLAG_SYNC in fiemap_prep fs: move fiemap range validation into the file systems instances iomap: fix the iomap_fiemap prototype fs: move the fiemap definitions out of fs.h fs: mark __generic_block_fiemap static ext4: remove the call to fiemap_check_flags in ext4_fiemap ext4: split _ext4_fiemap ext4: fix fiemap size checks for bitmap files ext4: fix EXT4_MAX_LOGICAL_BLOCK macro add comment for ext4_dir_entry_2 file_type member jbd2: avoid leaking transaction credits when unreserving handle ext4: drop ext4_journal_free_reserved() ext4: mballoc: use lock for checking free blocks while retrying ext4: mballoc: refactor ext4_mb_good_group() ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling ext4: mballoc: refactor ext4_mb_discard_preallocations() ...
2020-06-03writeback: Export inode_io_list_del()Jan Kara
Ext4 needs to remove inode from writeback lists after it is out of visibility of its journalling machinery (which can still dirty the inode). Export inode_io_list_del() for it. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200421085445.5731-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-02Merge tag 'for-5.8/io_uring-2020-06-01' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring updates from Jens Axboe: "A relatively quiet round, mostly just fixes and code improvements. In particular: - Make statx just use the generic statx handler, instead of open coding it. We don't need that anymore, as we always call it async safe (Bijan) - Enable closing of the ring itself. Also fixes O_PATH closure (me) - Properly name completion members (me) - Batch reap of dead file registrations (me) - Allow IORING_OP_POLL with double waitqueues (me) - Add tee(2) support (Pavel) - Remove double off read (Pavel) - Fix overflow cancellations (Pavel) - Improve CQ timeouts (Pavel) - Async defer drain fixes (Pavel) - Add support for enabling/disabling notifications on a registered eventfd (Stefano) - Remove dead state parameter (Xiaoguang) - Disable SQPOLL submit on dying ctx (Xiaoguang) - Various code cleanups" * tag 'for-5.8/io_uring-2020-06-01' of git://git.kernel.dk/linux-block: (29 commits) io_uring: fix overflowed reqs cancellation io_uring: off timeouts based only on completions io_uring: move timeouts flushing to a helper statx: hide interfaces no longer used by io_uring io_uring: call statx directly statx: allow system call to be invoked from io_uring io_uring: add io_statx structure io_uring: get rid of manual punting in io_close io_uring: separate DRAIN flushing into a cold path io_uring: don't re-read sqe->off in timeout_prep() io_uring: simplify io_timeout locking io_uring: fix flush req->refs underflow io_uring: don't submit sqes when ctx->refs is dying io_uring: async task poll trigger cleanup io_uring: add tee(2) support splice: export do_tee() io_uring: don't repeat valid flag list io_uring: rename io_file_put() io_uring: remove req->needs_fixed_files io_uring: cleanup io_poll_remove_one() logic ...
2020-05-26statx: hide interfaces no longer used by io_uringBijan Mottahedeh
The io_uring interfaces have been replaced by do_statx() and are no longer needed. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-26statx: allow system call to be invoked from io_uringBijan Mottahedeh
This is a prepatory patch to allow io_uring to invoke statx directly. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-14vfs: add faccessat2 syscallMiklos Szeredi
POSIX defines faccessat() as having a fourth "flags" argument, while the linux syscall doesn't have it. Glibc tries to emulate AT_EACCESS and AT_SYMLINK_NOFOLLOW, but AT_EACCESS emulation is broken. Add a new faccessat(2) syscall with the added flags argument and implement both flags. The value of AT_EACCESS is defined in glibc headers to be the same as AT_REMOVEDIR. Use this value for the kernel interface as well, together with the explanatory comment. Also add AT_EMPTY_PATH support, which is not documented by POSIX, but can be useful and is trivial to implement. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-04-02Merge branch 'work.dotdot1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pathwalk sanitizing from Al Viro: "Massive pathwalk rewrite and cleanups. Several iterations have been posted; hopefully this thing is getting readable and understandable now. Pretty much all parts of pathname resolutions are affected... The branch is identical to what has sat in -next, except for commit message in "lift all calls of step_into() out of follow_dotdot/ follow_dotdot_rcu", crediting Qian Cai for reporting the bug; only commit message changed there." * 'work.dotdot1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (69 commits) lookup_open(): don't bother with fallbacks to lookup+create atomic_open(): no need to pass struct open_flags anymore open_last_lookups(): move complete_walk() into do_open() open_last_lookups(): lift O_EXCL|O_CREAT handling into do_open() open_last_lookups(): don't abuse complete_walk() when all we want is unlazy open_last_lookups(): consolidate fsnotify_create() calls take post-lookup part of do_last() out of loop link_path_walk(): sample parent's i_uid and i_mode for the last component __nd_alloc_stack(): make it return bool reserve_stack(): switch to __nd_alloc_stack() pick_link(): take reserving space on stack into a new helper pick_link(): more straightforward handling of allocation failures fold path_to_nameidata() into its only remaining caller pick_link(): pass it struct path already with normal refcounting rules fs/namei.c: kill follow_mount() non-RCU analogue of the previous commit helper for mount rootwards traversal follow_dotdot(): be lazy about changing nd->path follow_dotdot_rcu(): be lazy about changing nd->path follow_dotdot{,_rcu}(): massage loops ...
2020-03-25block: move guard_bio_eod to bio.cChristoph Hellwig
This is bio layer functionality and not related to buffer heads. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-13LOOKUP_MOUNTPOINT: fold path_mountpointat() into path_lookupat()Al Viro
New LOOKUP flag, telling path_lookupat() to act as path_mountpointat(). IOW, traverse mounts at the final point and skip revalidation of the location where it ends up. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-01-29Merge tag 'for-5.6/io_uring-vfs-2020-01-29' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring updates from Jens Axboe: - Support for various new opcodes (fallocate, openat, close, statx, fadvise, madvise, openat2, non-vectored read/write, send/recv, and epoll_ctl) - Faster ring quiesce for fileset updates - Optimizations for overflow condition checking - Support for max-sized clamping - Support for probing what opcodes are supported - Support for io-wq backend sharing between "sibling" rings - Support for registering personalities - Lots of little fixes and improvements * tag 'for-5.6/io_uring-vfs-2020-01-29' of git://git.kernel.dk/linux-block: (64 commits) io_uring: add support for epoll_ctl(2) eventpoll: support non-blocking do_epoll_ctl() calls eventpoll: abstract out epoll_ctl() handler io_uring: fix linked command file table usage io_uring: support using a registered personality for commands io_uring: allow registering credentials io_uring: add io-wq workqueue sharing io-wq: allow grabbing existing io-wq io_uring/io-wq: don't use static creds/mm assignments io-wq: make the io_wq ref counted io_uring: fix refcounting with batched allocations at OOM io_uring: add comment for drain_next io_uring: don't attempt to copy iovec for READ/WRITE io_uring: honor IOSQE_ASYNC for linked reqs io_uring: prep req when do IOSQE_ASYNC io_uring: use labeled array init in io_op_defs io_uring: optimise sqe-to-req flags translation io_uring: remove REQ_F_IO_DRAINED io_uring: file switch work needs to get flushed on exit io_uring: hide uring_fd in ctx ...
2020-01-29Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds
Pull SCSI updates from James Bottomley: "This series is slightly unusual because it includes Arnd's compat ioctl tree here: 1c46a2cf2dbd Merge tag 'block-ioctl-cleanup-5.6' into 5.6/scsi-queue Excluding Arnd's changes, this is mostly an update of the usual drivers: megaraid_sas, mpt3sas, qla2xxx, ufs, lpfc, hisi_sas. There are a couple of core and base updates around error propagation and atomicity in the attribute container base we use for the SCSI transport classes. The rest is minor changes and updates" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (149 commits) scsi: hisi_sas: Rename hisi_sas_cq.pci_irq_mask scsi: hisi_sas: Add prints for v3 hw interrupt converge and automatic affinity scsi: hisi_sas: Modify the file permissions of trigger_dump to write only scsi: hisi_sas: Replace magic number when handle channel interrupt scsi: hisi_sas: replace spin_lock_irqsave/spin_unlock_restore with spin_lock/spin_unlock scsi: hisi_sas: use threaded irq to process CQ interrupts scsi: ufs: Use UFS device indicated maximum LU number scsi: ufs: Add max_lu_supported in struct ufs_dev_info scsi: ufs: Delete is_init_prefetch from struct ufs_hba scsi: ufs: Inline two functions into their callers scsi: ufs: Move ufshcd_get_max_pwr_mode() to ufshcd_device_params_init() scsi: ufs: Split ufshcd_probe_hba() based on its called flow scsi: ufs: Delete struct ufs_dev_desc scsi: ufs: Fix ufshcd_probe_hba() reture value in case ufshcd_scsi_add_wlus() fails scsi: ufs-mediatek: enable low-power mode for hibern8 state scsi: ufs: export some functions for vendor usage scsi: ufs-mediatek: add dbg_register_dump implementation scsi: qla2xxx: Fix a NULL pointer dereference in an error path scsi: qla1280: Make checking for 64bit support consistent scsi: megaraid_sas: Update driver version to 07.713.01.00-rc1 ...
2020-01-20fs: make two stat prep helpers availableJens Axboe
To implement an async stat, we need to provide the flags mapping and the statx user copy. Make them available internally, through fs/internal.h. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20fs: make build_open_flags() available internallyJens Axboe
This is a prep patch for supporting non-blocking open from io_uring. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-09fs: move guard_bio_eod() after bio_set_op_attrsMing Lei
Commit 85a8ce62c2ea ("block: add bio_truncate to fix guard_bio_eod") adds bio_truncate() for handling bio EOD. However, bio_truncate() doesn't use the passed 'op' parameter from guard_bio_eod's callers. So bio_trunacate() may retrieve wrong 'op', and zering pages may not be done for READ bio. Fixes this issue by moving guard_bio_eod() after bio_set_op_attrs() in submit_bh_wbc() so that bio_truncate() can always retrieve correct op info. Meantime remove the 'op' parameter from guard_bio_eod() because it isn't used any more. Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: linux-fsdevel@vger.kernel.org Fixes: 85a8ce62c2ea ("block: add bio_truncate to fix guard_bio_eod") Signed-off-by: Ming Lei <ming.lei@redhat.com> Fold in kerneldoc and bio_op() change. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-03compat_ioctl: simplify the implementationArnd Bergmann
Now that both native and compat ioctl syscalls are in the same file, a couple of simplifications can be made, bringing the implementation closer together: - do_vfs_ioctl(), ioctl_preallocate(), and compat_ioctl_preallocate() can become static, allowing the compiler to optimize better - slightly update the coding style for consistency between the functions. - rather than listing each command in two switch statements for the compat case, just call a single function that has all the common commands. As a side-effect, FS_IOC_RESVSP/FS_IOC_RESVSP64 are now available to x86 compat tasks, along with FS_IOC_RESVSP_32/FS_IOC_RESVSP64_32. This is harmless for i386 emulation, and can be considered a bugfix for x32 emulation, which never supported these in the past. Reviewed-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-10-25make __d_alloc() staticAl Viro
no users outside of fs/dcache.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-20Merge branch 'work.dcache2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull dcache and mountpoint updates from Al Viro: "Saner handling of refcounts to mountpoints. Transfer the counting reference from struct mount ->mnt_mountpoint over to struct mountpoint ->m_dentry. That allows us to get rid of the convoluted games with ordering of mount shutdowns. The cost is in teaching shrink_dcache_{parent,for_umount} to cope with mixed-filesystem shrink lists, which we'll also need for the Slab Movable Objects patchset" * 'work.dcache2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: switch the remnants of releasing the mountpoint away from fs_pin get rid of detach_mnt() make struct mountpoint bear the dentry reference to mountpoint, not struct mount Teach shrink_dcache_parent() to cope with mixed-filesystem shrink lists fs/namespace.c: shift put_mountpoint() to callers of unhash_mnt() __detach_mounts(): lookup_mountpoint() can't return ERR_PTR() anymore nfs: dget_parent() never returns NULL ceph: don't open-code the check for dead lockref
2019-07-19Merge tag 'iomap-5.3-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull iomap split/cleanup from Darrick Wong: "As promised, here's the second part of the iomap merge for 5.3, in which we break up iomap.c into smaller files grouped by functional area so that it'll be easier in the long run to maintain cohesiveness of code units and to review incoming patches. There are no functional changes and fs/iomap.c split cleanly. Summary: - Regroup the fs/iomap.c code by major functional area so that we can start development for 5.4 from a more stable base" * tag 'iomap-5.3-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: move internal declarations into fs/iomap/ iomap: move the main iteration code into a separate file iomap: move the buffered IO code into a separate file iomap: move the direct IO code into a separate file iomap: move the SEEK_HOLE code into a separate file iomap: move the file mapping reporting code into a separate file iomap: move the swapfile code into a separate file iomap: start moving code to fs/iomap/
2019-07-19Merge branch 'work.mount0' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs mount updates from Al Viro: "The first part of mount updates. Convert filesystems to use the new mount API" * 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) mnt_init(): call shmem_init() unconditionally constify ksys_mount() string arguments don't bother with registering rootfs init_rootfs(): don't bother with init_ramfs_fs() vfs: Convert smackfs to use the new mount API vfs: Convert selinuxfs to use the new mount API vfs: Convert securityfs to use the new mount API vfs: Convert apparmorfs to use the new mount API vfs: Convert openpromfs to use the new mount API vfs: Convert xenfs to use the new mount API vfs: Convert gadgetfs to use the new mount API vfs: Convert oprofilefs to use the new mount API vfs: Convert ibmasmfs to use the new mount API vfs: Convert qib_fs/ipathfs to use the new mount API vfs: Convert efivarfs to use the new mount API vfs: Convert configfs to use the new mount API vfs: Convert binfmt_misc to use the new mount API convenience helper: get_tree_single() convenience helper get_tree_nodev() vfs: Kill sget_userns() ...
2019-07-17iomap: move internal declarations into fs/iomap/Darrick J. Wong
Move internal function declarations out of fs/internal.h into include/linux/iomap.h so that our transition is complete. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-07-10Teach shrink_dcache_parent() to cope with mixed-filesystem shrink listsAl Viro
Currently, running into a shrink list that contains dentries from different filesystems can cause several unpleasant things for shrink_dcache_parent() and for umount(2). The first problem is that there's a window during shrink_dentry_list() between __dentry_kill() takes a victim out and dropping reference to its parent. During that window the parent looks like a genuine busy dentry. shrink_dcache_parent() (or, worse yet, shrink_dcache_for_umount()) coming at that time will see no eviction candidates and no indication that it needs to wait for some shrink_dentry_list() to proceed further. That applies for any shrink list that might intersect with the subtree we are trying to shrink; the only reason it does not blow on umount(2) in the mainline is that we unregister the memory shrinker before hitting shrink_dcache_for_umount(). Another problem happens if something in a mixed-filesystem shrink list gets be stuck in e.g. iput(), getting umount of unrelated fs to spin waiting for the stuck shrinker to get around to our dentries. Solution: 1) have shrink_dentry_list() decrement the parent's refcount and make sure it's on a shrink list (ours unless it already had been on some other) before calling __dentry_kill(). That eliminates the window when shrink_dcache_parent() would've blown past the entire subtree without noticing anything with zero refcount not on shrink lists. 2) when shrink_dcache_parent() has found no eviction candidates, but some dentries are still sitting on shrink lists, rather than repeating the scan in hope that shrinkers have progressed, scan looking for something on shrink lists with zero refcount. If such a thing is found, grab rcu_read_lock() and stop the scan, with caller locking it for eviction, dropping out of RCU and doing __dentry_kill(), with the same treatment for parent as shrink_dentry_list() would do. Note that right now mixed-filesystem shrink lists do not occur, so this is not a mainline bug. Howevere, there's a bunch of uses for such beasts (e.g. the "try and evict everything we can out of given page" patches; there are potential uses in mount-related code, considerably simplifying the life in fs/namespace.c, etc.) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-06-27fs: fold __generic_write_end back into generic_write_endChristoph Hellwig
This effectively reverts a6d639da63ae ("fs: factor out a __generic_write_end helper") as we now open code what is left of that helper in iomap. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-05-30treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152Thomas Gleixner
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 3029 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-25switch mount_capable() to fs_contextAl Viro
now both callers of mount_capable() have access to fs_context; the only difference is that for sget_fc() we have the possibility of fc->global being true, while for legacy_get_tree() it's guaranteed to be impossible. Unify to more generic variant... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-25move the capability checks from sget_userns() to legacy_get_tree()Al Viro
1) all call chains leading to sget_userns() pass through ->mount() instances. 2) none of ->mount() instances is ever called directly - the only call site is legacy_get_tree() 3) all remaining ->mount() instances end up calling sget_userns() IOW, we might as well do the capability checks just before calling ->mount(). As for the arguments passed to mount_capable(), in case of call chains to sget_userns() going through sget(), we either don't call mount_capable() at all, or pass current_user_ns() to it. The call chains going through mount_pseudo_xattr() don't call mount_capable() at all (SB_KERNMOUNT in flags on those). That could've been split into smaller steps (lifting the checks into sget(), then callers of sget(), then all the way to the entries of every ->mount() out there, then to the sole caller), but that would be too much churn for little benefit... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-21unexport simple_dname()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-07Merge branch 'work.mount-syscalls' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull mount ABI updates from Al Viro: "The syscalls themselves, finally. That's not all there is to that stuff, but switching individual filesystems to new methods is fortunately independent from everything else, so e.g. NFS series can go through NFS tree, etc. As those conversions get done, we'll be finally able to get rid of a bunch of duplication in fs/super.c introduced in the beginning of the entire thing. I expect that to be finished in the next window..." * 'work.mount-syscalls' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: vfs: Add a sample program for the new mount API vfs: syscall: Add fspick() to select a superblock for reconfiguration vfs: syscall: Add fsmount() to create a mount for a superblock vfs: syscall: Add fsconfig() for configuring and managing a context vfs: Implement logging through fs_context vfs: syscall: Add fsopen() to prepare for superblock creation Make anon_inodes unconditional teach move_mount(2) to work with OPEN_TREE_CLONE vfs: syscall: Add move_mount(2) to move mounts around vfs: syscall: Add open_tree(2) to reference or clone a mount
2019-05-07Merge branch 'work.dcache' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc dcache updates from Al Viro: "Most of this pile is putting name length into struct name_snapshot and making use of it. The beginning of this series ("ovl_lookup_real_one(): don't bother with strlen()") ought to have been split in two (separate switch of name_snapshot to struct qstr from overlayfs reaping the trivial benefits of that), but I wanted to avoid a rebase - by the time I'd spotted that it was (a) in -next and (b) close to 5.1-final ;-/" * 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: audit_compare_dname_path(): switch to const struct qstr * audit_update_watch(): switch to const struct qstr * inotify_handle_event(): don't bother with strlen() fsnotify: switch send_to_group() and ->handle_event to const struct qstr * fsnotify(): switch to passing const struct qstr * for file_name switch fsnotify_move() to passing const struct qstr * for old_name ovl_lookup_real_one(): don't bother with strlen() sysv: bury the broken "quietly truncate the long filenames" logics nsfs: unobfuscate unexport d_alloc_pseudo()
2019-05-07Merge tag 'iomap-5.2-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull iomap updates from Darrick Wong: "Nothing particularly exciting here, just adding some callouts for gfs2 and cleaning a few things. Summary: - Add some extra hooks to the iomap buffered write path to enable gfs2 journalled writes - SPDX conversion - Various refactoring" * tag 'iomap-5.2-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: move iomap_read_inline_data around iomap: Add a page_prepare callback iomap: Fix use-after-free error in page_done callback fs: Turn __generic_write_end into a void function iomap: Clean up __generic_write_end calling iomap: convert to SPDX identifier