summaryrefslogtreecommitdiff
path: root/fs/proc/proc_sysctl.c
AgeCommit message (Collapse)Author
2017-01-10sysctl: Drop reference added by grab_header in proc_sys_readdirZhou Chengming
Fixes CVE-2016-9191, proc_sys_readdir doesn't drop reference added by grab_header when return from !dir_emit_dots path. It can cause any path called unregister_sysctl_table will wait forever. The calltrace of CVE-2016-9191: [ 5535.960522] Call Trace: [ 5535.963265] [<ffffffff817cdaaf>] schedule+0x3f/0xa0 [ 5535.968817] [<ffffffff817d33fb>] schedule_timeout+0x3db/0x6f0 [ 5535.975346] [<ffffffff817cf055>] ? wait_for_completion+0x45/0x130 [ 5535.982256] [<ffffffff817cf0d3>] wait_for_completion+0xc3/0x130 [ 5535.988972] [<ffffffff810d1fd0>] ? wake_up_q+0x80/0x80 [ 5535.994804] [<ffffffff8130de64>] drop_sysctl_table+0xc4/0xe0 [ 5536.001227] [<ffffffff8130de17>] drop_sysctl_table+0x77/0xe0 [ 5536.007648] [<ffffffff8130decd>] unregister_sysctl_table+0x4d/0xa0 [ 5536.014654] [<ffffffff8130deff>] unregister_sysctl_table+0x7f/0xa0 [ 5536.021657] [<ffffffff810f57f5>] unregister_sched_domain_sysctl+0x15/0x40 [ 5536.029344] [<ffffffff810d7704>] partition_sched_domains+0x44/0x450 [ 5536.036447] [<ffffffff817d0761>] ? __mutex_unlock_slowpath+0x111/0x1f0 [ 5536.043844] [<ffffffff81167684>] rebuild_sched_domains_locked+0x64/0xb0 [ 5536.051336] [<ffffffff8116789d>] update_flag+0x11d/0x210 [ 5536.057373] [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450 [ 5536.064186] [<ffffffff81167acb>] ? cpuset_css_offline+0x1b/0x60 [ 5536.070899] [<ffffffff810fce3d>] ? trace_hardirqs_on+0xd/0x10 [ 5536.077420] [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450 [ 5536.084234] [<ffffffff8115a9f5>] ? css_killed_work_fn+0x25/0x220 [ 5536.091049] [<ffffffff81167ae5>] cpuset_css_offline+0x35/0x60 [ 5536.097571] [<ffffffff8115aa2c>] css_killed_work_fn+0x5c/0x220 [ 5536.104207] [<ffffffff810bc83f>] process_one_work+0x1df/0x710 [ 5536.110736] [<ffffffff810bc7c0>] ? process_one_work+0x160/0x710 [ 5536.117461] [<ffffffff810bce9b>] worker_thread+0x12b/0x4a0 [ 5536.123697] [<ffffffff810bcd70>] ? process_one_work+0x710/0x710 [ 5536.130426] [<ffffffff810c3f7e>] kthread+0xfe/0x120 [ 5536.135991] [<ffffffff817d4baf>] ret_from_fork+0x1f/0x40 [ 5536.142041] [<ffffffff810c3e80>] ? kthread_create_on_node+0x230/0x230 One cgroup maintainer mentioned that "cgroup is trying to offline a cpuset css, which takes place under cgroup_mutex. The offlining ends up trying to drain active usages of a sysctl table which apprently is not happening." The real reason is that proc_sys_readdir doesn't drop reference added by grab_header when return from !dir_emit_dots path. So this cpuset offline path will wait here forever. See here for details: http://www.openwall.com/lists/oss-security/2016/11/04/13 Fixes: f0c3b5093add ("[readdir] convert procfs") Cc: stable@vger.kernel.org Reported-by: CAI Qian <caiqian@redhat.com> Tested-by: Yang Shukui <yangshukui@huawei.com> Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com> Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2016-10-10Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more vfs updates from Al Viro: ">rename2() work from Miklos + current_time() from Deepa" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: Replace current_fs_time() with current_time() fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps fs: Replace CURRENT_TIME with current_time() for inode timestamps fs: proc: Delete inode time initializations in proc_alloc_inode() vfs: Add current_time() api vfs: add note about i_op->rename changes to porting fs: rename "rename2" i_op to "rename" vfs: remove unused i_op->rename fs: make remaining filesystems use .rename2 libfs: support RENAME_NOREPLACE in simple_rename() fs: support RENAME_NOREPLACE for local filesystems ncpfs: fix unused variable warning
2016-10-10Merge branch 'work.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "Assorted misc bits and pieces. There are several single-topic branches left after this (rename2 series from Miklos, current_time series from Deepa Dinamani, xattr series from Andreas, uaccess stuff from from me) and I'd prefer to send those separately" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits) proc: switch auxv to use of __mem_open() hpfs: support FIEMAP cifs: get rid of unused arguments of CIFSSMBWrite() posix_acl: uapi header split posix_acl: xattr representation cleanups fs/aio.c: eliminate redundant loads in put_aio_ring_file fs/internal.h: add const to ns_dentry_operations declaration compat: remove compat_printk() fs/buffer.c: make __getblk_slow() static proc: unsigned file descriptors fs/file: more unsigned file descriptors fs: compat: remove redundant check of nr_segs cachefiles: Fix attempt to read i_blocks after deleting file [ver #2] cifs: don't use memcpy() to copy struct iov_iter get rid of separate multipage fault-in primitives fs: Avoid premature clearing of capabilities fs: Give dentry to inode_change_ok() instead of inode fuse: Propagate dentry down to inode_change_ok() ceph: Propagate dentry down to inode_change_ok() xfs: Propagate dentry down to inode_change_ok() ...
2016-10-06Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace updates from Eric Biederman: "This set of changes is a number of smaller things that have been overlooked in other development cycles focused on more fundamental change. The devpts changes are small things that were a distraction until we managed to kill off DEVPTS_MULTPLE_INSTANCES. There is an trivial regression fix to autofs for the unprivileged mount changes that went in last cycle. A pair of ioctls has been added by Andrey Vagin making it is possible to discover the relationships between namespaces when referring to them through file descriptors. The big user visible change is starting to add simple resource limits to catch programs that misbehave. With namespaces in general and user namespaces in particular allowing users to use more kinds of resources, it has become important to have something to limit errant programs. Because the purpose of these limits is to catch errant programs the code needs to be inexpensive to use as it always on, and the default limits need to be high enough that well behaved programs on well behaved systems don't encounter them. To this end, after some review I have implemented per user per user namespace limits, and use them to limit the number of namespaces. The limits being per user mean that one user can not exhause the limits of another user. The limits being per user namespace allow contexts where the limit is 0 and security conscious folks can remove from their threat anlysis the code used to manage namespaces (as they have historically done as it root only). At the same time the limits being per user namespace allow other parts of the system to use namespaces. Namespaces are increasingly being used in application sand boxing scenarios so an all or nothing disable for the entire system for the security conscious folks makes increasing use of these sandboxes impossible. There is also added a limit on the maximum number of mounts present in a single mount namespace. It is nontrivial to guess what a reasonable system wide limit on the number of mount structure in the kernel would be, especially as it various based on how a system is using containers. A limit on the number of mounts in a mount namespace however is much easier to understand and set. In most cases in practice only about 1000 mounts are used. Given that some autofs scenarious have the potential to be 30,000 to 50,000 mounts I have set the default limit for the number of mounts at 100,000 which is well above every known set of users but low enough that the mount hash tables don't degrade unreaonsably. These limits are a start. I expect this estabilishes a pattern that other limits for resources that namespaces use will follow. There has been interest in making inotify event limits per user per user namespace as well as interest expressed in making details about what is going on in the kernel more visible" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (28 commits) autofs: Fix automounts by using current_real_cred()->uid mnt: Add a per mount namespace limit on the number of mounts netns: move {inc,dec}_net_namespaces into #ifdef nsfs: Simplify __ns_get_path tools/testing: add a test to check nsfs ioctl-s nsfs: add ioctl to get a parent namespace nsfs: add ioctl to get an owning user namespace for ns file descriptor kernel: add a helper to get an owning user namespace for a namespace devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts devpts: Remove sync_filesystems devpts: Make devpts_kill_sb safe if fsi is NULL devpts: Simplify devpts_mount by using mount_nodev devpts: Move the creation of /dev/pts/ptmx into fill_super devpts: Move parse_mount_options into fill_super userns: When the per user per user namespace limit is reached return ENOSPC userns; Document per user per user namespace limits. mntns: Add a limit on the number of mount namespaces. netns: Add a limit on the number of net namespaces cgroupns: Add a limit on the number of cgroup namespaces ipcns: Add a limit on the number of ipc namespaces ...
2016-09-27fs: Replace CURRENT_TIME with current_time() for inode timestampsDeepa Dinamani
CURRENT_TIME macro is not appropriate for filesystems as it doesn't use the right granularity for filesystem timestamps. Use current_time() instead. CURRENT_TIME is also not y2038 safe. This is also in preparation for the patch that transitions vfs timestamps to use 64 bit time and hence make them y2038 safe. As part of the effort current_time() will be extended to do range checks. Hence, it is necessary for all file system timestamps to use current_time(). Also, current_time() will be transitioned along with vfs to be y2038 safe. Note that whenever a single call to current_time() is used to change timestamps in different inodes, it is because they share the same time granularity. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Felipe Balbi <balbi@kernel.org> Acked-by: Steven Whitehouse <swhiteho@redhat.com> Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Acked-by: David Sterba <dsterba@suse.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-22fs: Give dentry to inode_change_ok() instead of inodeJan Kara
inode_change_ok() will be resposible for clearing capabilities and IMA extended attributes and as such will need dentry. Give it as an argument to inode_change_ok() instead of an inode. Also rename inode_change_ok() to setattr_prepare() to better relect that it does also some modifications in addition to checks. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2016-08-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Minor overlapping changes for both merge conflicts. Resolution work done by Stephen Rothwell was used as a reference. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-14net: make net namespace sysctls belong to container's ownerDmitry Torokhov
If net namespace is attached to a user namespace let's make container's root owner of sysctls affecting said network namespace instead of global root. This also allows us to clean up net_ctl_permissions() because we do not need to fudge permissions anymore for the container's owner since it now owns the objects in question. Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-08sysctl: Stop implicitly passing current into sysctl_table_root.lookupEric W. Biederman
Passing nsproxy into sysctl_table_root.lookup was a premature optimization in attempt to avoid depending on current. The directory /proc/self/sys has not appeared and if and when it does this code will need to be reviewed closely and reworked anyway. So remove the premature optimization. Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-08-07Merge branch 'for-linus-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more vfs updates from Al Viro: "Assorted cleanups and fixes. In the "trivial API change" department - ->d_compare() losing 'parent' argument" * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: cachefiles: Fix race between inactivating and culling a cache object 9p: use clone_fid() 9p: fix braino introduced in "9p: new helper - v9fs_parent_fid()" vfs: make dentry_needs_remove_privs() internal vfs: remove file_needs_remove_privs() vfs: fix deadlock in file_remove_privs() on overlayfs get rid of 'parent' argument of ->d_compare() cifs, msdos, vfat, hfs+: don't bother with parent in ->d_compare() affs ->d_compare(): don't bother with ->d_inode fold _d_rehash() and __d_rehash() together fold dentry_rcuwalk_invalidate() into its only remaining caller
2016-08-06Merge branch 'work.const-qstr' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull qstr constification updates from Al Viro: "Fairly self-contained bunch - surprising lot of places passes struct qstr * as an argument when const struct qstr * would suffice; it complicates analysis for no good reason. I'd prefer to feed that separately from the assorted fixes (those are in #for-linus and with somewhat trickier topology)" * 'work.const-qstr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: qstr: constify instances in adfs qstr: constify instances in lustre qstr: constify instances in f2fs qstr: constify instances in ext2 qstr: constify instances in vfat qstr: constify instances in procfs qstr: constify instances in fuse qstr constify instances in fs/dcache.c qstr: constify instances in nfs qstr: constify instances in ocfs2 qstr: constify instances in autofs4 qstr: constify instances in hfs qstr: constify instances in hfsplus qstr: constify instances in logfs qstr: constify dentry_init_security
2016-07-31get rid of 'parent' argument of ->d_compare()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-07-30qstr: constify instances in procfsAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-06-10vfs: make the string hashes salt the hashLinus Torvalds
We always mixed in the parent pointer into the dentry name hash, but we did it late at lookup time. It turns out that we can simplify that lookup-time action by salting the hash with the parent pointer early instead of late. A few other users of our string hashes also wanted to mix in their own pointers into the hash, and those are updated to use the same mechanism. Hash users that don't have any particular initial salt can just use the NULL pointer as a no-salt. Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: George Spelvin <linux@sciencehorizons.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-02switch all procfs directories ->iterate_shared()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02proc_sys_fill_cache(): switch to d_alloc_parallel()Al Viro
make it usable with directory locked shared Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-09-29fs: Drop unlikely before IS_ERR(_OR_NULL)Viresh Kumar
IS_ERR(_OR_NULL) already contain an 'unlikely' compiler flag and there is no need to do that again from its callers. Drop it. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Reviewed-by: Jeff Layton <jlayton@poochiereds.net> Reviewed-by: David Howells <dhowells@redhat.com> Reviewed-by: Steve French <smfrench@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-07-01sysctl: Allow creating permanently empty directories that serve as mountpoints.Eric W. Biederman
Add a magic sysctl table sysctl_mount_point that when used to create a directory forces that directory to be permanently empty. Update the code to use make_empty_dir_inode when accessing permanently empty directories. Update the code to not allow adding to permanently empty directories. Update /proc/sys/fs/binfmt_misc to be a permanently empty directory. Cc: stable@vger.kernel.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-04-15VFS: normal filesystems (and lustre): d_inode() annotationsDavid Howells
that's the bulk of filesystem drivers dealing with inodes of their own Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-08sysctl: remove typedef ctl_tableJoe Perches
Remove the final user, and the typedef itself. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-29Don't pass inode to ->d_hash() and ->d_compare()Linus Torvalds
Instances either don't look at it at all (the majority of cases) or only want it to find the superblock (which can be had as dentry->d_sb). A few cases that want more are actually safe with dentry->d_inode - the only precaution needed is the check that it hadn't been replaced with NULL by rmdir() or by overwriting rename(), which case should be simply treated as cache miss. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29[readdir] convert procfsAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-27fs/proc: clean up printksAndrew Morton
- use pr_foo() throughout - remove a couple of duplicated KERN_WARNINGs, via WARN(KERN_WARNING "...") - nuke a few warnings which I've never seen happen, ever. Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-22new helper: file_inode(file)Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20procfs: drop vmtruncateMarco Stornelli
Removed vmtruncate Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-11-18sysctl: Pass useful parameters to sysctl permissionsEric W. Biederman
- Current is implicitly avaiable so passing current->nsproxy isn't useful. - The ctl_table_header is needed to find how the sysctl table is connected to the rest of sysctl. - ctl_table_root is avaiable in the ctl_table_header so no need to it. With these changes it becomes possible to write a version of net_sysctl_permission that takes into account the network namespace of the sysctl table, an important feature in extending the user namespace. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-09rbtree: fix incorrect rbtree node insertion in fs/proc/proc_sysctl.cMichel Lespinasse
The recently added code to use rbtrees in sysctl did not follow the proper rbtree interface on insertion - it was calling rb_link_node() which inserts a new node into the binary tree, but missed the call to rb_insert_color() which properly balances the rbtree and establishes all expected rbtree invariants. I found out about this only because faulty commit also used rb_init_node(), which I am removing within this patchset. But I think it's an easy mistake to make, and it makes me wonder if we should change the rbtree API so that insertions would be done with a single rb_insert() call (even if its implementation could still inline the rb_link_node() part and call a private __rb_insert_color function to do the rebalancing). Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Santos <daniel.santos@pobox.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09rbtree: empty nodes have no colorMichel Lespinasse
Empty nodes have no color. We can make use of this property to simplify the code emitted by the RB_EMPTY_NODE and RB_CLEAR_NODE macros. Also, we can get rid of the rb_init_node function which had been introduced by commit 88d19cf37952 ("timers: Add rb_init_node() to allow for stack allocated rb nodes") to avoid some issue with the empty node's color not being initialized. I'm not sure what the RB_EMPTY_NODE checks in rb_prev() / rb_next() are doing there, though. axboe introduced them in commit 10fd48f2376d ("rbtree: fixed reversed RB_EMPTY_NODE and rb_next/prev"). The way I see it, the 'empty node' abstraction is only used by rbtree users to flag nodes that they haven't inserted in any rbtree, so asking the predecessor or successor of such nodes doesn't make any sense. One final rb_init_node() caller was recently added in sysctl code to implement faster sysctl name lookups. This code doesn't make use of RB_EMPTY_NODE at all, and from what I could see it only called rb_init_node() under the mistaken assumption that such initialization was required before node insertion. [sfr@canb.auug.org.au: fix net/ceph/osd_client.c build] Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Santos <daniel.santos@pobox.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: John Stultz <john.stultz@linaro.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06proc_sysctl.c: use BUG_ON instead of BUGPrasad Joshi
The use of if (!head) BUG(); can be replaced with the single line BUG_ON(!head). Signed-off-by: Prasad Joshi <prasadjoshi.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-17fs/proc: fix potential unregister_sysctl_table hangFrancesco Ruggeri
The unregister_sysctl_table() function hangs if all references to its ctl_table_header structure are not dropped. This can happen sometimes because of a leak in proc_sys_lookup(): proc_sys_lookup() gets a reference to the table via lookup_entry(), but it does not release it when a subsequent call to sysctl_follow_link() fails. This patch fixes this leak by making sure the reference is always dropped on return. See also commit 076c3eed2c31 ("sysctl: Rewrite proc_sys_lookup introducing find_entry and lookup_entry") which reorganized this code in 3.4. Tested in Linux 3.4.4. Signed-off-by: Francesco Ruggeri <fruggeri@aristanetworks.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-14stop passing nameidata to ->lookup()Al Viro
Just the flags; only NFS cares even about that, but there are legitimate uses for such argument. And getting rid of that completely would require splitting ->lookup() into a couple of methods (at least), so let's leave that alone for now... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14stop passing nameidata * to ->d_revalidate()Al Viro
Just the lookup flags. Die, bastard, die... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-05-15userns: Convert sysctl permission checks to use kuid and kgids.Eric W. Biederman
Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-03-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctlLinus Torvalds
Pull sysctl updates from Eric Biederman: - Rewrite of sysctl for speed and clarity. Insert/remove/Lookup in sysctl are all now O(NlogN) operations, and are no longer bottlenecks in the process of adding and removing network devices. sysctl is now focused on being a filesystem instead of system call and the code can all be found in fs/proc/proc_sysctl.c. Hopefully this means the code is now approachable. Much thanks is owed to Lucian Grinjincu for keeping at this until something was found that was usable. - The recent proc_sys_poll oops found by the fuzzer during hibernation is fixed. * git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl: (36 commits) sysctl: protect poll() in entries that may go away sysctl: Don't call sysctl_follow_link unless we are a link. sysctl: Comments to make the code clearer. sysctl: Correct error return from get_subdir sysctl: An easier to read version of find_subdir sysctl: fix memset parameters in setup_sysctl_set() sysctl: remove an unused variable sysctl: Add register_sysctl for normal sysctl users sysctl: Index sysctl directories with rbtrees. sysctl: Make the header lists per directory. sysctl: Move sysctl_check_dups into insert_header sysctl: Modify __register_sysctl_paths to take a set instead of a root and an nsproxy sysctl: Replace root_list with links between sysctl_table_sets. sysctl: Add sysctl_print_dir and use it in get_subdir sysctl: Stop requiring explicit management of sysctl directories sysctl: Add a root pointer to ctl_table_set sysctl: Rewrite proc_sys_readdir in terms of first_entry and next_entry sysctl: Rewrite proc_sys_lookup introducing find_entry and lookup_entry. sysctl: Normalize the root_table data structure. sysctl: Factor out insert_header and erase_header ...
2012-03-22sysctl: protect poll() in entries that may go awayLucas De Marchi
Protect code accessing ctl_table by grabbing the header with grab_header() and after releasing with sysctl_head_finish(). This is needed if poll() is called in entries created by modules: currently only hostname and domainname support poll(), but this bug may be triggered when/if modules use it and if user called poll() in a file that doesn't support it. Dave Jones reported the following when using a syscall fuzzer while hibernating/resuming: RIP: 0010:[<ffffffff81233e3e>] [<ffffffff81233e3e>] proc_sys_poll+0x4e/0x90 RAX: 0000000000000145 RBX: ffff88020cab6940 RCX: 0000000000000000 RDX: ffffffff81233df0 RSI: 6b6b6b6b6b6b6b6b RDI: ffff88020cab6940 [ ... ] Code: 00 48 89 fb 48 89 f1 48 8b 40 30 4c 8b 60 e8 b8 45 01 00 00 49 83 7c 24 28 00 74 2e 49 8b 74 24 30 48 85 f6 74 24 48 85 c9 75 32 <8b> 16 b8 45 01 00 00 48 63 d2 49 39 d5 74 10 8b 06 48 98 48 89 If an entry goes away while we are polling() it, ctl_table may not exist anymore. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-02-14security: trim security.hAl Viro
Trim security.h Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>
2012-02-01sysctl: Don't call sysctl_follow_link unless we are a link.Eric W. Biederman
There are no functional changes. Just code motion to make it clear that we don't follow a link between sysctl roots unless the directory entry actually is a link. Suggested-by: Lucian Adrian Grijincu <lucian.grijincu@gmail.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-02-01sysctl: Comments to make the code clearer.Eric W. Biederman
Document get_subdir and that find_subdir alwasy takes a reference. Suggested-by: Lucian Adrian Grijincu <lucian.grijincu@gmail.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-02-01sysctl: Correct error return from get_subdirEric W. Biederman
When insert_header fails ensure we return the proper error value from get_subdir. In practice nothing cares, but there is no need to be sloppy. Reported-by: Lucian Adrian Grijincu <lucian.grijincu@gmail.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-02-01sysctl: An easier to read version of find_subdirEric W. Biederman
Suggested-by: Lucian Adrian Grijincu <lucian.grijincu@gmail.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-30sysctl: fix memset parameters in setup_sysctl_set()Dan Carpenter
The current code is a nop. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-30sysctl: remove an unused variableDan Carpenter
"links" is never used, so we can remove it. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24sysctl: Add register_sysctl for normal sysctl usersEric W. Biederman
The plan is to convert all callers of register_sysctl_table and register_sysctl_paths to register_sysctl. The interface to register_sysctl is enough nicer this should make the callers a bit more readable. Additionally after the conversion the 230 lines of backwards compatibility can be removed. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24sysctl: Index sysctl directories with rbtrees.Eric W. Biederman
One of the most important jobs of sysctl is to export network stack tunables. Several of those tunables are per network device. In several instances people are running with 1000+ network devices in there network stacks, which makes the simple per directory linked list in sysctl a scaling bottleneck. Replace O(N^2) sysctl insertion and lookup times with O(NlogN) by using an rbtree to index the sysctl directories. Benchmark before: make-dummies 0 999 -> 0.32s rmmod dummy -> 0.12s make-dummies 0 9999 -> 1m17s rmmod dummy -> 17s Benchmark after: make-dummies 0 999 -> 0.074s rmmod dummy -> 0.070s make-dummies 0 9999 -> 3.4s rmmod dummy -> 0.44s Benchmark after (without dev_snmp6): make-dummies 0 9999 -> 0.75s rmmod dummy -> 0.44s make-dummies 0 99999 -> 11s rmmod dummy -> 4.3s At 10,000 dummy devices the bottleneck becomes the time to add and remove the files under /proc/sys/net/dev_snmp6. I have commented out the code that adds and removes files under /proc/sys/net/dev_snmp6 and taken measurments of creating and destroying 100,000 dummies to verify the sysctl continues to scale. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24sysctl: Make the header lists per directory.Eric W. Biederman
Slightly enhance efficiency and clarity of the code by making the header list per directory instead of per set. Benchmark before: make-dummies 0 999 -> 0.63s rmmod dummy -> 0.12s make-dummies 0 9999 -> 2m35s rmmod dummy -> 18s Benchmark after: make-dummies 0 999 -> 0.32s rmmod dummy -> 0.12s make-dummies 0 9999 -> 1m17s rmmod dummy -> 17s Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24sysctl: Move sysctl_check_dups into insert_headerEric W. Biederman
Simplify the callers of insert_header by removing explicit calls to check for duplicates and instead have insert_header do the work. This makes the code slightly more maintainable by enabling changes to data structures where the insertion of new entries without duplicate suppression is not possible. There is not always a convenient path string where insert_header is called so modify sysctl_check_dups to use sysctl_print_dir when printing the full path when a duplicate is discovered. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24sysctl: Modify __register_sysctl_paths to take a set instead of a root and ↵Eric W. Biederman
an nsproxy An nsproxy argument here has always been awkard and now the nsproxy argument is completely unnecessary so remove it, replacing it with the set we want the registered tables to show up in. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24sysctl: Replace root_list with links between sysctl_table_sets.Eric W. Biederman
Piecing together directories by looking first in one directory tree, than in another directory tree and finally in a third directory tree makes it hard to verify that some directory entries are not multiply defined and makes it hard to create efficient implementations the sysctl filesystem. Replace the sysctl wide list of roots with autogenerated links from the core sysctl directory tree to the other sysctl directory trees. This simplifies sysctl directory reading and lookups as now only entries in a single sysctl directory tree need to be considered. Benchmark before: make-dummies 0 999 -> 0.44s rmmod dummy -> 0.065s make-dummies 0 9999 -> 1m36s rmmod dummy -> 0.4s Benchmark after: make-dummies 0 999 -> 0.63s rmmod dummy -> 0.12s make-dummies 0 9999 -> 2m35s rmmod dummy -> 18s The slowdown is caused by the lookups used in insert_headers and put_links to see if we need to add links or remove links. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24sysctl: Add sysctl_print_dir and use it in get_subdirEric W. Biederman
When there are errors it is very nice to know the full sysctl path. Add a simple function that computes the sysctl path and prints it out. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24sysctl: Stop requiring explicit management of sysctl directoriesEric W. Biederman
Simplify the code and the sysctl semantics by autogenerating sysctl directories when a sysctl table is registered that needs the directories and autodeleting the directories when there are no more sysctl tables registered that need them. Autogenerating directories keeps sysctl tables from depending on each other, removing all of the arcane register/unregister ordering constraints and makes it impossible to get the order wrong when reigsering and unregistering sysctl tables. Autogenerating directories yields one unique entity that dentries can point to, retaining the current effective use of the dcache. Add struct ctl_dir as the type of these new autogenerated directories. The attached_by and attached_to fields in ctl_table_header are removed as they are no longer needed. The child field in ctl_table is no longer needed by the core of the sysctl code. ctl_table.child can be removed once all of the existing users have been updated. Benchmark before: make-dummies 0 999 -> 0.7s rmmod dummy -> 0.07s make-dummies 0 9999 -> 1m10s rmmod dummy -> 0.4s Benchmark after: make-dummies 0 999 -> 0.44s rmmod dummy -> 0.065s make-dummies 0 9999 -> 1m36s rmmod dummy -> 0.4s Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>