diff options
author | Christian Brauner <christian.brauner@ubuntu.com> | 2020-05-31 13:50:30 +0200 |
---|---|---|
committer | Kees Cook <keescook@chromium.org> | 2020-07-10 16:01:51 -0700 |
commit | 99cdb8b9a57393b5978e7a6310a2cba511dd179b (patch) | |
tree | ebc2e0c154d2fa845bf8d25736ecfa854e99a834 /kernel/seccomp.c | |
parent | 76194c4e830d570d9e369d637bb907591d2b3111 (diff) |
seccomp: notify about unused filter
We've been making heavy use of the seccomp notifier to intercept and
handle certain syscalls for containers. This patch allows a syscall
supervisor listening on a given notifier to be notified when a seccomp
filter has become unused.
A container is often managed by a singleton supervisor process the
so-called "monitor". This monitor process has an event loop which has
various event handlers registered. If the user specified a seccomp
profile that included a notifier for various syscalls then we also
register a seccomp notify even handler. For any container using a
separate pid namespace the lifecycle of the seccomp notifier is bound to
the init process of the pid namespace, i.e. when the init process exits
the filter must be unused.
If a new process attaches to a container we force it to assume a seccomp
profile. This can either be the same seccomp profile as the container
was started with or a modified one. If the attaching process makes use
of the seccomp notifier we will register a new seccomp notifier handler
in the monitor's event loop. However, when the attaching process exits
we can't simply delete the handler since other child processes could've
been created (daemons spawned etc.) that have inherited the seccomp
filter and so we need to keep the seccomp notifier fd alive in the event
loop. But this is problematic since we don't get a notification when the
seccomp filter has become unused and so we currently never remove the
seccomp notifier fd from the event loop and just keep accumulating fds
in the event loop. We've had this issue for a while but it has recently
become more pressing as more and larger users make use of this.
To fix this, we introduce a new "users" reference counter that tracks any
tasks and dependent filters making use of a filter. When a notifier is
registered waiting tasks will be notified that the filter is now empty
by receiving a (E)POLLHUP event.
The concept in this patch introduces is the same as for signal_struct,
i.e. reference counting for life-cycle management is decoupled from
reference counting taks using the object. There's probably some trickery
possible but the second counter is just the correct way of doing this
IMHO and has precedence.
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matt Denton <mpdenton@google.com>
Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Jann Horn <jannh@google.com>
Cc: Chris Palmer <palmer@google.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Robert Sesek <rsesek@google.com>
Cc: Jeffrey Vander Stoep <jeffv@google.com>
Cc: Linux Containers <containers@lists.linux-foundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20200531115031.391515-3-christian.brauner@ubuntu.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Diffstat (limited to 'kernel/seccomp.c')
-rw-r--r-- | kernel/seccomp.c | 44 |
1 files changed, 39 insertions, 5 deletions
diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 39c1c687d8ad..ba30bffa9301 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -110,6 +110,14 @@ struct notification { * attached task, once for the dependent filter, and if * requested for the user notifier. When @refs reaches zero, * the filter can be freed. + * @users: A filter's @users count is incremented for each directly + * attached task (filter installation, fork(), thread_sync), + * and once for the dependent filter (tracked in filter->prev). + * When it reaches zero it indicates that no direct or indirect + * users of that filter exist. No new tasks can get associated with + * this filter after reaching 0. The @users count is always smaller + * or equal to @refs. Hence, reaching 0 for @users does not mean + * the filter can be freed. * @log: true if all actions except for SECCOMP_RET_ALLOW should be logged * @prev: points to a previously installed, or inherited, filter * @prog: the BPF program to evaluate @@ -129,6 +137,7 @@ struct notification { */ struct seccomp_filter { refcount_t refs; + refcount_t users; bool log; struct seccomp_filter *prev; struct bpf_prog *prog; @@ -376,6 +385,15 @@ static inline void seccomp_filter_free(struct seccomp_filter *filter) } } +static void __seccomp_filter_orphan(struct seccomp_filter *orig) +{ + while (orig && refcount_dec_and_test(&orig->users)) { + if (waitqueue_active(&orig->wqh)) + wake_up_poll(&orig->wqh, EPOLLHUP); + orig = orig->prev; + } +} + static void __put_seccomp_filter(struct seccomp_filter *orig) { /* Clean up single-reference branches iteratively. */ @@ -386,10 +404,18 @@ static void __put_seccomp_filter(struct seccomp_filter *orig) } } +static void __seccomp_filter_release(struct seccomp_filter *orig) +{ + /* Notify about any unused filters in the task's former filter tree. */ + __seccomp_filter_orphan(orig); + /* Finally drop all references to the task's former tree. */ + __put_seccomp_filter(orig); +} + /** - * seccomp_filter_release - Detach the task from its filter tree - * and drop its reference count during - * exit. + * seccomp_filter_release - Detach the task from its filter tree, + * drop its reference count, and notify + * about unused filters * * This function should only be called when the task is exiting as * it detaches it from its filter tree. As such, READ_ONCE() and @@ -401,7 +427,7 @@ void seccomp_filter_release(struct task_struct *tsk) /* Detach task from its filter tree. */ tsk->seccomp.filter = NULL; - __put_seccomp_filter(orig); + __seccomp_filter_release(orig); } /** @@ -428,12 +454,15 @@ static inline void seccomp_sync_threads(unsigned long flags) /* Get a task reference for the new leaf node. */ get_seccomp_filter(caller); + /* * Drop the task reference to the shared ancestor since * current's path will hold a reference. (This also * allows a put before the assignment.) */ - __put_seccomp_filter(thread->seccomp.filter); + __seccomp_filter_release(thread->seccomp.filter); + + /* Make our new filter tree visible. */ smp_store_release(&thread->seccomp.filter, caller->seccomp.filter); atomic_set(&thread->seccomp.filter_count, @@ -502,6 +531,7 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) } refcount_set(&sfilter->refs, 1); + refcount_set(&sfilter->users, 1); init_waitqueue_head(&sfilter->wqh); return sfilter; @@ -606,6 +636,7 @@ void get_seccomp_filter(struct task_struct *tsk) if (!orig) return; __get_seccomp_filter(orig); + refcount_inc(&orig->users); } static void seccomp_init_siginfo(kernel_siginfo_t *info, int syscall, int reason) @@ -1234,6 +1265,9 @@ static __poll_t seccomp_notify_poll(struct file *file, mutex_unlock(&filter->notify_lock); + if (refcount_read(&filter->users) == 0) + ret |= EPOLLHUP; + return ret; } |