From 25edd8bffd0f7563f0c04c1d219eb89061ce9886 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli
Date: Fri, 4 Sep 2015 15:46:00 -0700
Subject: userfaultfd: linux/Documentation/vm/userfaultfd.txt

This is the latest userfaultfd patchset. The postcopy live migration
feature on the qemu side is mostly ready to be merged and it entirely
depends on the userfaultfd syscall to be merged as well. So it'd be great
if this patchset could be reviewed for merging in -mm.

Userfaults allow to implement on demand paging from userland and more
generally they allow userland to more efficiently take control of the
behavior of page faults than what was available before (PROT_NONE +
SIGSEGV trap).

The use cases are:

1) KVM postcopy live migration (one form of cloud memory
   externalization).

   KVM postcopy live migration is the primary driver of this work:

    http://blog.zhaw.ch/icclab/setting-up-post-copy-live-migration-in-openstack/
    http://lists.gnu.org/archive/html/qemu-devel/2015-02/msg04873.html

2) postcopy live migration of binaries inside linux containers:

    http://thread.gmane.org/gmane.linux.kernel.mm/132662

3) KVM postcopy live snapshotting (allowing to limit/throttle the
   memory usage, unlike fork would, plus the avoidance of fork
   overhead in the first place).

   While the wrprotect tracking is not implemented yet, the syscall API
   is already contemplating the wrprotect fault tracking and it's generic
   enough to allow its later implementation in a backwards compatible
   fashion.

4) KVM userfaults on shared memory. The UFFDIO_COPY lowlevel method
   should be extended to work also on tmpfs and then the
   uffdio_register.ioctls will notify userland that UFFDIO_COPY is
   available even when the registered virtual memory range is tmpfs
   backed.

5) alternate mechanism to notify web browsers or apps on embedded
   devices that volatile pages have been reclaimed. This basically
   avoids the need to run a syscall before the app can access with the
   CPU the virtual regions marked volatile.
   This depends on point 4) to be fulfilled first, as volatile pages
   happily apply to tmpfs.

Even though there wasn't a real use case requesting it yet, it also
allows to implement distributed shared memory in a way that readonly
shared mappings can exist simultaneously in different hosts and they
can become exclusive at the first wrprotect fault.

This patch (of 22):

Add documentation.

Signed-off-by: Andrea Arcangeli
Acked-by: Pavel Emelyanov
Cc: Sanidhya Kashyap
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov"
Cc: Andres Lagar-Cavilla
Cc: Dave Hansen
Cc: Paolo Bonzini
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Andy Lutomirski
Cc: Hugh Dickins
Cc: Peter Feiner
Cc: "Dr. David Alan Gilbert"
Cc: Johannes Weiner
Cc: "Huangpeng (Peter)"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 Documentation/vm/userfaultfd.txt | 142 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 142 insertions(+)
 create mode 100644 Documentation/vm/userfaultfd.txt

(limited to 'Documentation')

diff --git a/Documentation/vm/userfaultfd.txt b/Documentation/vm/userfaultfd.txt
new file mode 100644
index 000000000000..90912925425e
--- /dev/null
+++ b/Documentation/vm/userfaultfd.txt
@@ -0,0 +1,142 @@
+= Userfaultfd =
+
+== Objective ==
+
+Userfaults allow the implementation of on-demand paging from userland
+and more generally they allow userland to take control of various
+memory page faults, something otherwise only the kernel code could do.
+
+For example userfaults allows a proper and more optimal implementation
+of the PROT_NONE+SIGSEGV trick.
+
+== Design ==
+
+Userfaults are delivered and resolved through the userfaultfd syscall.
+
+The userfaultfd (aside from registering and unregistering virtual
+memory ranges) provides two primary functionalities:
+
+1) read/POLLIN protocol to notify a userland thread of the faults
+   happening
+
+2) various UFFDIO_* ioctls that can manage the virtual memory regions
+   registered in the userfaultfd that allows userland to efficiently
+   resolve the userfaults it receives via 1) or to manage the virtual
+   memory in the background
+
+The real advantage of userfaults if compared to regular virtual memory
+management of mremap/mprotect is that the userfaults in all their
+operations never involve heavyweight structures like vmas (in fact the
+userfaultfd runtime load never takes the mmap_sem for writing).
+
+Vmas are not suitable for page- (or hugepage) granular fault tracking
+when dealing with virtual address spaces that could span
+Terabytes. Too many vmas would be needed for that.
+
+The userfaultfd once opened by invoking the syscall, can also be
+passed using unix domain sockets to a manager process, so the same
+manager process could handle the userfaults of a multitude of
+different processes without them being aware about what is going on
+(well of course unless they later try to use the userfaultfd
+themselves on the same region the manager is already tracking, which
+is a corner case that would currently return -EBUSY).
+
+== API ==
+
+When first opened the userfaultfd must be enabled invoking the
+UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or
+a later API version) which will specify the read/POLLIN protocol
+userland intends to speak on the UFFD. The UFFDIO_API ioctl if
+successful (i.e. if the requested uffdio_api.api is spoken also by the
+running kernel), will return into uffdio_api.features and
+uffdio_api.ioctls two 64bit bitmasks of respectively the activated
+feature of the read(2) protocol and the generic ioctl available.
+
+Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
+be invoked (if present in the returned uffdio_api.ioctls bitmask) to
+register a memory range in the userfaultfd by setting the
+uffdio_register structure accordingly. The uffdio_register.mode
+bitmask will specify to the kernel which kind of faults to track for
+the range (UFFDIO_REGISTER_MODE_MISSING would track missing
+pages). The UFFDIO_REGISTER ioctl will return the
+uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
+userfaults on the range registered. Not all ioctls will necessarily be
+supported for all memory types depending on the underlying virtual
+memory backend (anonymous memory vs tmpfs vs real filebacked
+mappings).
+
+Userland can use the uffdio_register.ioctls to manage the virtual
+address space in the background (to add or potentially also remove
+memory from the userfaultfd registered range). This means a userfault
+could be triggering just before userland maps in the background the
+user-faulted page.
+
+The primary ioctl to resolve userfaults is UFFDIO_COPY. That
+atomically copies a page into the userfault registered range and wakes
+up the blocked userfaults (unless uffdio_copy.mode &
+UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctl works similarly to
+UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an
+half copied page since it'll keep userfaulting until the copy has
+finished.
+
+== QEMU/KVM ==
+
+QEMU/KVM is using the userfaultfd syscall to implement postcopy live
+migration. Postcopy live migration is one form of memory
+externalization consisting of a virtual machine running with part or
+all of its memory residing on a different node in the cloud. The
+userfaultfd abstraction is generic enough that not a single line of
+KVM kernel code had to be modified in order to add postcopy live
+migration to QEMU.
+
+Guest async page faults, FOLL_NOWAIT and all other GUP features work
+just fine in combination with userfaults. Userfaults trigger async
+page faults in the guest scheduler so those guest processes that
+aren't waiting for userfaults (i.e. network bound) can keep running in
+the guest vcpus.
+
+It is generally beneficial to run one pass of precopy live migration
+just before starting postcopy live migration, in order to avoid
+generating userfaults for readonly guest regions.
+
+The implementation of postcopy live migration currently uses one
+single bidirectional socket but in the future two different sockets
+will be used (to reduce the latency of the userfaults to the minimum
+possible without having to decrease /proc/sys/net/ipv4/tcp_wmem).
+
+The QEMU in the source node writes all pages that it knows are missing
+in the destination node, into the socket, and the migration thread of
+the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE
+ioctls on the userfaultfd in order to map the received pages into the
+guest (UFFDIO_ZEROCOPY is used if the source page was a zero page).
+
+A different postcopy thread in the destination node listens with
+poll() to the userfaultfd in parallel. When a POLLIN event is
+generated after a userfault triggers, the postcopy thread read() from
+the userfaultfd and receives the fault address (or -EAGAIN in case the
+userfault was already resolved and waken by a UFFDIO_COPY|ZEROPAGE run
+by the parallel QEMU migration thread).
+
+After the QEMU postcopy thread (running in the destination node) gets
+the userfault address it writes the information about the missing page
+into the socket. The QEMU source node receives the information and
+roughly "seeks" to that page address and continues sending all
+remaining missing pages from that new page offset. Soon after that
+(just the time to flush the tcp_wmem queue through the network) the
+migration thread in the QEMU running in the destination node will
+receive the page that triggered the userfault and it'll map it as
+usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it
+was spontaneously sent by the source or if it was an urgent page
+requested through an userfault).
+
+By the time the userfaults start, the QEMU in the destination node
+doesn't need to keep any per-page state bitmap relative to the live
+migration around and a single per-page bitmap has to be maintained in
+the QEMU running in the source node to know which pages are still
+missing in the destination node. The bitmap in the source node is
+checked to find which missing pages to send in round robin and we seek
+over it when receiving incoming userfaults. After sending each page of
+course the bitmap is updated accordingly. It's also useful to avoid
+sending the same page twice (in case the userfault is read by the
+postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration
+thread).
-- 
cgit v1.2.3-58-ga151


From 1038628d80e96e3a086189172d9be8eb85ecfabf Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli
Date: Fri, 4 Sep 2015 15:46:04 -0700
Subject: userfaultfd: uAPI

Defines the uAPI of the userfaultfd, notably the ioctl numbers and
protocol.

Signed-off-by: Andrea Arcangeli
Acked-by: Pavel Emelyanov
Cc: Sanidhya Kashyap
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov"
Cc: Andres Lagar-Cavilla
Cc: Dave Hansen
Cc: Paolo Bonzini
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Andy Lutomirski
Cc: Hugh Dickins
Cc: Peter Feiner
Cc: "Dr. David Alan Gilbert"
Cc: Johannes Weiner
Cc: "Huangpeng (Peter)"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 Documentation/ioctl/ioctl-number.txt |  1 +
 include/uapi/linux/Kbuild            |  1 +
 include/uapi/linux/userfaultfd.h     | 83 ++++++++++++++++++++++++++++++++++++
 3 files changed, 85 insertions(+)
 create mode 100644 include/uapi/linux/userfaultfd.h

(limited to 'Documentation')

diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
index 64df08db4657..39ac6546d4a4 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -303,6 +303,7 @@ Code Seq#(hex) Include File Comments
 0xA3	80-8F	Port ACL		in development:
 0xA3	90-9F	linux/dtlk.h
+0xAA	00-3F	linux/uapi/linux/userfaultfd.h
 0xAB	00-1F	linux/nbd.h
 0xAC	00-1F	linux/raw.h
 0xAD	00	Netfilter device	in development:
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index aafb9937b162..70ff1d9abf0d 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -456,3 +456,4 @@ header-y += xfrm.h
 header-y += xilinx-v4l2-controls.h
 header-y += zorro.h
 header-y += zorro_ids.h
+header-y += userfaultfd.h
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
new file mode 100644
index 000000000000..09c2e2a8c9d6
--- /dev/null
+++ b/include/uapi/linux/userfaultfd.h
@@ -0,0 +1,83 @@
+/*
+ * include/linux/userfaultfd.h
+ *
+ * Copyright (C) 2007 Davide Libenzi
+ * Copyright (C) 2015 Red Hat, Inc.
+ *
+ */
+
+#ifndef _LINUX_USERFAULTFD_H
+#define _LINUX_USERFAULTFD_H
+
+#include
+
+#define UFFD_API ((__u64)0xAA)
+/* FIXME: add "|UFFD_BIT_WP" to UFFD_API_BITS after implementing it */
+#define UFFD_API_BITS (UFFD_BIT_WRITE)
+#define UFFD_API_IOCTLS				\
+	((__u64)1 << _UFFDIO_REGISTER |		\
+	 (__u64)1 << _UFFDIO_UNREGISTER |	\
+	 (__u64)1 << _UFFDIO_API)
+#define UFFD_API_RANGE_IOCTLS			\
+	((__u64)1 << _UFFDIO_WAKE)
+
+/*
+ * Valid ioctl command number range with this API is from 0x00 to
+ * 0x3F. UFFDIO_API is the fixed number, everything else can be
+ * changed by implementing a different UFFD_API. If sticking to the
+ * same UFFD_API more ioctl can be added and userland will be aware of
+ * which ioctl the running kernel implements through the ioctl command
+ * bitmask written by the UFFDIO_API.
+ */
+#define _UFFDIO_REGISTER		(0x00)
+#define _UFFDIO_UNREGISTER		(0x01)
+#define _UFFDIO_WAKE			(0x02)
+#define _UFFDIO_API			(0x3F)
+
+/* userfaultfd ioctl ids */
+#define UFFDIO 0xAA
+#define UFFDIO_API		_IOWR(UFFDIO, _UFFDIO_API,	\
+				      struct uffdio_api)
+#define UFFDIO_REGISTER		_IOWR(UFFDIO, _UFFDIO_REGISTER, \
+				      struct uffdio_register)
+#define UFFDIO_UNREGISTER	_IOR(UFFDIO, _UFFDIO_UNREGISTER,	\
+				     struct uffdio_range)
+#define UFFDIO_WAKE		_IOR(UFFDIO, _UFFDIO_WAKE,	\
+				     struct uffdio_range)
+
+/*
+ * Valid bits below PAGE_SHIFT in the userfault address read through
+ * the read() syscall.
+ */
+#define UFFD_BIT_WRITE	(1<<0)	/* this was a write fault, MISSING or WP */
+#define UFFD_BIT_WP	(1<<1)	/* handle_userfault() reason VM_UFFD_WP */
+#define UFFD_BITS	2	/* two above bits used for UFFD_BIT_* mask */
+
+struct uffdio_api {
+	/* userland asks for an API number */
+	__u64 api;
+
+	/* kernel answers below with the available features for the API */
+	__u64 bits;
+	__u64 ioctls;
+};
+
+struct uffdio_range {
+	__u64 start;
+	__u64 len;
+};
+
+struct uffdio_register {
+	struct uffdio_range range;
+#define UFFDIO_REGISTER_MODE_MISSING	((__u64)1<<0)
+#define UFFDIO_REGISTER_MODE_WP		((__u64)1<<1)
+	__u64 mode;
+
+	/*
+	 * kernel answers which ioctl commands are available for the
+	 * range, keep at the end as the last 8 bytes aren't read.
+	 */
+	__u64 ioctls;
+};
+
+#endif /* _LINUX_USERFAULTFD_H */
-- 
cgit v1.2.3-58-ga151


From a9b85f9415fd9e529d03299e5335433f614ec1fb Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli
Date: Fri, 4 Sep 2015 15:46:37 -0700
Subject: userfaultfd: change the read API to return a uffd_msg

I had requests to return the full address (not the page aligned one) to
userland.

It's not entirely clear how the page offset could be relevant because
userfaults aren't like SIGBUS that can sigjump to a different place and
actually skip resolving the fault depending on a page offset. There's
currently no real way to skip the fault especially because after a
UFFDIO_COPY|ZEROPAGE, the fault is optimized to be retried within the
kernel without having to return to userland first (not even self modifying
code replacing the .text that touched the faulting address would prevent
the fault to be repeated). Userland cannot skip repeating the fault even
more so if the fault was triggered by a KVM secondary page fault or any
get_user_pages or any copy-user inside some syscall which will return to
kernel code. The second time FAULT_FLAG_RETRY_NOWAIT won't be set leading
to a SIGBUS being raised because the userfault can't wait if it cannot
release the mmap_sem first (and FAULT_FLAG_RETRY_NOWAIT is required for
that).

Still returning userland a proper structure during the read() on the uffd,
can allow to use the current UFFD_API for the future non-cooperative
extensions too and it looks cleaner as well. Once we get additional
fields there's no point to return the fault address page aligned anymore
to reuse the bits below PAGE_SHIFT.

The only downside is that the read() syscall will read 32 bytes instead of
8 bytes but that's not going to be measurable overhead.

The total number of new events that can be extended or of new future bits
for already shipped events, is limited to 64 by the features field of the
uffdio_api structure. If more will be needed a bump of UFFD_API will be
required.
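
To illustrate the 32-byte read() record described above, here is a
minimal userland sketch. It is not part of this patch. It assumes the
uffd_msg layout and the UFFD_EVENT_PAGEFAULT / UFFD_PAGEFAULT_FLAG_WRITE
values shown in this patch's userfaultfd.h hunk; every sketch_* and
SKETCH_* name is hypothetical and only mirrors those definitions. It
decodes an in-memory buffer instead of calling read() on a real uffd, so
it can be tried without kernel support.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>
#include <string.h>

/* Hypothetical mirrors of the values defined in this patch. */
#define SKETCH_UFFD_EVENT_PAGEFAULT      0x12
#define SKETCH_UFFD_PAGEFAULT_FLAG_WRITE (1 << 0)

/* Hypothetical userland copy of the uffd_msg layout from this patch. */
struct sketch_uffd_msg {
	uint8_t  event;
	uint8_t  reserved1;
	uint16_t reserved2;
	uint32_t reserved3;
	union {
		struct {
			uint64_t flags;
			uint64_t address;
		} pagefault;
		struct {
			uint64_t reserved1;
			uint64_t reserved2;
			uint64_t reserved3;
		} reserved;
	} arg;
} __attribute__((packed));

/* Same size check the kernel side does with BUILD_BUG_ON(). */
static_assert(sizeof(struct sketch_uffd_msg) == 32,
	      "one read() record is 32 bytes");

/*
 * Decode one record from a buffer (as read() on the uffd would fill it).
 * Returns 0 and sets *address and *is_write for a page fault event,
 * -1 if the buffer is shorter than one record or the event is unknown.
 */
static int sketch_decode_pagefault(const void *buf, size_t len,
				   uint64_t *address, int *is_write)
{
	struct sketch_uffd_msg msg;

	if (len < sizeof(msg))
		return -1;
	memcpy(&msg, buf, sizeof(msg));
	if (msg.event != SKETCH_UFFD_EVENT_PAGEFAULT)
		return -1;
	*address = msg.arg.pagefault.address;
	*is_write = !!(msg.arg.pagefault.flags &
		       SKETCH_UFFD_PAGEFAULT_FLAG_WRITE);
	return 0;
}
```

In a real handler the buffer would presumably come from a read() of at
least 32 bytes on the userfaultfd, and the address is no longer page
aligned, so a caller that maps whole pages would mask it itself.
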
[akpm@linux-foundation.org: use __packed]
Signed-off-by: Andrea Arcangeli
Acked-by: Pavel Emelyanov
Cc: Sanidhya Kashyap
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov"
Cc: Andres Lagar-Cavilla
Cc: Dave Hansen
Cc: Paolo Bonzini
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Andy Lutomirski
Cc: Hugh Dickins
Cc: Peter Feiner
Cc: "Dr. David Alan Gilbert"
Cc: Johannes Weiner
Cc: "Huangpeng (Peter)"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 Documentation/vm/userfaultfd.txt | 12 +++---
 fs/userfaultfd.c                 | 79 +++++++++++++++++++++++-----------------
 include/uapi/linux/userfaultfd.h | 70 +++++++++++++++++++++++++++--------
 3 files changed, 108 insertions(+), 53 deletions(-)

(limited to 'Documentation')

diff --git a/Documentation/vm/userfaultfd.txt b/Documentation/vm/userfaultfd.txt
index 90912925425e..70a3c94d1941 100644
--- a/Documentation/vm/userfaultfd.txt
+++ b/Documentation/vm/userfaultfd.txt
@@ -46,11 +46,13 @@ is a corner case that would currently return -EBUSY).
 When first opened the userfaultfd must be enabled invoking the
 UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or
 a later API version) which will specify the read/POLLIN protocol
-userland intends to speak on the UFFD. The UFFDIO_API ioctl if
-successful (i.e. if the requested uffdio_api.api is spoken also by the
-running kernel), will return into uffdio_api.features and
-uffdio_api.ioctls two 64bit bitmasks of respectively the activated
-feature of the read(2) protocol and the generic ioctl available.
+userland intends to speak on the UFFD and the uffdio_api.features
+userland requires. The UFFDIO_API ioctl if successful (i.e. if the
+requested uffdio_api.api is spoken also by the running kernel and the
+requested features are going to be enabled) will return into
+uffdio_api.features and uffdio_api.ioctls two 64bit bitmasks of
+respectively all the available features of the read(2) protocol and
+the generic ioctl available.

 Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
 be invoked (if present in the returned uffdio_api.ioctls bitmask) to
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 0756d97b0666..1f2ddaaf3c03 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -50,7 +50,7 @@ struct userfaultfd_ctx {
 };

 struct userfaultfd_wait_queue {
-	unsigned long address;
+	struct uffd_msg msg;
 	wait_queue_t wq;
 	bool pending;
 	struct userfaultfd_ctx *ctx;
@@ -77,7 +77,8 @@ static int userfaultfd_wake_function(wait_queue_t *wq, unsigned mode,
 	/* len == 0 means wake all */
 	start = range->start;
 	len = range->len;
-	if (len && (start > uwq->address || start + len <= uwq->address))
+	if (len && (start > uwq->msg.arg.pagefault.address ||
+		    start + len <= uwq->msg.arg.pagefault.address))
 		goto out;
 	ret = wake_up_state(wq->private, mode);
 	if (ret)
@@ -135,28 +136,43 @@ static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx)
 	}
 }

-static inline unsigned long userfault_address(unsigned long address,
-					      unsigned int flags,
-					      unsigned long reason)
+static inline void msg_init(struct uffd_msg *msg)
 {
-	BUILD_BUG_ON(PAGE_SHIFT < UFFD_BITS);
-	address &= PAGE_MASK;
+	BUILD_BUG_ON(sizeof(struct uffd_msg) != 32);
+	/*
+	 * Must use memset to zero out the paddings or kernel data is
+	 * leaked to userland.
+	 */
+	memset(msg, 0, sizeof(struct uffd_msg));
+}
+
+static inline struct uffd_msg userfault_msg(unsigned long address,
+					    unsigned int flags,
+					    unsigned long reason)
+{
+	struct uffd_msg msg;
+	msg_init(&msg);
+	msg.event = UFFD_EVENT_PAGEFAULT;
+	msg.arg.pagefault.address = address;
 	if (flags & FAULT_FLAG_WRITE)
 		/*
-		 * Encode "write" fault information in the LSB of the
-		 * address read by userland, without depending on
-		 * FAULT_FLAG_WRITE kernel internal value.
+		 * If UFFD_FEATURE_PAGEFAULT_FLAG_WRITE was set in the
+		 * uffdio_api.features and UFFD_PAGEFAULT_FLAG_WRITE
+		 * was not set in a UFFD_EVENT_PAGEFAULT, it means it
+		 * was a read fault, otherwise if set it means it's
+		 * a write fault.
 		 */
-		address |= UFFD_BIT_WRITE;
+		msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_WRITE;
 	if (reason & VM_UFFD_WP)
 		/*
-		 * Encode "reason" fault information as bit number 1
-		 * in the address read by userland. If bit number 1 is
-		 * clear it means the reason is a VM_FAULT_MISSING
-		 * fault.
+		 * If UFFD_FEATURE_PAGEFAULT_FLAG_WP was set in the
+		 * uffdio_api.features and UFFD_PAGEFAULT_FLAG_WP was
+		 * not set in a UFFD_EVENT_PAGEFAULT, it means it was
+		 * a missing fault, otherwise if set it means it's a
+		 * write protect fault.
 		 */
-		address |= UFFD_BIT_WP;
-	return address;
+		msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_WP;
+	return msg;
 }

 /*
@@ -242,7 +258,7 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,

 	init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
 	uwq.wq.private = current;
-	uwq.address = userfault_address(address, flags, reason);
+	uwq.msg = userfault_msg(address, flags, reason);
 	uwq.pending = true;
 	uwq.ctx = ctx;

@@ -398,7 +414,7 @@ static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
 }

 static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
-				    __u64 *addr)
+				    struct uffd_msg *msg)
 {
 	ssize_t ret;
 	DECLARE_WAITQUEUE(wait, current);
@@ -416,8 +432,8 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
 			 * disappear from under us.
 			 */
 			uwq->pending = false;
-			/* careful to always initialize addr if ret == 0 */
-			*addr = uwq->address;
+			/* careful to always initialize msg if ret == 0 */
+			*msg = uwq->msg;
 			spin_unlock(&ctx->fault_wqh.lock);
 			ret = 0;
 			break;
@@ -447,8 +463,7 @@ static ssize_t userfaultfd_read(struct file *file, char __user *buf,
 {
 	struct userfaultfd_ctx *ctx = file->private_data;
 	ssize_t _ret, ret = 0;
-	/* careful to always initialize addr if ret == 0 */
-	__u64 uninitialized_var(addr);
+	struct uffd_msg msg;
 	int no_wait = file->f_flags & O_NONBLOCK;

 	if (ctx->state == UFFD_STATE_WAIT_API)
@@ -456,16 +471,16 @@ static ssize_t userfaultfd_read(struct file *file, char __user *buf,
 	BUG_ON(ctx->state != UFFD_STATE_RUNNING);

 	for (;;) {
-		if (count < sizeof(addr))
+		if (count < sizeof(msg))
 			return ret ? ret : -EINVAL;
-		_ret = userfaultfd_ctx_read(ctx, no_wait, &addr);
+		_ret = userfaultfd_ctx_read(ctx, no_wait, &msg);
 		if (_ret < 0)
 			return ret ? ret : _ret;
-		if (put_user(addr, (__u64 __user *) buf))
+		if (copy_to_user((__u64 __user *) buf, &msg, sizeof(msg)))
 			return ret ? ret : -EFAULT;
-		ret += sizeof(addr);
-		buf += sizeof(addr);
-		count -= sizeof(addr);
+		ret += sizeof(msg);
+		buf += sizeof(msg);
+		count -= sizeof(msg);
 		/*
 		 * Allow to read more than one fault at time but only
 		 * block if waiting for the very first one.
@@ -873,17 +888,15 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
 	if (ctx->state != UFFD_STATE_WAIT_API)
 		goto out;
 	ret = -EFAULT;
-	if (copy_from_user(&uffdio_api, buf, sizeof(__u64)))
+	if (copy_from_user(&uffdio_api, buf, sizeof(uffdio_api)))
 		goto out;
-	if (uffdio_api.api != UFFD_API) {
-		/* careful not to leak info, we only read the first 8 bytes */
+	if (uffdio_api.api != UFFD_API || uffdio_api.features) {
 		memset(&uffdio_api, 0, sizeof(uffdio_api));
 		if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
 			goto out;
 		ret = -EINVAL;
 		goto out;
 	}
-	/* careful not to leak info, we only read the first 8 bytes */
 	uffdio_api.features = UFFD_API_FEATURES;
 	uffdio_api.ioctls = UFFD_API_IOCTLS;
 	ret = -EFAULT;
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 330206016249..a5f8825381ef 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -11,9 +11,15 @@

 #include

+#include
+
 #define UFFD_API ((__u64)0xAA)
-/* FIXME: add "|UFFD_FEATURE_WP" to UFFD_API_FEATURES after implementing it */
-#define UFFD_API_FEATURES (UFFD_FEATURE_WRITE_BIT)
+/*
+ * After implementing the respective features it will become:
+ * #define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP | \
+ *			      UFFD_FEATURE_EVENT_FORK)
+ */
+#define UFFD_API_FEATURES (0)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -45,26 +51,60 @@
 #define UFFDIO_WAKE		_IOR(UFFDIO, _UFFDIO_WAKE,	\
 				     struct uffdio_range)

-/*
- * Valid bits below PAGE_SHIFT in the userfault address read through
- * the read() syscall.
- */
-#define UFFD_BIT_WRITE	(1<<0)	/* this was a write fault, MISSING or WP */
-#define UFFD_BIT_WP	(1<<1)	/* handle_userfault() reason VM_UFFD_WP */
-#define UFFD_BITS	2	/* two above bits used for UFFD_BIT_* mask */
+/* read() structure */
+struct uffd_msg {
+	__u8	event;
+
+	__u8	reserved1;
+	__u16	reserved2;
+	__u32	reserved3;
+
+	union {
+		struct {
+			__u64	flags;
+			__u64	address;
+		} pagefault;
+
+		struct {
+			/* unused reserved fields */
+			__u64	reserved1;
+			__u64	reserved2;
+			__u64	reserved3;
+		} reserved;
+	} arg;
+} __packed;

 /*
- * Features reported in uffdio_api.features field
+ * Start at 0x12 and not at 0 to be more strict against bugs.
  */
-#define UFFD_FEATURE_WRITE_BIT	(1<<0)	/* Corresponds to UFFD_BIT_WRITE */
-#define UFFD_FEATURE_WP_BIT	(1<<1)	/* Corresponds to UFFD_BIT_WP */
+#define UFFD_EVENT_PAGEFAULT	0x12
+#if 0 /* not available yet */
+#define UFFD_EVENT_FORK		0x13
+#endif
+
+/* flags for UFFD_EVENT_PAGEFAULT */
+#define UFFD_PAGEFAULT_FLAG_WRITE	(1<<0)	/* If this was a write fault */
+#define UFFD_PAGEFAULT_FLAG_WP		(1<<1)	/* If reason is VM_UFFD_WP */

 struct uffdio_api {
-	/* userland asks for an API number */
+	/* userland asks for an API number and the features to enable */
 	__u64 api;
-
-	/* kernel answers below with the available features for the API */
+	/*
+	 * Kernel answers below with the all available features for
+	 * the API, this notifies userland of which events and/or
+	 * which flags for each event are enabled in the current
+	 * kernel.
+	 *
+	 * Note: UFFD_EVENT_PAGEFAULT and UFFD_PAGEFAULT_FLAG_WRITE
+	 * are to be considered implicitly always enabled in all kernels as
+	 * long as the uffdio_api.api requested matches UFFD_API.
+	 */
+#if 0 /* not available yet */
+#define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
+#define UFFD_FEATURE_EVENT_FORK			(1<<1)
+#endif
 	__u64 features;
+
 	__u64 ioctls;
 };

-- 
cgit v1.2.3-58-ga151


From c7e1e3ccfbd153c890240a391f258efaedfa94d0 Mon Sep 17 00:00:00 2001
From: Mel Gorman
Date: Fri, 4 Sep 2015 15:47:38 -0700
Subject: Documentation/features/vm: add feature description and arch support status for batched TLB flush after unmap

Signed-off-by: Mel Gorman
Acked-by: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 Documentation/features/vm/TLB/arch-support.txt | 40 ++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)
 create mode 100644 Documentation/features/vm/TLB/arch-support.txt

(limited to 'Documentation')

diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
new file mode 100644
index 000000000000..261b92e2fb1a
--- /dev/null
+++ b/Documentation/features/vm/TLB/arch-support.txt
@@ -0,0 +1,40 @@
+#
+# Feature name:          batch-unmap-tlb-flush
+#         Kconfig:       ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+#         description:   arch supports deferral of TLB flush until multiple pages are unmapped
+#
+    -----------------------
+    |         arch |status|
+    -----------------------
+    |       alpha: | TODO |
+    |         arc: | TODO |
+    |         arm: | TODO |
+    |       arm64: | TODO |
+    |       avr32: |  ..  |
+    |    blackfin: | TODO |
+    |         c6x: |  ..  |
+    |        cris: |  ..  |
+    |         frv: |  ..  |
+    |       h8300: |  ..  |
+    |     hexagon: | TODO |
+    |        ia64: | TODO |
+    |        m32r: | TODO |
+    |        m68k: |  ..  |
+    |       metag: | TODO |
+    |  microblaze: |  ..  |
+    |        mips: | TODO |
+    |     mn10300: | TODO |
+    |       nios2: |  ..  |
+    |    openrisc: |  ..  |
+    |      parisc: | TODO |
+    |     powerpc: | TODO |
+    |        s390: | TODO |
+    |       score: |  ..  |
+    |          sh: | TODO |
+    |       sparc: | TODO |
+    |        tile: | TODO |
+    |          um: |  ..  |
+    |   unicore32: |  ..  |
+    |         x86: |  ok  |
+    |      xtensa: | TODO |
+    -----------------------
-- 
cgit v1.2.3-58-ga151