Age | Commit message | Author
2020-10-13 | f2fs: handle errors of f2fs_get_meta_page_nofail | Jaegeuk Kim
The first problem is that we hit BUG_ON() in f2fs_get_sum_page() given EIO from f2fs_get_meta_page_nofail(). The quick fix was to retry in an infinite loop instead of returning any error, but syzbot caught a case where a fuzzed image enters that loop. It turned out we abused f2fs_get_meta_page_nofail(), as in the call stack below.

- f2fs_fill_super
 - f2fs_build_segment_manager
  - build_sit_entries
   - get_current_sit_page

INFO: task syz-executor178:6870 can't die for more than 143 seconds.
task:syz-executor178 state:R stack:26960 pid: 6870 ppid: 6869 flags:0x00004006
Call Trace:

Showing all locks held in the system:
1 lock held by khungtaskd/1179:
 #0: ffffffff8a554da0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x53/0x260 kernel/locking/lockdep.c:6242
1 lock held by systemd-journal/3920:
1 lock held by in:imklog/6769:
 #0: ffff88809eebc130 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0xe9/0x100 fs/file.c:930
1 lock held by syz-executor178/6870:
 #0: ffff8880925120e0 (&type->s_umount_key#47/1){+.+.}-{3:3}, at: alloc_super+0x201/0xaf0 fs/super.c:229

Actually, we did not have to use _nofail in this case, since we could already return the error to mount(2) through the error handler. As a result, this patch tries to 1) remove _nofail callers as much as possible, and 2) deal with the error case in the last remaining caller, f2fs_get_sum_page().

Reported-by: syzbot+ee250ac8137be41d7b13@syzkaller.appspotmail.com Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-10-09 | f2fs: fix to set SBI_NEED_FSCK flag for inconsistent inode | Chao Yu
If a compressed inode has inconsistent i_compress_algorithm, i_compr_blocks or i_log_cluster_size fields, we failed to set SBI_NEED_FSCK to notify fsck to repair the inode. Fix it. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-10-08 | f2fs: reject CASEFOLD inode flag without casefold feature | Eric Biggers
syzbot reported: general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f] CPU: 0 PID: 6860 Comm: syz-executor835 Not tainted 5.9.0-rc8-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:utf8_casefold+0x43/0x1b0 fs/unicode/utf8-core.c:107 [...] Call Trace: f2fs_init_casefolded_name fs/f2fs/dir.c:85 [inline] __f2fs_setup_filename fs/f2fs/dir.c:118 [inline] f2fs_prepare_lookup+0x3bf/0x640 fs/f2fs/dir.c:163 f2fs_lookup+0x10d/0x920 fs/f2fs/namei.c:494 __lookup_hash+0x115/0x240 fs/namei.c:1445 filename_create+0x14b/0x630 fs/namei.c:3467 user_path_create fs/namei.c:3524 [inline] do_mkdirat+0x56/0x310 fs/namei.c:3664 do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 [...] The problem is that an inode has F2FS_CASEFOLD_FL set, but the filesystem doesn't have the casefold feature flag set, and therefore super_block::s_encoding is NULL. Fix this by making sanity_check_inode() reject inodes that have F2FS_CASEFOLD_FL when the filesystem doesn't have the casefold feature. Reported-by: syzbot+05139c4039d0679e19ff@syzkaller.appspotmail.com Fixes: 2c2eb7a300cd ("f2fs: Support case-insensitive file name lookups") Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-10-08 | f2fs: fix memory alignment to support 32bit | Jaegeuk Kim
On 32-bit systems, the 64-bit key breaks memory alignment. This fixes the commit "f2fs: support 64-bits key in f2fs rb-tree node entry". Reported-by: Nicolas Chauvet <kwizart@gmail.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-29 | f2fs: fix slab leak of rpages pointer | Jaegeuk Kim
This fixes the below mem leak. [ 130.157600] ============================================================================= [ 130.159662] BUG f2fs_page_array_entry-252:16 (Tainted: G W O ): Objects remaining in f2fs_page_array_entry-252:16 on __kmem_cache_shutdown() [ 130.162742] ----------------------------------------------------------------------------- [ 130.162742] [ 130.164979] Disabling lock debugging due to kernel taint [ 130.166188] INFO: Slab 0x000000009f5a52d2 objects=22 used=4 fp=0x00000000ba72c3e9 flags=0xfffffc0010200 [ 130.168269] CPU: 7 PID: 3560 Comm: umount Tainted: G B W O 5.9.0-rc4+ #35 [ 130.170019] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014 [ 130.171941] Call Trace: [ 130.172528] dump_stack+0x74/0x9a [ 130.173298] slab_err+0xb7/0xdc [ 130.174044] ? kernel_poison_pages+0xc0/0xc0 [ 130.175065] ? on_each_cpu_cond_mask+0x48/0x90 [ 130.176096] __kmem_cache_shutdown.cold+0x34/0x141 [ 130.177190] kmem_cache_destroy+0x59/0x100 [ 130.178223] f2fs_destroy_page_array_cache+0x15/0x20 [f2fs] [ 130.179527] f2fs_put_super+0x1bc/0x380 [f2fs] [ 130.180538] generic_shutdown_super+0x72/0x110 [ 130.181547] kill_block_super+0x27/0x50 [ 130.182438] kill_f2fs_super+0x76/0xe0 [f2fs] [ 130.183448] deactivate_locked_super+0x3b/0x80 [ 130.184456] deactivate_super+0x3e/0x50 [ 130.185363] cleanup_mnt+0x109/0x160 [ 130.186179] __cleanup_mnt+0x12/0x20 [ 130.187003] task_work_run+0x70/0xb0 [ 130.187841] exit_to_user_mode_prepare+0x18f/0x1b0 [ 130.188917] syscall_exit_to_user_mode+0x31/0x170 [ 130.189989] do_syscall_64+0x45/0x90 [ 130.190828] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 130.191986] RIP: 0033:0x7faf868ea2eb [ 130.192815] Code: 7b 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 90 f3 0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 7b 0c 00 f7 d8 64 89 01 [ 130.196872] RSP: 002b:00007fffb7edb478 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [ 130.198494] RAX: 
0000000000000000 RBX: 00007faf86a18204 RCX: 00007faf868ea2eb [ 130.201021] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000055971df71c50 [ 130.203415] RBP: 000055971df71a40 R08: 0000000000000000 R09: 00007fffb7eda1f0 [ 130.205772] R10: 00007faf86a04339 R11: 0000000000000246 R12: 000055971df71c50 [ 130.208150] R13: 0000000000000000 R14: 000055971df71b38 R15: 0000000000000000 [ 130.210515] INFO: Object 0x00000000a980843a @offset=744 [ 130.212476] INFO: Allocated in page_array_alloc+0x3d/0xe0 [f2fs] age=1572 cpu=0 pid=3297 [ 130.215030] __slab_alloc+0x20/0x40 [ 130.216566] kmem_cache_alloc+0x2a0/0x2e0 [ 130.218217] page_array_alloc+0x3d/0xe0 [f2fs] [ 130.219940] f2fs_init_compress_ctx+0x1f/0x40 [f2fs] [ 130.221736] f2fs_write_cache_pages+0x3db/0x860 [f2fs] [ 130.223591] f2fs_write_data_pages+0x2c9/0x300 [f2fs] [ 130.225414] do_writepages+0x43/0xd0 [ 130.226907] __filemap_fdatawrite_range+0xd5/0x110 [ 130.228632] filemap_write_and_wait_range+0x48/0xb0 [ 130.230336] __generic_file_write_iter+0x18a/0x1d0 [ 130.232035] f2fs_file_write_iter+0x226/0x550 [f2fs] [ 130.233737] new_sync_write+0x113/0x1a0 [ 130.235204] vfs_write+0x1a6/0x200 [ 130.236579] ksys_write+0x67/0xe0 [ 130.237898] __x64_sys_write+0x1a/0x20 [ 130.239309] do_syscall_64+0x38/0x90 Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-29 | f2fs: compress: fix to disallow enabling compress on non-empty file | Chao Yu
Compressed inodes and normal inodes have different layouts, so we should disallow enabling compression on a non-empty file to avoid a race condition while the inode's .i_addr array is being parsed and updated. Signed-off-by: Chao Yu <yuchao0@huawei.com> [Jaegeuk Kim: Fix missing condition] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-29 | f2fs: compress: introduce cic/dic slab cache | Chao Yu
Add two slab caches: "f2fs_cic_entry" and "f2fs_dic_entry" for memory allocation of compress_io_ctx and decompress_io_ctx structure. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-29 | f2fs: compress: introduce page array slab cache | Chao Yu
Add a per-sbi slab cache "f2fs_page_array_entry-%u:%u" for memory allocation of page pointer array in compress context. Signed-off-by: Chao Yu <yuchao0@huawei.com> [Jaegeuk Kim: Fix wrong memory allocation] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-29 | f2fs: fix to do sanity check on segment/section count | Chao Yu
As syzbot reported: BUG: KASAN: slab-out-of-bounds in init_min_max_mtime fs/f2fs/segment.c:4710 [inline] BUG: KASAN: slab-out-of-bounds in f2fs_build_segment_manager+0x9302/0xa6d0 fs/f2fs/segment.c:4792 Read of size 8 at addr ffff8880a1b934a8 by task syz-executor682/6878 CPU: 1 PID: 6878 Comm: syz-executor682 Not tainted 5.9.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x198/0x1fd lib/dump_stack.c:118 print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383 __kasan_report mm/kasan/report.c:513 [inline] kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530 init_min_max_mtime fs/f2fs/segment.c:4710 [inline] f2fs_build_segment_manager+0x9302/0xa6d0 fs/f2fs/segment.c:4792 f2fs_fill_super+0x381a/0x6e80 fs/f2fs/super.c:3633 mount_bdev+0x32e/0x3f0 fs/super.c:1417 legacy_get_tree+0x105/0x220 fs/fs_context.c:592 vfs_get_tree+0x89/0x2f0 fs/super.c:1547 do_new_mount fs/namespace.c:2875 [inline] path_mount+0x1387/0x20a0 fs/namespace.c:3192 do_mount fs/namespace.c:3205 [inline] __do_sys_mount fs/namespace.c:3413 [inline] __se_sys_mount fs/namespace.c:3390 [inline] __x64_sys_mount+0x27f/0x300 fs/namespace.c:3390 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The root cause is: if segs_per_sec is larger than one, and segment count in last section is less than segs_per_sec, we will suffer out-of-boundary memory access on sit_i->sentries[] in init_min_max_mtime(). Fix this by adding sanity check among segment count, section count and segs_per_sec value in sanity_check_raw_super(). Reported-by: syzbot+481a3ffab50fed41dcc0@syzkaller.appspotmail.com Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
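The relationship this check enforces can be sketched in user-space C. This is a minimal sketch, not the kernel's sanity_check_raw_super(): the helper name geometry_is_consistent() and its parameters are hypothetical, and it assumes that a consistent layout means every section is fully populated, i.e. the main-area segment count equals the section count times segs_per_sec.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code.
 * Assumption: a consistent geometry has no partially filled last
 * section, so segment_count_main == section_count * segs_per_sec.
 */
bool geometry_is_consistent(uint32_t segment_count_main,
                            uint32_t section_count,
                            uint32_t segs_per_sec)
{
    /* Zero values can never describe a usable main area. */
    if (section_count == 0 || segs_per_sec == 0)
        return false;

    /* Widen to 64 bits so the product cannot wrap around. */
    return (uint64_t)section_count * segs_per_sec == segment_count_main;
}
```

With segs_per_sec = 2, ten segments in five sections pass this check, while nine segments in five sections (a short last section, the case described above) are rejected.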
2020-09-29 | f2fs: fix to check segment boundary during SIT page readahead | Chao Yu
As syzbot reported: kernel BUG at fs/f2fs/segment.h:657! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 16220 Comm: syz-executor.0 Not tainted 5.9.0-rc5-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:f2fs_ra_meta_pages+0xa51/0xdc0 fs/f2fs/segment.h:657 Call Trace: build_sit_entries fs/f2fs/segment.c:4195 [inline] f2fs_build_segment_manager+0x4b8a/0xa3c0 fs/f2fs/segment.c:4779 f2fs_fill_super+0x377d/0x6b80 fs/f2fs/super.c:3633 mount_bdev+0x32e/0x3f0 fs/super.c:1417 legacy_get_tree+0x105/0x220 fs/fs_context.c:592 vfs_get_tree+0x89/0x2f0 fs/super.c:1547 do_new_mount fs/namespace.c:2875 [inline] path_mount+0x1387/0x2070 fs/namespace.c:3192 do_mount fs/namespace.c:3205 [inline] __do_sys_mount fs/namespace.c:3413 [inline] __se_sys_mount fs/namespace.c:3390 [inline] __x64_sys_mount+0x27f/0x300 fs/namespace.c:3390 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 @blkno in f2fs_ra_meta_pages could exceed max segment count, causing panic in following sanity check in current_sit_addr(), add check condition to avoid this issue. Reported-by: syzbot+3698081bcf0bb2d12174@syzkaller.appspotmail.com Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-29 | f2fs: fix uninit-value in f2fs_lookup | Chao Yu
As syzbot reported:

Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x21c/0x280 lib/dump_stack.c:118
 kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:122
 __msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:219
 f2fs_lookup+0xe05/0x1a80 fs/f2fs/namei.c:503
 lookup_open fs/namei.c:3082 [inline]
 open_last_lookups fs/namei.c:3177 [inline]
 path_openat+0x2729/0x6a90 fs/namei.c:3365
 do_filp_open+0x2b8/0x710 fs/namei.c:3395
 do_sys_openat2+0xa88/0x1140 fs/open.c:1168
 do_sys_open fs/open.c:1184 [inline]
 __do_compat_sys_openat fs/open.c:1242 [inline]
 __se_compat_sys_openat+0x2a4/0x310 fs/open.c:1240
 __ia32_compat_sys_openat+0x56/0x70 fs/open.c:1240
 do_syscall_32_irqs_on arch/x86/entry/common.c:80 [inline]
 __do_fast_syscall_32+0x129/0x180 arch/x86/entry/common.c:139
 do_fast_syscall_32+0x6a/0xc0 arch/x86/entry/common.c:162
 do_SYSENTER_32+0x73/0x90 arch/x86/entry/common.c:205
 entry_SYSENTER_compat_after_hwframe+0x4d/0x5c

In f2fs_lookup(), @res_page could be used before being initialized: in __f2fs_find_entry(), once F2FS_I(dir)->i_current_depth has been fuzzed to zero, @res_page is never initialized, which causes this KMSAN warning. Relocate the @res_page initialization to fix this bug.

Reported-by: syzbot+0eac6f0bbd558fd866d7@syzkaller.appspotmail.com Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-29 | f2fs: remove unneeded parameter in find_in_block() | Chao Yu
We can relocate @res_page assignment in find_in_block() to its caller, so unneeded parameter could be removed for cleanup. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-29 | f2fs: fix wrong total_sections check and fsmeta check | Wang Xiaojun
Meta area is not included in section_count computation. So the minimum number of total_sections is 1 meanwhile it cannot be greater than segment_count_main. The minimum number of meta segments is 8 (SB + 2 (CP + SIT + NAT) + SSA). Signed-off-by: Wang Xiaojun <wangxiaojun11@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
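The arithmetic behind the two limits can be written out as a short C sketch. The names below are hypothetical and this is not the kernel's code. It assumes the formula above means one segment each for SB and SSA, and two copies of one segment each for CP, SIT and NAT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed minimum sizes, in segments, read off the formula in the message. */
enum {
    SB_SEGS     = 1, /* superblock */
    CP_SEGS     = 1, /* checkpoint, one copy */
    SIT_SEGS    = 1, /* segment information table, one copy */
    NAT_SEGS    = 1, /* node address table, one copy */
    SSA_SEGS    = 1, /* segment summary area */
    META_COPIES = 2  /* CP, SIT and NAT are each counted twice */
};

/* SB + 2 * (CP + SIT + NAT) + SSA = 1 + 2 * 3 + 1 = 8 */
uint32_t min_meta_segments(void)
{
    return SB_SEGS + META_COPIES * (CP_SEGS + SIT_SEGS + NAT_SEGS) + SSA_SEGS;
}

/* total_sections must be at least 1 and at most segment_count_main. */
bool total_sections_in_range(uint32_t total_sections,
                             uint32_t segment_count_main)
{
    return total_sections >= 1 && total_sections <= segment_count_main;
}
```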
2020-09-29 | f2fs: remove duplicated code in sanity_check_area_boundary | Wang Xiaojun
Use seg_end_blkaddr instead of "segment0_blkaddr + (segment_count << log_blocks_per_seg)". Signed-off-by: Wang Xiaojun <wangxiaojun11@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-29 | f2fs: remove unused check on version_bitmap | Wang Xiaojun
__bitmap_ptr() will not return NULL here. Remove the unused check. Signed-off-by: Wang Xiaojun <wangxiaojun11@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-29 | f2fs: relocate blkzoned feature check | Chao Yu
Relocate blkzoned feature check into parse_options() like other feature check. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-29 | f2fs: do sanity check on zoned block device path | Chao Yu
sbi->devs is initialized only if the image enables the multiple-device feature or the blkzoned feature. If the blkzoned feature flag was set by fuzzing on a non-blkzoned device, we will suffer the panic below:

 get_zone_idx fs/f2fs/segment.c:4892 [inline]
 f2fs_usable_zone_blks_in_seg fs/f2fs/segment.c:4943 [inline]
 f2fs_usable_blks_in_seg+0x39b/0xa00 fs/f2fs/segment.c:4999
Call Trace:
 check_block_count+0x69/0x4e0 fs/f2fs/segment.h:704
 build_sit_entries fs/f2fs/segment.c:4403 [inline]
 f2fs_build_segment_manager+0x51da/0xa370 fs/f2fs/segment.c:5100
 f2fs_fill_super+0x3880/0x6ff0 fs/f2fs/super.c:3684
 mount_bdev+0x32e/0x3f0 fs/super.c:1417
 legacy_get_tree+0x105/0x220 fs/fs_context.c:592
 vfs_get_tree+0x89/0x2f0 fs/super.c:1547
 do_new_mount fs/namespace.c:2896 [inline]
 path_mount+0x12ae/0x1e70 fs/namespace.c:3216
 do_mount fs/namespace.c:3229 [inline]
 __do_sys_mount fs/namespace.c:3437 [inline]
 __se_sys_mount fs/namespace.c:3414 [inline]
 __x64_sys_mount+0x27f/0x300 fs/namespace.c:3414
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46

Add a sanity check for inconsistency among these factors: blkzoned flag, device path and device character, to avoid the above panic.

Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-29 | f2fs: add trace exit in exception path | Zhang Qilong
The trace exit is missing in the exception path of f2fs_sync_dirty_inodes. Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-29 | f2fs: change return value of reserved_segments to unsigned int | Xiaojun Wang
The type of SM_I(sbi)->reserved_segments is unsigned int, so change the return value to unsigned int. The type cast can be removed in reserved_sections as a result. Signed-off-by: Xiaojun Wang <wangxiaojun11@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-14 | f2fs: clean up kvfree | Chao Yu
After commit 0b6d4ca04a86 ("f2fs: don't return vmalloc() memory from f2fs_kmalloc()"), f2fs_k{m,z}alloc() will not return vmalloc()'ed memory, so clean up by using kfree() instead of kvfree() to free memory allocated by those helpers. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-11 | f2fs: change virtual mapping way for compression pages | Daeho Jeong
While profiling f2fs compression, I found that vmap() calls show unexpected spikes in execution time in our test environment, and that they are the bottleneck of the f2fs decompression path. Replacing them with vm_map_ram() improves f2fs decompression speed considerably.

[Verification]
Android Pixel 3 (ARM64, 6GB RAM, 128GB UFS)
Turned on only 0-3 little cores (at 1.785GHz)

dd if=/dev/zero of=dummy bs=1m count=1000
echo 3 > /proc/sys/vm/drop_caches
dd if=dummy of=/dev/zero bs=512k

- w/o compression -
1048576000 bytes (0.9 G) copied, 2.082554 s, 480 M/s
1048576000 bytes (0.9 G) copied, 2.081634 s, 480 M/s
1048576000 bytes (0.9 G) copied, 2.090861 s, 478 M/s

- before patch -
1048576000 bytes (0.9 G) copied, 7.407527 s, 135 M/s
1048576000 bytes (0.9 G) copied, 7.283734 s, 137 M/s
1048576000 bytes (0.9 G) copied, 7.291508 s, 137 M/s

- after patch -
1048576000 bytes (0.9 G) copied, 1.998959 s, 500 M/s
1048576000 bytes (0.9 G) copied, 1.987554 s, 503 M/s
1048576000 bytes (0.9 G) copied, 1.986380 s, 503 M/s

Signed-off-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-11 | f2fs: change return value of f2fs_disable_compressed_file to bool | Daeho Jeong
The returned integer is not required anywhere. So we need to change the return value to bool type. Signed-off-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-11 | f2fs: change i_compr_blocks of inode to atomic value | Daeho Jeong
writepages() can be invoked concurrently for the same file by different threads, such as a thread fsyncing the file and a kworker kernel thread. So changing i_compr_blocks without protection is racy, and we need to protect it by making it an atomic type. Also, we don't need a 64-bit value for i_compr_blocks, so we will just use an atomic value, not atomic64. Signed-off-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-11 | f2fs: trace: fix typo | Chao Yu
Fixes a typo from 'compreesed' to 'compressed'. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-11 | f2fs: ignore compress mount option on image w/o compression feature | Chao Yu
Ignore the compress mount option on an image without the compression feature, to stay consistent with the behavior when the compress mount option is passed to a kernel without the compression feature, so that mount does not fail in this condition. Reported-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-11 | f2fs: Documentation edits/fixes | Randy Dunlap
Correct grammar and spelling. Drop duplicate section for resize.f2fs. Change one occurrence of F2fs to F2FS for consistency. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Chao Yu <yuchao0@huawei.com> Cc: linux-f2fs-devel@lists.sourceforge.net Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-11 | f2fs: allocate proper size memory for zstd decompress | Chao Yu
As 5kft <5kft@5kft.org> reported:

 kworker/u9:3: page allocation failure: order:9, mode:0x40c40(GFP_NOFS|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
 CPU: 3 PID: 8168 Comm: kworker/u9:3 Tainted: G C 5.8.3-sunxi #trunk
 Hardware name: Allwinner sun8i Family
 Workqueue: f2fs_post_read_wq f2fs_post_read_work
 [<c010d6d5>] (unwind_backtrace) from [<c0109a55>] (show_stack+0x11/0x14)
 [<c0109a55>] (show_stack) from [<c056d489>] (dump_stack+0x75/0x84)
 [<c056d489>] (dump_stack) from [<c0243b53>] (warn_alloc+0xa3/0x104)
 [<c0243b53>] (warn_alloc) from [<c024473b>] (__alloc_pages_nodemask+0xb87/0xc40)
 [<c024473b>] (__alloc_pages_nodemask) from [<c02267c5>] (kmalloc_order+0x19/0x38)
 [<c02267c5>] (kmalloc_order) from [<c02267fd>] (kmalloc_order_trace+0x19/0x90)
 [<c02267fd>] (kmalloc_order_trace) from [<c047c665>] (zstd_init_decompress_ctx+0x21/0x88)
 [<c047c665>] (zstd_init_decompress_ctx) from [<c047e9cf>] (f2fs_decompress_pages+0x97/0x228)
 [<c047e9cf>] (f2fs_decompress_pages) from [<c045d0ab>] (__read_end_io+0xfb/0x130)
 [<c045d0ab>] (__read_end_io) from [<c045d141>] (f2fs_post_read_work+0x61/0x84)
 [<c045d141>] (f2fs_post_read_work) from [<c0130b2f>] (process_one_work+0x15f/0x3b0)
 [<c0130b2f>] (process_one_work) from [<c0130e7b>] (worker_thread+0xfb/0x3e0)
 [<c0130e7b>] (worker_thread) from [<c0135c3b>] (kthread+0xeb/0x10c)
 [<c0135c3b>] (kthread) from [<c0100159>]

zstd may allocate a large amount of memory for {,de}compression, which may cause file copy failures on low-end devices that have very little memory. For decompression, let's allocate just the right amount of memory based on the current file's cluster size instead of the maximum cluster size.

Reported-by: 5kft <5kft@5kft.org> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-11 | f2fs: change compr_blocks of superblock info to 64bit | Daeho Jeong
The current compr_blocks in the superblock info is not a 64-bit value. We accumulate each inode's i_compr_blocks count into this value, and those are 64-bit values. So we need to change this to a 64-bit value. Signed-off-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-11 | f2fs: add block address limit check to compressed file | Daeho Jeong
Need to add block address range check to compressed file case and avoid calling get_data_block_bmap() for compressed file. Signed-off-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-11 | f2fs: check position in move range ioctl | Dan Robertson
When the move range ioctl is used, check the input and output positions and ensure that they are non-negative values. Without this check, f2fs_get_dnode_of_data may hit a memory bug. Signed-off-by: Dan Robertson <dan@dlrobertson.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
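The kind of validation described here can be sketched as a small user-space C function. This is a minimal sketch under assumptions, not the kernel's ioctl handler: the function name check_move_range_pos() is hypothetical, and returning -EINVAL for a rejected position is an assumption rather than something the message states.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical validator, not kernel code: reject negative file
 * positions before they are used to derive block indexes.
 * Returning -EINVAL is an assumption made for this sketch.
 */
int check_move_range_pos(int64_t pos_in, int64_t pos_out)
{
    if (pos_in < 0 || pos_out < 0)
        return -EINVAL;
    return 0;
}
```

Checking both positions up front means a negative offset never reaches code that converts it to an unsigned page index.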
2020-09-11 | f2fs: correct statistic of APP_DIRECT_IO/APP_DIRECT_READ_IO | Jack Qiu
We failed to update APP_DIRECT_IO/APP_DIRECT_READ_IO when receiving async DIO. For example: fio -filename=/data/test.0 -bs=1m -ioengine=libaio -direct=1 -name=fill -size=10m -numjobs=1 -iodepth=32 -rw=write Signed-off-by: Jack Qiu <jack.qiu@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-11 | f2fs: Simplify SEEK_DATA implementation | Matthew Wilcox (Oracle)
Instead of finding the first dirty page and then seeing if it matches the index of a block that is NEW_ADDR, delay the lookup of the dirty bit until we've actually found a block that's NEW_ADDR. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-11 | f2fs: support age threshold based garbage collection | Chao Yu
There are several issues in the current background GC algorithm:
- The number of valid blocks is one of the key factors in the cost calculation, so if a segment has few valid blocks, the CB algorithm will still choose it as a victim even if it is young or located in a hot segment, which is not appropriate.
- GCed data/node blocks go to existing logs regardless of whether the update frequency of the data already there is the same, so hot and cold data may be mixed again.
- The GC allocator mainly uses LFS type segments, so it consumes free segments more quickly.

This patch introduces a new algorithm named age threshold based garbage collection to solve the above issues. There are three main steps:

1. select a source victim:
- set an age threshold, and select candidates based on the threshold: e.g. 0 means youngest, 100 means oldest; if we set the age threshold to 80, then dirty segments whose age is in the range [80, 100] are selected as candidates;
- set a candidate_ratio threshold, and select candidates based on the ratio, so that we can shrink the candidates to the oldest segments;
- select the target segment with the fewest valid blocks in order to migrate blocks with minimum cost;

2. select a target victim:
- select candidates based on the age threshold;
- set a candidate_radius threshold, and search for candidates whose age is around the source victim's; the search radius should be less than the radius threshold;
- select the target segment with the most valid blocks in order to avoid migrating the current target segment.

3. merge valid blocks from the source victim into the target victim with the SSR allocator.

Test steps:
- create 160 dirty segments:
 * half of them have 128 valid blocks per segment
 * the rest have 384 valid blocks per segment
- run background GC

Benefit: GC count and block movement count both decrease noticeably:

- Before:
 - Valid: 86
 - Dirty: 1
 - Prefree: 11
 - Free: 6001 (6001)

GC calls: 162 (BG: 220)
 - data segments : 160 (160)
 - node segments : 2 (2)
Try to move 41454 blocks (BG: 41454)
 - data blocks : 40960 (40960)
 - node blocks : 494 (494)
IPU: 0 blocks
SSR: 0 blocks in 0 segments
LFS: 41364 blocks in 81 segments

- After:
 - Valid: 87
 - Dirty: 0
 - Prefree: 4
 - Free: 6008 (6008)

GC calls: 75 (BG: 76)
 - data segments : 74 (74)
 - node segments : 1 (1)
Try to move 12813 blocks (BG: 12813)
 - data blocks : 12544 (12544)
 - node blocks : 269 (269)
IPU: 0 blocks
SSR: 12032 blocks in 77 segments
LFS: 855 blocks in 2 segments

Signed-off-by: Chao Yu <yuchao0@huawei.com> [Jaegeuk Kim: fix a bug along with pinfile in-mem segment & clean up] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
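Step 1 of the algorithm (source victim selection) can be illustrated with a simplified C sketch. It is not the kernel implementation: struct seg_info, select_source_victim() and demo_source_victim() are hypothetical names, ages are assumed to be pre-normalized to the 0..100 scale mentioned above, and the candidate_ratio filter is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-segment summary; age is assumed normalized to 0..100. */
struct seg_info {
    unsigned int age;          /* 0 = youngest, 100 = oldest */
    unsigned int valid_blocks; /* blocks that would have to be migrated */
};

/*
 * Return the index of the dirty segment whose age is at or above the
 * threshold and which has the fewest valid blocks, or -1 if no segment
 * qualifies. The candidate_ratio step is omitted in this sketch.
 */
int select_source_victim(const struct seg_info *segs, size_t nr,
                         unsigned int age_threshold)
{
    int best = -1;

    assert(segs != NULL || nr == 0);
    for (size_t i = 0; i < nr; i++) {
        if (segs[i].age < age_threshold)
            continue; /* too young to be a candidate */
        if (best < 0 || segs[i].valid_blocks < segs[best].valid_blocks)
            best = (int)i;
    }
    return best;
}

/* Small worked example with an age threshold of 80. */
int demo_source_victim(void)
{
    const struct seg_info segs[] = {
        { 90, 384 }, /* old, but expensive to migrate */
        { 85, 128 }, /* old and cheap: expected winner */
        { 50,  10 }, /* cheapest, but below the age threshold */
        { 100, 200 },
    };

    return select_source_victim(segs, sizeof(segs) / sizeof(segs[0]), 80);
}
```

In the example, the segment with only 10 valid blocks is skipped because its age is 50, so the choice falls on index 1. A plain fewest-valid-blocks policy would have picked the young segment instead, which is the behavior the message criticizes.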
2020-09-10 | f2fs: point man pages for some f2fs utils | Jaegeuk Kim
This patch adds some missing contexts related to f2fs-tools in f2fs documentation. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-10 | f2fs: Use generic casefolding support | Daniel Rosenberg
This switches f2fs over to the generic support provided in the previous patch. Since casefolded dentries behave the same in ext4 and f2fs, we decrease the maintenance burden by unifying them, and any optimizations will immediately apply to both. Signed-off-by: Daniel Rosenberg <drosen@google.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-10 | fs: Add standard casefolding support | Daniel Rosenberg
This adds general supporting functions for filesystems that use utf8 casefolding. It provides standard dentry_operations and adds the necessary structures in struct super_block to allow this standardization. The new dentry operations are functionally equivalent to the existing operations in ext4 and f2fs, apart from the use of utf8_casefold_hash to avoid an allocation. By providing a common implementation, all users can benefit from any optimizations without needing to port over improvements. Signed-off-by: Daniel Rosenberg <drosen@google.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-10 | unicode: Add utf8_casefold_hash | Daniel Rosenberg
This adds a case insensitive hash function to allow taking the hash without needing to allocate a casefolded copy of the string. The existing d_hash implementations for casefolding allocate memory within rcu-walk, by avoiding it we can be more efficient and avoid worrying about a failed allocation. Signed-off-by: Daniel Rosenberg <drosen@google.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
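The idea of hashing a name case-insensitively without first allocating a folded copy can be sketched in user-space C. This is only an illustration under loud assumptions: it folds ASCII letters instead of doing Unicode casefolding, it uses FNV-1a as a stand-in hash, and casefold_hash_ascii() is a hypothetical name, not the signature of utf8_casefold_hash().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASCII-only stand-in for real Unicode casefolding (an assumption). */
static unsigned char fold_ascii(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c - 'A' + 'a') : c;
}

/*
 * Hash the name one byte at a time, folding each byte on the fly, so
 * no temporary casefolded buffer has to be allocated.
 * FNV-1a is used here only for illustration.
 */
uint32_t casefold_hash_ascii(const char *name, size_t len)
{
    uint32_t hash = 2166136261u; /* FNV-1a 32-bit offset basis */

    assert(name != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        hash ^= fold_ascii((unsigned char)name[i]);
        hash *= 16777619u; /* FNV-1a 32-bit prime */
    }
    return hash;
}
```

Because folding happens inside the loop, "README" and "readme" hash to the same value while the function never allocates, which is the property that matters when hashing during rcu-walk.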
2020-09-10 | f2fs: compress: use more readable atomic_t type for {cic,dic}.ref | Chao Yu
refcount_t type variable should never be less than one, so it's a little bit hard to understand when we use it to indicate pending compressed page count, let's change to use atomic_t for better readability. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-10 | f2fs: fix compile warning | Chao Yu
This patch fixes below compile warning reported by LKP (kernel test robot) cppcheck warnings: (new ones prefixed by >>) >> fs/f2fs/file.c:761:9: warning: Identical condition 'err', second condition is always false [identicalConditionAfterEarlyExit] return err; ^ fs/f2fs/file.c:753:6: note: first condition if (err) ^ fs/f2fs/file.c:761:9: note: second condition return err; Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-10 | f2fs: support 64-bits key in f2fs rb-tree node entry | Chao Yu
Support a 64-bit key in the f2fs rb-tree node entry, so that we can add a specified entry into the rb-tree with the 64-bit segment time as the key. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-10 | f2fs: inherit mtime of original block during GC | Chao Yu
Don't let f2fs internal GC ruin the original aging degree of a segment. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-10 | f2fs: record average update time of segment | Chao Yu
Previously, once we updated one block in a segment, we updated the segment's mtime to the latest time, which made an aged segment look like the freshest one, so that GC with the cost-benefit algorithm missed such segments. So this patch changes mtime to record the average block update time instead of the last update time. There is no need to reset mtime for a prefree segment: as se->valid_blocks is zero, the old se->mtime carries no weight in the calculation below: se->mtime = div_u64(se->mtime * se->valid_blocks + mtime, se->valid_blocks + 1); Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
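The running-average formula quoted above can be tried out in user-space C. This is a sketch, not the kernel code: update_avg_mtime() is a hypothetical name, plain 64-bit division stands in for div_u64(), and the product is assumed not to overflow 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Fold one block's update time into the segment's average mtime:
 *   new = (old * valid_blocks + block_mtime) / (valid_blocks + 1)
 * Assumption: seg_mtime * valid_blocks fits in 64 bits.
 */
uint64_t update_avg_mtime(uint64_t seg_mtime, uint32_t valid_blocks,
                          uint64_t block_mtime)
{
    return (seg_mtime * valid_blocks + block_mtime) /
           ((uint64_t)valid_blocks + 1);
}
```

For a segment with 3 valid blocks and an average mtime of 1000, a block updated at time 2000 moves the average to (3000 + 2000) / 4 = 1250 rather than jumping straight to 2000. With valid_blocks equal to 0, as for a prefree segment, the old mtime is multiplied by zero and the result is just the new block's time.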
2020-09-10 | f2fs: introduce inmem curseg | Chao Yu
The previous implementation of aligned pinfile allocation will:
- allocate a new segment on the cold data log no matter whether the last used segment is partially used or not, which makes IOs more random;
- force concurrent cold data/GCed IO to go into the warm data area, which can harm hot/cold data separation.

In this patch, we introduce a new type of log named 'inmem curseg'. The differences from a normal curseg are:
- it reuses an existing segment type (CURSEG_XXX_NODE/DATA);
- it only exists in memory; its segno, blkofs and summary will not be persisted into the checkpoint area.

With this new feature, we can enhance the scalability of logs, and special allocators can be created for these purposes:
- pure lfs allocator for aligned pinfile allocation or file defragmentation
- pure ssr allocator for a later feature

So let's update aligned pinfile allocation to use this new inmem curseg framework.

Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-10 | f2fs: compress: remove unneeded code | Chao Yu
- f2fs_write_multi_pages - f2fs_compress_pages - init_compress_ctx - compress_pages - destroy_compress_ctx --- 1 - f2fs_write_compressed_pages - destroy_compress_ctx --- 2 destroy_compress_ctx() in f2fs_write_multi_pages() is redundant, remove it. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-10 | f2fs: remove duplicated type casting | Xiaojun Wang
Since DUMMY_WRITTEN_PAGE and ATOMIC_WRITTEN_PAGE have already been converted to unsigned long type, we don't need to do the type casting again. Signed-off-by: Xiaojun Wang <wangxiaojun11@huawei.com> Reported-by: Jack Qiu <jack.qiu@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-10 f2fs: support zone capacity less than zone size (Aravind Ramesh)
NVMe Zoned Namespace devices can have a zone-capacity less than the zone-size. Zone-capacity indicates the maximum number of sectors that are usable in a zone, beginning from the first sector of the zone. This makes the sectors after the zone-capacity, up to the zone-size, unusable.

This patch set tracks zone-size and zone-capacity in zoned devices and calculates the usable blocks per segment and usable segments per section.

If zone-capacity is less than zone-size, mark only those segments which start before zone-capacity as free segments. All segments at and beyond zone-capacity are treated as permanently used segments. In cases where zone-capacity does not align with segment size, the last segment will start before zone-capacity and end beyond the zone-capacity of the zone. For such spanning segments, only sectors within the zone-capacity are used.

During writes and GC, manage the usable segments in a section and usable blocks per segment. Segments which are beyond zone-capacity are never allocated and do not need to be garbage collected; only the segments which are before zone-capacity need to be garbage collected. For spanning segments, based on the number of usable blocks in that segment, write to blocks only up to zone-capacity.

Zone-capacity is device specific and cannot be configured by the user. Since NVMe ZNS device zones are sequential-write-only, a block device with conventional zones or any normal block device is needed along with the ZNS device for the metadata operations of F2FS.

A typical nvme-cli output of a zoned device shows zone start, capacity and write pointer as below:

SLBA: 0x0     WP: 0x0     Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ
SLBA: 0x20000 WP: 0x20000 Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ
SLBA: 0x40000 WP: 0x40000 Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ

Here zone size is 64MB, capacity is 49MB, and WP is at zone start as the zones are in EMPTY state. For each zone, only the first 49MB from the zone start is usable; any LBA/sector after 49MB cannot be read or written, and the drive will fail any attempt to read or write it. So, the second zone starts at 64MB and is usable till 113MB (64 + 49), and the range between 113MB and 128MB is again unusable. The next zone starts at 128MB, and so on.

Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
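The segment arithmetic described in this message can be worked through for the nvme-cli example. The sketch below is illustrative only and is not the code from the patch: the helper names are hypothetical, and it assumes 512-byte sectors, 4KB blocks and 2MB segments (512 blocks per segment).

```c
#include <stdint.h>

#define SKETCH_SECTOR_SIZE    512u  /* assumed LBA size */
#define SKETCH_BLOCK_SIZE     4096u /* assumed block size */
#define SKETCH_BLOCKS_PER_SEG 512u  /* assumed 2MB segment */

/* Zone capacity converted from sectors to blocks. */
static uint64_t cap_in_blocks(uint64_t cap_sectors)
{
	return cap_sectors * SKETCH_SECTOR_SIZE / SKETCH_BLOCK_SIZE;
}

/* Segments that start before zone-capacity (includes a spanning one). */
static uint32_t usable_segs_in_zone(uint64_t cap_sectors)
{
	uint64_t blocks = cap_in_blocks(cap_sectors);

	return (uint32_t)((blocks + SKETCH_BLOCKS_PER_SEG - 1) /
			  SKETCH_BLOCKS_PER_SEG);
}

/* Usable blocks in the segment at index 'seg' within its zone. */
static uint32_t usable_blks_in_seg(uint64_t cap_sectors, uint32_t seg)
{
	uint64_t blocks = cap_in_blocks(cap_sectors);
	uint64_t start = (uint64_t)seg * SKETCH_BLOCKS_PER_SEG;

	if (start >= blocks)
		return 0;                     /* at or beyond capacity */
	if (blocks - start >= SKETCH_BLOCKS_PER_SEG)
		return SKETCH_BLOCKS_PER_SEG; /* fully inside capacity */
	return (uint32_t)(blocks - start);    /* spanning segment */
}
```

Under these assumptions, Cap 0x18800 is 12544 blocks (49MB). Segments 0 to 23 are fully usable, segment 24 starts at 48MB and spans the capacity boundary with 256 usable blocks, and segments 25 to 31 of the 64MB zone are never allocated.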
2020-09-10 Merge tag 'f2fs-for-5.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs (Linus Torvalds)
Pull f2fs fixes from Jaegeuk Kim:
 "Small bug fixes for:
  - SMR drive fix
  - infinite loop when building free node ids
  - EOF at DIO read"

* tag 'f2fs-for-5.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
  f2fs: Return EOF on unaligned end of file DIO read
  f2fs: fix indefinite loop scanning for free nid
  f2fs: Fix type of section block count variables
2020-09-09 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 (Linus Torvalds)
Pull crypto fix from Herbert Xu:
 "This fixes a regression in padata"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  padata: fix possible padata_works_lock deadlock
2020-09-09 Merge tag 'nfs-for-5.9-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs (Linus Torvalds)
Pull NFS client bugfixes from Trond Myklebust:
 - Fix an NFS/RDMA resource leak
 - Fix the error handling during delegation recall
 - NFSv4.0 needs to return the delegation on a zero-stateid SETATTR
 - Stop printk reading past end of string

* tag 'nfs-for-5.9-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  SUNRPC: stop printk reading past end of string
  NFS: Zero-stateid SETATTR should first return delegation
  NFSv4.1 handle ERR_DELAY error reclaiming locking state on delegation recall
  xprtrdma: Release in-flight MRs on disconnect
2020-09-08 f2fs: Return EOF on unaligned end of file DIO read (Gabriel Krisman Bertazi)
Reading past end of file returns EOF for aligned reads but -EINVAL for unaligned reads on f2fs. While documentation is not strict about this corner case, most filesystems return EOF in this case, like iomap filesystems. This patch consolidates the behavior for f2fs by making it return EOF (0).

It can be verified by a read loop on a file that does a partial read before EOF (a file that doesn't end at an aligned address). The following code fails on an unaligned file on f2fs, but not on btrfs, ext4, and xfs.

	while (done < total) {
		ssize_t delta = pread(fd, buf + done, total - done, off + done);
		if (!delta)
			break;
		...
	}

It is arguable whether filesystems should actually return EOF or -EINVAL, but since iomap filesystems support it, and so does the original DIO code, it seems reasonable to consolidate on that.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
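The read loop from this message can be filled out into a small helper. This is a sketch, not the reporter's test program: `read_until_eof` is a hypothetical name, and the part elided with "..." in the original is assumed here to be ordinary error handling plus advancing `done`. For a direct I/O test the caller would open the file with O_DIRECT and pass a suitably aligned buffer; the helper itself does not depend on that.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read up to 'total' bytes starting at offset 'off', stopping early
 * when pread() reports EOF by returning 0.
 * Returns the number of bytes read, or -1 on error (errno is set).
 */
static ssize_t read_until_eof(int fd, char *buf, size_t total, off_t off)
{
	size_t done = 0;

	while (done < total) {
		ssize_t delta = pread(fd, buf + done, total - done,
				      off + (off_t)done);

		if (delta == 0)
			break;             /* EOF: the case this patch fixes */
		if (delta < 0) {
			if (errno == EINTR)
				continue;  /* interrupted, retry */
			return -1;         /* e.g. EINVAL before the fix */
		}
		done += (size_t)delta;
	}
	return (ssize_t)done;
}
```

On a file whose size is not aligned, the last successful pread() returns a short count and the next call starts exactly at end of file. A filesystem that returns 0 there lets the loop finish with the bytes read so far, while one that returns -EINVAL makes the helper fail.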