Age | Commit message (Collapse) | Author |
|
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Replace another case where the layout 'plh_block_lgets' can trigger
infinite loops in send_layoutget().
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
If the server reboots while there is a layoutget outstanding, then
the call to pnfs_choose_layoutget_stateid() will fail with an EAGAIN
error, which causes an infinite loop in send_layoutget(). The reason
why we never break out of the loop is that the layout 'plh_block_lgets'
field is never cleared.
Fix is to replace plh_block_lgets with NFS_LAYOUT_INVALID_STID, which
can be reset after a new layoutget.
Fixes: ab7d763e477c5 ("pNFS: Ensure nfs4_layoutget_prepare returns...")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
When setting the layout return mode, we must always also set the
NFS_LAYOUT_RETURN_REQUESTED flag to ensure that we send a layoutreturn.
Otherwise pnfs_error_mark_layout_for_return() could set the mode, but
fail to send the layoutreturn because another is already in flight.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
We don't need to schedule a layoutreturn if the layout segment can
be freed immediately.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
NFS_LAYOUT_RETURN_BEFORE_CLOSE is being used to signal that a
layoutreturn is needed, either due to a layout recall or to a
layout error. Rename it to NFS_LAYOUT_RETURN_REQUESTED in order
to clarify its purpose.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
The layoutreturn code currently relies on pnfs_put_lseg() to initiate the
RPC call when conditions are right. A problem arises when we want to
free the layout segment from inside an inode->i_lock section (e.g. in
pnfs_clear_request_commit()), since we cannot sleep.
The workaround is to move the actual call to pnfs_send_layoutreturn()
to pnfs_put_layout_hdr(), which doesn't have this restriction.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
* pnfs_generic:
NFSv4.1/pNFS: Cleanup constify struct pnfs_layout_range arguments
NFSv4.1/pnfs: Cleanup copying of pnfs_layout_range structures
NFSv4.1/pNFS: Cleanup pnfs_mark_matching_lsegs_invalid()
NFSv4.1/pNFS: Fix a race in initiate_file_draining()
NFSv4.1/pNFS: pnfs_error_mark_layout_for_return() must always return layout
NFSv4.1/pNFS: pnfs_mark_matching_lsegs_return() should set the iomode
NFSv4.1/pNFS: Use nfs4_stateid_copy for copying stateids
NFSv4.1/pNFS: Don't pass stateids by value to pnfs_send_layoutreturn()
NFS: Relax requirements in nfs_flush_incompatible
NFSv4.1/pNFS: Don't queue up a new commit if the layout segment is invalid
NFS: Allow multiple commit requests in flight per file
NFS/pNFS: Fix up pNFS write reschedule layering violations and bugs
NFSv4: List stateid information in the callback tracepoints
NFSv4.1/pNFS: Don't return NFS4ERR_DELAY unnecessarily in CB_LAYOUTRECALL
NFSv4.1/pNFS: Ensure we enforce RFC5661 Section 12.5.5.2.1
pNFS: If we have to delay the layout callback, mark the layout for return
NFSv4.1/pNFS: Add a helper to mark the layout as returned
pNFS: Ensure nfs4_layoutget_prepare returns the correct error
|
|
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Make it more obvious what we're returning...
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Fix a bug whereby if all the layout segments could be immediately freed,
the call to pnfs_error_mark_layout_for_return() would never result in
a layoutreturn.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
If pnfs_mark_matching_lsegs_return() needs to mark a layout segment for
return, then it must also set the return iomode.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
A stateid is a structure, pass it as a pointer.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
If the layout segment is invalid, then we should not be adding more
write requests to the commit list. Instead, those writes should be
replayed after requesting a new layout.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
If the client needs to delay the layout callback, then speed up the recall
process by marking the remaining layout segments to be actively returned
by the client.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
This ensures that we don't reuse the stateid if a layout return or
implied layout return means that we've returned all layout segments
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
In pNFS/flexfiles, we want to return the layout without necessarily marking
it as having completely failed. We therefore move the call to
pnfs_layout_io_set_failed() out of pnfs_error_mark_layout_for_return(),
and then ensura that pNFS/files layout calls it separately.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Fix a bug in which flexfiles clients are falling back to I/O through the
MDS even when the FF_FLAGS_NO_IO_THRU_MDS flag is set.
The flexfiles client will always report errors through the LAYOUTRETURN
and/or LAYOUTERROR mechanisms, so it should normally be safe for it
to retry the LAYOUTGET until it fails or succeeds.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
When we fail to queue a read page to IO descriptor,
we need to clean it up otherwise it is hanging around
preventing nfs module from being removed.
When we fail to queue a write page to IO descriptor,
we need to clean it up and also save the failure status
to open context. Then at file close, we can try to write
pages back again and drop the page if it fails to writeback
in .launder_page, which will be done in the next patch.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
In case we fail during setting things up for read/write IO, set
pg_error in IO descriptor and do the cleanup in nfs_pageio_add_request,
where we clean up all pages that are still hanging around on the IO
descriptor.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
For ERESTARTSYS/EIO/EROFS/ENOSPC/E2BIG in layoutget, we
should just bail out instead of hiding the error and
retrying inband IO.
Change all the call sites to pop the error all the way up.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Instead of displaying a layout segment pointer in these tracepoints,
let's use the layout stateid, now that Olga gave us a set of tools for
displaying them.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
pnfs_update_layout is really the "nexus" of layout handling. If it
returns NULL then we end up going through the MDS. This patch adds
some tracepoints to that function that allow us to determine the
cause when we end up going through the MDS unexpectedly.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Jan Stancek reported that I wrecked things for him by fixing things for
Vladimir :/
His report was due to an UNINTERRUPTIBLE wait getting -EINTR, which
should not be possible, however my previous patch made this possible by
unconditionally checking signal_pending().
We cannot use current->state as was done previously, because the
instruction after the store to that variable it can be changed. We must
instead pass the initial state along and use that.
Fixes: 68985633bccb ("sched/wait: Fix signal handling in bit wait helpers")
Reported-by: Jan Stancek <jstancek@redhat.com>
Reported-by: Chris Mason <clm@fb.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Tested-by: Vladimir Murzin <vladimir.murzin@arm.com>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Paul Turner <pjt@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: tglx@linutronix.de
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: hpa@zytor.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
pnfs_layout_process will check the returned layout stateid against what
the kernel has in-core. If it turns out that the stateid we received is
older, then we should resend the LAYOUTGET instead of falling back to
MDS I/O.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Cc: stable@vger.kernel.org # 3.18+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
If non rpc-based layout driver return bad length of data, nfs retries
by calling rpc_restart_call_prepare() that cause an NULL reference panic.
This patch lets nfs retry through MDS for non rpc-based layout driver
return bad length of data.
[13034.883329] BUG: unable to handle kernel NULL pointer dereference at (null)
[13034.884902] IP: [<ffffffffa00db372>] rpc_restart_call_prepare+0x62/0x90 [sunrpc]
[13034.886558] PGD 0
[13034.888126] Oops: 0000 [#1] KASAN
[13034.889710] Modules linked in: blocklayoutdriver(OE) nfsv4(OE) nfs(OE) fscache(E) nfsd(OE) xfs libcrc32c coretemp btrfs crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ppdev vmw_balloon auth_rpcgss shpchp nfs_acl lockd vmw_vmci parport_pc xor raid6_pq grace parport sunrpc i2c_piix4 vmwgfx drm_kms_helper ttm drm mptspi e1000 serio_raw scsi_transport_spi mptscsih mptbase ata_generic pata_acpi [last unloaded: fscache]
[13034.898260] CPU: 0 PID: 10112 Comm: kworker/0:1 Tainted: G OE 4.3.0-rc5+ #279
[13034.899932] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[13034.903342] Workqueue: events bl_read_cleanup [blocklayoutdriver]
[13034.905059] task: ffff88006a9148c0 ti: ffff880035e90000 task.ti: ffff880035e90000
[13034.906827] RIP: 0010:[<ffffffffa00db372>] [<ffffffffa00db372>] rpc_restart_call_prepare+0x62/0x90 [sunrpc]
[13034.910522] RSP: 0018:ffff880035e97b58 EFLAGS: 00010282
[13034.912378] RAX: fffffbfff04a5a94 RBX: ffff880068fe4858 RCX: 0000000000000003
[13034.914339] RDX: dffffc0000000000 RSI: 0000000000000003 RDI: 0000000000000282
[13034.916236] RBP: ffff880035e97b68 R08: 0000000000000001 R09: 0000000000000001
[13034.918229] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[13034.920007] R13: ffff880068fe4858 R14: ffff880068fe4a60 R15: 0000000000001000
[13034.921845] FS: 0000000000000000(0000) GS:ffffffff82247000(0000) knlGS:0000000000000000
[13034.923645] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13034.925525] CR2: 0000000000000000 CR3: 00000000063dd000 CR4: 00000000001406f0
[13034.932808] Stack:
[13034.934813] ffff880068fe4780 0000000000001000 ffff880035e97ba8 ffffffffa08800d2
[13034.936675] ffffffffa088029d ffff880068fe4780 ffff880068fe4858 ffffffffa089c0a0
[13034.938593] ffff880068fe47e0 ffff88005d59faf0 ffff880035e97be0 ffffffffa087e08f
[13034.940454] Call Trace:
[13034.942388] [<ffffffffa08800d2>] nfs_readpage_result+0x112/0x200 [nfs]
[13034.944317] [<ffffffffa088029d>] ? nfs_readpage_done+0xdd/0x160 [nfs]
[13034.946267] [<ffffffffa087e08f>] nfs_pgio_result+0x9f/0x120 [nfs]
[13034.948166] [<ffffffffa09266cc>] pnfs_ld_read_done+0x7c/0x1e0 [nfsv4]
[13034.950247] [<ffffffffa03b07ee>] bl_read_cleanup+0x2e/0x60 [blocklayoutdriver]
[13034.952156] [<ffffffff810ebf62>] process_one_work+0x412/0x870
[13034.954102] [<ffffffff810ebe84>] ? process_one_work+0x334/0x870
[13034.955949] [<ffffffff810ebb50>] ? queue_delayed_work_on+0x40/0x40
[13034.957985] [<ffffffff810ec441>] worker_thread+0x81/0x6a0
[13034.959817] [<ffffffff810ec3c0>] ? process_one_work+0x870/0x870
[13034.961785] [<ffffffff810f43bd>] kthread+0x17d/0x1a0
[13034.963544] [<ffffffff810f4240>] ? kthread_create_on_node+0x330/0x330
[13034.965479] [<ffffffff81100428>] ? finish_task_switch+0x88/0x220
[13034.967223] [<ffffffff810f4240>] ? kthread_create_on_node+0x330/0x330
[13034.968929] [<ffffffff81b6ae5f>] ret_from_fork+0x3f/0x70
[13034.970534] [<ffffffff810f4240>] ? kthread_create_on_node+0x330/0x330
[13034.972176] Code: c7 43 50 40 84 0d a0 e8 3d fe 1c e1 48 8d 7b 58 c7 83 e4 00 00 00 00 00 00 00 e8 ca fe 1c e1 4c 8b 63 58 4c 89 e7 e8 be fe 1c e1 <49> 83 3c 24 00 74 12 48 c7 43 50 f0 a2 0e a0 b8 01 00 00 00 5b
[13034.977148] RIP [<ffffffffa00db372>] rpc_restart_call_prepare+0x62/0x90 [sunrpc]
[13034.978780] RSP <ffff880035e97b58>
[13034.980399] CR2: 0000000000000000
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
If we send a layoutreturn asynchronously before close, the close
might reach server first and layoutreturn would fail with BADSTATEID
because there is nothing keeping the layout stateid alive.
Also do not pretend sending layoutreturn if we are not.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
If we have a read layout, then sanity check the minimal layout length
so that it does not extend beyond the end of file.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
It's not sufficient to just mark the layout segment for layout return. We
also need to set the NFS_LAYOUT_RETURN_BEFORE_CLOSE flag in the layout header.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
This is needed in order to allow merging of contiguous layout segments,
and also to correct the ordering of layouts for those device drivers that
don't necessarily want to place the read-write layouts first.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Allow advanced users to set the layoutstats timer in order to lengthen
or shorten the period between layoutstat transmissions to the server.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
If we return delegation before closing, we fail to do roc check
during close because NFS_LAYOUT_ROC is cleared by delegreturn
and it causes layouts to be still hanging around after delegreturn
+ close, which is a voilation against protocol.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Allow tracing of return-on-close.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
If layout is marked by NFS_LAYOUT_RETURN_BEFORE_CLOSE, we should always
send LAYOUTRETURN before close, and we don't need to do ROC drain if we
do send LAYOUTRETURN.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
The helper pnfs_roc() has already verified that we have no delegations,
and no further open files, hence no outstanding I/O and it has marked
all the return-on-close lsegs as being invalid.
Furthermore, it sets the NFS_LAYOUT_RETURN bit, thus serialising the
close/delegreturn with all future layoutget calls on this inode.
The checks in pnfs_roc_drain() for valid layout segments are therefore
redundant: those cannot exist until another layoutget completes.
The other check for whether or not NFS_LAYOUT_RETURN is set, actually
causes a hang, since we already know that we hold that flag.
To fix, we therefore strip out all the functionality in pnfs_roc_drain()
except the retrieval of the barrier state, and then rename the function
accordingly.
Reported-by: Christoph Hellwig <hch@infradead.org>
Fixes: 5c4a79fb2b1c ("Don't prevent layoutgets when doing return-on-close")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
pnfs_clear_layoutreturn_waitbit() should already be calling
rpc_wake_up(&NFS_SERVER(ino)->roc_rpcwaitq) for us.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
layoutget now should already be serialised w.r.t. layout returns
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
The NFS_LAYOUT_RETURN bit already suffices to ensure that layoutget
is blocked.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
If there is an outstanding return-on-close, then we just want new
layoutget requests to wait rather than fail.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
We should always test for outstanding layout returns, whether or not
pnfs_should_retry_layoutget() is true.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
If there are no valid layout segments, then we should already have
checked in pnfs_update_layout() whether or not this is the first
layoutget.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Prevent a potential deadlock.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
"data" is currently leaked when the prepare_layoutcommit operation
returns an error. Put the cred before taking the spinlock in that
case, take the lock and then goto out_unlock which will drop the
lock and then free "data".
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
It is OK for layout segments to remain hashed even if no-one holds any
references to them, provided that the segments are still valid.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
If a process reopens the file before we can send off the CLOSE/DELEGRETURN,
then pnfs_roc_drain() may end up waiting for a new set of layout segments
that are marked as return-on-close, but haven't yet been returned.
Fix this by only waiting for those layout segments that were invalidated in
pnfs_roc().
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|