The flexfiles layout in particular, seems to want to poke around in the
O_DIRECT flags when retransmitting.
This patch sets up an interface to allow it to call back into O_DIRECT
to handle retransmission correctly. It also fixes a potential bug whereby
we could change the behaviour of O_DIRECT if an error is already pending.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When we fail to queue a read page to IO descriptor,
we need to clean it up otherwise it is hanging around
preventing nfs module from being removed.
When we fail to queue a write page to IO descriptor,
we need to clean it up and also save the failure status
to open context. Then at file close, we can try to write
pages back again and drop the page if it fails to writeback
in .launder_page, which will be done in the next patch.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Operations to which stateid information is added:
close, delegreturn, open, read, setattr, layoutget, layoutcommit, test_stateid,
write, lock, locku, lockt
Format is "stateid=<seqid>:<crc32 hash stateid.other>", also "openstateid=",
"layoutstateid=", and "lockstateid=" for open_file, layoutget, set_lock
tracepoints.
New function is added to internal.h, nfs_stateid_hash(), to compute the hash
trace_nfs4_setattr() is moved from nfs4_do_setattr() to _nfs4_do_setattr()
to get access to stateid.
trace_nfs4_setattr and trace_nfs4_delegreturn are changed from INODE_EVENT
to new event type, INODE_STATEID_EVENT which is same as INODE_EVENT but adds
stateid information
for locking tracepoints, moved trace_nfs4_set_lock() into _nfs4_do_setlk()
to get access to stateid information, and removed trace_nfs4_lock_reclaim(),
trace_nfs4_lock_expired() as they call into _nfs4_do_setlk() and both were
previously same LOCK_EVENT type.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Jan Stancek reported that I wrecked things for him by fixing things for
Vladimir :/
His report was due to an UNINTERRUPTIBLE wait getting -EINTR, which
should not be possible, however my previous patch made this possible by
unconditionally checking signal_pending().
We cannot use current->state as was done previously, because the
instruction after the store to that variable it can be changed. We must
instead pass the initial state along and use that.
Fixes: 68985633bc ("sched/wait: Fix signal handling in bit wait helpers")
Reported-by: Jan Stancek <jstancek@redhat.com>
Reported-by: Chris Mason <clm@fb.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Tested-by: Vladimir Murzin <vladimir.murzin@arm.com>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Paul Turner <pjt@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: tglx@linutronix.de
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: hpa@zytor.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The NFSv4 delegation spec allows the server to tell a client to limit how
much data it cache after the file is closed. In return, the server
guarantees enough free space to avoid ENOSPC situations, etc.
Prior to this patch, we assumed we could always cache aggressively after
close. Unfortunately, this causes problems with servers that set the
limit to 0 and therefore do not offer any ENOSPC guarantees.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
They already exist and do the exact same thing. Let's save ourselves
several lines of code!
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
pnfs_layout_mark_request_commit() needs to ensure that it adds the
request to the commit list atomically with all the other updates
in order to prevent corruption to buckets[ds_commit_idx].wlseg
due to races with pnfs_generic_clear_request_commit().
Fixes: 338d00cfef ("pnfs: Refactor the *_layout_mark_request_commit...")
Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Highlights include:
Stable patches:
- Fix a situation where the client uses the wrong (zero) stateid.
- Fix a memory leak in nfs_do_recoalesce
Bugfixes:
- Plug a memory leak when ->prepare_layoutcommit fails
- Fix an Oops in the NFSv4 open code
- Fix a backchannel deadlock
- Fix a livelock in sunrpc when sendmsg fails due to low memory availability
- Don't revalidate the mapping if both size and change attr are up to date
- Ensure we don't miss a file extension when doing pNFS
- Several fixes to handle NFSv4.1 sequence operation status bits correctly
- Several pNFS layout return bugfixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJVt6RGAAoJEGcL54qWCgDyiDIP/2+fUM7Tc1llCxYbM2WLC6Ar
34v5yVwO96MqhI4L2mXB5FJvr4LP2/EZ4ZExMcf4ymT7pgJnjFK4nEv9IHUSy6xb
ea+oS9GjvFSeGdkukJLRniNER5/ZG3GWkojlHNJCgByoIVRK4ISXF/qL9w2sedGw
+5ejvjqie9NmBnBXMq8DRlU+kXhVYCF6E9qWATwUNK5Eq2eeQnDbA2w9ACSBVK3W
LhCvZi0eBq7krSbHob018PmlQ0VPvmYwk5xL4d//FvcaNj/utk82VjAZCdKOK1sH
qn8hcKgVeVko/3jwcUp6m3zAkKZ1IX/XaXJeHbosnKG/g0vy3hQirpa/g2iDTQ4H
NXOSwcsd6syReZDZbQTxbvaSOp5ACxZAQKYLnlPerJ/hMpXDQCEAwyeAFKzEaKz4
FfF0VJF+30w9PJk3wgk2DF66xbYVfHyvrLtVcb/ki8gb91cH09i+nFFSSfHQBMLh
+ciHg7rOyXnbXoCaW9fBvONz2sCYDwbHATmhpWWZIx/3UTDf5owxHFa3BFDgGKnD
jyiPjMh6I3JUE+Qm1zwInsfsskBKRSl2BdJgTHBGY5ODuQGF/sogOmvgbrT7Ox3t
kbL8nzCydqLixM+4aw61nYakZqgDsKNER5Ggr+lkv4AZ2dH6IeP2IZjuoHLLylvZ
dyqHwpCjoUtmYAUr166U
=wlUD
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.2-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
Stable patches:
- Fix a situation where the client uses the wrong (zero) stateid.
- Fix a memory leak in nfs_do_recoalesce
Bugfixes:
- Plug a memory leak when ->prepare_layoutcommit fails
- Fix an Oops in the NFSv4 open code
- Fix a backchannel deadlock
- Fix a livelock in sunrpc when sendmsg fails due to low memory
availability
- Don't revalidate the mapping if both size and change attr are up to
date
- Ensure we don't miss a file extension when doing pNFS
- Several fixes to handle NFSv4.1 sequence operation status bits
correctly
- Several pNFS layout return bugfixes"
* tag 'nfs-for-4.2-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (28 commits)
nfs: Fix an oops caused by using other thread's stack space in ASYNC mode
nfs: plug memory leak when ->prepare_layoutcommit fails
SUNRPC: Report TCP errors to the caller
sunrpc: translate -EAGAIN to -ENOBUFS when socket is writable.
NFSv4.2: handle NFS-specific llseek errors
NFS: Don't clear desc->pg_moreio in nfs_do_recoalesce()
NFS: Fix a memory leak in nfs_do_recoalesce
NFS: nfs_mark_for_revalidate should always set NFS_INO_REVAL_PAGECACHE
NFS: Remove the "NFS_CAP_CHANGE_ATTR" capability
NFS: Set NFS_INO_REVAL_PAGECACHE if the change attribute is uninitialised
NFS: Don't revalidate the mapping if both size and change attr are up to date
NFSv4/pnfs: Ensure we don't miss a file extension
NFSv4: We must set NFS_OPEN_STATE flag in nfs_resync_open_stateid_locked
SUNRPC: xprt_complete_bc_request must also decrement the free slot count
SUNRPC: Fix a backchannel deadlock
pNFS: Don't throw out valid layout segments
pNFS: pnfs_roc_drain() fix a race with open
pNFS: Fix races between return-on-close and layoutreturn.
pNFS: pnfs_roc_drain should return 'true' when sleeping
pNFS: Layoutreturn must invalidate all existing layout segments.
...
Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
and the role of the separation is unclear. For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi. To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.
This patch moves bdi->bdi_stat[] into wb.
* enum bdi_stat_item is renamed to wb_stat_item and the prefix of all
enums is changed from BDI_ to WB_.
* BDI_STAT_BATCH() -> WB_STAT_BATCH()
* [__]{add|inc|dec|sum}_wb_stat(bdi, ...) -> [__]{add|inc}_wb_stat(wb, ...)
* bdi_stat[_error]() -> wb_stat[_error]()
* bdi_writeout_inc() -> wb_writeout_inc()
* stat init is moved to bdi_wb_init() and bdi_wb_exit() is added and
frees stat.
* As there's still only one bdi_writeback per backing_dev_info, all
uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
introducing no behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Ensure that other operations that race with our write RPC calls
cannot revert the file size updates that were made on the server.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Pull backing device changes from Jens Axboe:
"This contains a cleanup of how the backing device is handled, in
preparation for a rework of the life time rules. In this part, the
most important change is to split the unrelated nommu mmap flags from
it, but also removing a backing_dev_info pointer from the
address_space (and inode), and a cleanup of other various minor bits.
Christoph did all the work here, I just fixed an oops with pages that
have a swap backing. Arnd fixed a missing export, and Oleg killed the
lustre backing_dev_info from staging. Last patch was from Al,
unexporting parts that are now no longer needed outside"
* 'for-3.20/bdi' of git://git.kernel.dk/linux-block:
Make super_blocks and sb_lock static
mtd: export new mtd_mmap_capabilities
fs: make inode_to_bdi() handle NULL inode
staging/lustre/llite: get rid of backing_dev_info
fs: remove default_backing_dev_info
fs: don't reassign dirty inodes to default_backing_dev_info
nfs: don't call bdi_unregister
ceph: remove call to bdi_unregister
fs: remove mapping->backing_dev_info
fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
nilfs2: set up s_bdi like the generic mount_bdev code
block_dev: get bdev inode bdi directly from the block device
block_dev: only write bdev inode on close
fs: introduce f_op->mmap_capabilities for nommu mmap support
fs: kill BDI_CAP_SWAP_BACKED
fs: deduplicate noop_backing_dev_info
If we have to do a return-on-close in the delegreturn code, then
we must ensure that the inode and super block remain referenced.
Cc: Peng Tao <tao.peng@primarydata.com>
Cc: stable@vger.kernel.org # 3.17.x
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Peng Tao <tao.peng@primarydata.com>
* flexfiles: (53 commits)
pnfs: lookup new lseg at lseg boundary
nfs41: .init_read and .init_write can be called with valid pg_lseg
pnfs: Update documentation on the Layout Drivers
pnfs/flexfiles: Add the FlexFile Layout Driver
nfs: count DIO good bytes correctly with mirroring
nfs41: wait for LAYOUTRETURN before retrying LAYOUTGET
nfs: add a helper to set NFS_ODIRECT_RESCHED_WRITES to direct writes
nfs41: add NFS_LAYOUT_RETRY_LAYOUTGET to layout header flags
nfs/flexfiles: send layoutreturn before freeing lseg
nfs41: introduce NFS_LAYOUT_RETURN_BEFORE_CLOSE
nfs41: allow async version layoutreturn
nfs41: add range to layoutreturn args
pnfs: allow LD to ask to resend read through pnfs
nfs: add nfs_pgio_current_mirror helper
nfs: only reset desc->pg_mirror_idx when mirroring is supported
nfs41: add a debug warning if we destroy an unempty layout
pnfs: fail comparison when bucket verifier not set
nfs: mirroring support for direct io
nfs: add mirroring support to pgio layer
pnfs: pass ds_commit_idx through the commit path
...
Conflicts:
fs/nfs/pnfs.c
fs/nfs/pnfs.h
Let it return current nfs_pgio_mirror in use depending on pg_mirror_count.
For read, we always use pg_mirrors[0], so this effectively gives us freedom
to use pg_mirror_idx to track the actual mirror to read from through out the
IO stack.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
so that we don't reset desc->pg_mirror_idx for read unnecessarily.
Remove WARN_ON_ONCE from __nfs_pageio_add_request to allow LD to
set pg_mirror_idx for read where pg_mirror_count is always 1.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
This patch adds mirrored write support to the pgio layer. The default
is to use one mirror, but pgio callers may define callbacks to change
this to any value up to the (arbitrarily selected) limit of 16.
The basic idea is to break out members of nfs_pageio_descriptor that cannot
be shared between mirrored DSes and put them in a new structure.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Pass ds_commit_idx through the nfs commit path. It's used to select
the commit bucket when using pnfs and is ignored when not using pnfs.
Several functions had to be changed: nfs_retry_commit,
nfs_mark_request_commit, pnfs_mark_request_commit and the pnfs layout
driver .mark_request_commit functions.
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
so that flexfile layout client can pass in DS credential instead of
using user cred, which will be done in the next patch.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
pnfs flexfile layout client may want to use NFSv3 ops rather
than the default MDS v4 ops.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
flexfile layout may need to set such when making DS connections.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
The flexfiles layout wants to create DS connection over NFSv3.
Add nfs3_set_ds_client to allow that to happen.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
flexfile layout may use different auth flavor as specified by MDS.
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
This function call was being optimized out during nfs_fhget(), leading
to situations where we have a valid fileid but still want to use the
mounted_on_fileid. For example, imagine we have our server configured
like this:
server % df
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 9.1G 6.5G 1.9G 78% /
/dev/vdb1 487M 2.3M 456M 1% /exports
/dev/vdc1 487M 2.3M 456M 1% /exports/vol1
/dev/vdd1 487M 2.3M 456M 1% /exports/vol2
If our client mounts /exports and tries to do a "chown -R" across the
entire mountpoint, we will get a nasty message warning us about a circular
directory structure. Running chown with strace tells me that each directory
has the same device and inode number:
newfstatat(AT_FDCWD, "/nfs/", {st_dev=makedev(0, 38), st_ino=2, ...}) = 0
newfstatat(4, "vol1", {st_dev=makedev(0, 38), st_ino=2, ...}) = 0
newfstatat(4, "vol2", {st_dev=makedev(0, 38), st_ino=2, ...}) = 0
With this patch the mounted_on_fileid values are used for st_ino, so the
directory loop warning isn't reported.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
bdi_destroy already does all the work, and if we delay freeing the
anon bdev we can get away with just that single call.
Addintionally remove the call during mount failure, as
deactivate_super_locked will already call ->kill_sb and clean up
the bdi for us.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUNZK4AAoJEAAOaEEZVoIVI08P/iM7eaIVRnqaqtWw/JBzxiba
EMDlJYUBSlv6lYk9s8RJT4bMmcmGAKSYzVAHSoPahzNcqTDdFLeDTLGxJ8uKBbjf
d1qRRdH1yZHGUzCvJq3mEendjfXn435Y3YburUxjLfmzrzW7EbMvndiQsS5dhAm9
PEZ+wrKF/zFL7LuXa1YznYrbqOD/GRsJAXGEWc3kNwfS9avephVG/RI3GtpI2PJj
RY1mf8P7+WOlrShYoEuUo5aqs01MnU70LbqGHzY8/QKH+Cb0SOkCHZPZyClpiA+G
MMJ+o2XWcif3BZYz+dobwz/FpNZ0Bar102xvm2E8fqByr/T20JFjzooTKsQ+PtCk
DetQptrU2gtyZDKtInJUQSDPrs4cvA13TW+OEB1tT8rKBnmyEbY3/TxBpBTB9E6j
eb/V3iuWnywR3iE+yyvx24Qe7Pov6deM31s46+Vj+GQDuWmAUJXemhfzPtZiYpMT
exMXTyDS3j+W+kKqHblfU5f+Bh1eYGpG2m43wJVMLXKV7NwDf8nVV+Wea962ga+w
BAM3ia4JRVgRWJBPsnre3lvGT5kKPyfTZsoG+kOfRxiorus2OABoK+SIZBZ+c65V
Xh8VH5p3qyCUBOynXlHJWFqYWe2wH0LfbPrwe9dQwTwON51WF082EMG5zxTG0Ymf
J2z9Shz68zu0ok8cuSlo
=Hhee
-----END PGP SIGNATURE-----
Merge tag 'locks-v3.18-1' of git://git.samba.org/jlayton/linux
Pull file locking related changes from Jeff Layton:
"This release is a little more busy for file locking changes than the
last:
- a set of patches from Kinglong Mee to fix the lockowner handling in
knfsd
- a pile of cleanups to the internal file lease API. This should get
us a bit closer to allowing for setlease methods that can block.
There are some dependencies between mine and Bruce's trees this cycle,
and I based my tree on top of the requisite patches in Bruce's tree"
* tag 'locks-v3.18-1' of git://git.samba.org/jlayton/linux: (26 commits)
locks: fix fcntl_setlease/getlease return when !CONFIG_FILE_LOCKING
locks: flock_make_lock should return a struct file_lock (or PTR_ERR)
locks: set fl_owner for leases to filp instead of current->files
locks: give lm_break a return value
locks: __break_lease cleanup in preparation of allowing direct removal of leases
locks: remove i_have_this_lease check from __break_lease
locks: move freeing of leases outside of i_lock
locks: move i_lock acquisition into generic_*_lease handlers
locks: define a lm_setup handler for leases
locks: plumb a "priv" pointer into the setlease routines
nfsd: don't keep a pointer to the lease in nfs4_file
locks: clean up vfs_setlease kerneldoc comments
locks: generic_delete_lease doesn't need a file_lock at all
nfsd: fix potential lease memory leak in nfs4_setlease
locks: close potential race in lease_get_mtime
security: make security_file_set_fowner, f_setown and __f_setown void return
locks: consolidate "nolease" routines
locks: remove lock_may_read and lock_may_write
lockd: rip out deferred lock handling from testlock codepath
NFSD: Get reference of lockowner when coping file_lock
...
I am generally against the "one big header file" approach, and
everything in the client includes this file. Let's move all the NFS v3
declarations into a v3-only header file.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
GFS2 and NFS have setlease routines that always just return -EINVAL.
Turn that into a generic routine that can live in fs/libfs.c.
Cc: <linux-nfs@vger.kernel.org>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: <cluster-devel@redhat.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Highlights include:
- Stable fix for a bug in nfs3_list_one_acl()
- Speed up NFS path walks by supporting LOOKUP_RCU
- More read/write code cleanups
- pNFS fixes for layout return on close
- Fixes for the RCU handling in the rpcsec_gss code
- More NFS/RDMA fixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJT65zoAAoJEGcL54qWCgDyvq8QAJ+OKuC5dpngrZ13i4ZJIcK1
TJSkWCr44FhYPlrmkLCntsGX6C0376oFEtJ5uqloqK0+/QtvwRNVSQMKaJopKIVY
mR4En0WwpigxVQdW2lgto6bfOhzMVO+llVdmicEVrU8eeSThATxGNv7rxRzWorvL
RX3TwBkWSc0kLtPi66VRFQ1z+gg5I0kngyyhsKnLOaHHtpTYP2JDZlRPRkokXPUg
nmNedmC3JrFFkarroFIfYr54Qit2GW/eI2zVhOwHGCb45j4b2wntZ6wr7LpUdv3A
OGDBzw59cTpcx3Hij9CFvLYVV9IJJHBNd2MJqdQRtgWFfs+aTkZdk4uilUJCIzZh
f4BujQAlm/4X1HbPxsSvkCRKga7mesGM7e0sBDPHC1vu0mSaY1cakcj2kQLTpbQ7
gqa1cR3pZ+4shCq37cLwWU0w1yElYe1c4otjSCttPCrAjXbXJZSFzYnHm8DwKROR
t+yEDRL5BIXPu1nEtSnD2+xTQ3vUIYXooZWEmqLKgRtBTtPmgSn9Vd8P1OQXmMNo
VJyFXyjNx5WH06Wbc/jLzQ1/cyhuPmJWWyWMJlVROyv+FXk9DJUFBZuTkpMrIPcF
NlBXLV1GnA7PzMD9Xt9bwqteERZl6fOUDJLWS9P74kTk5c2kD+m+GaqC/rBTKKXc
ivr2s7aIDV48jhnwBSVL
=KE07
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.17-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
- stable fix for a bug in nfs3_list_one_acl()
- speed up NFS path walks by supporting LOOKUP_RCU
- more read/write code cleanups
- pNFS fixes for layout return on close
- fixes for the RCU handling in the rpcsec_gss code
- more NFS/RDMA fixes"
* tag 'nfs-for-3.17-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (79 commits)
nfs: reject changes to resvport and sharecache during remount
NFS: Avoid infinite loop when RELEASE_LOCKOWNER getting expired error
SUNRPC: remove all refcounting of groupinfo from rpcauth_lookupcred
NFS: fix two problems in lookup_revalidate in RCU-walk
NFS: allow lockless access to access_cache
NFS: teach nfs_lookup_verify_inode to handle LOOKUP_RCU
NFS: teach nfs_neg_need_reval to understand LOOKUP_RCU
NFS: support RCU_WALK in nfs_permission()
sunrpc/auth: allow lockless (rcu) lookup of credential cache.
NFS: prepare for RCU-walk support but pushing tests later in code.
NFS: nfs4_lookup_revalidate: only evaluate parent if it will be used.
NFS: add checks for returned value of try_module_get()
nfs: clear_request_commit while holding i_lock
pnfs: add pnfs_put_lseg_async
pnfs: find swapped pages on pnfs commit lists too
nfs: fix comment and add warn_on for PG_INODE_REF
nfs: check wait_on_bit_lock err in page_group_lock
sunrpc: remove "ec" argument from encrypt_v2 operation
sunrpc: clean up sparse endianness warnings in gss_krb5_wrap.c
sunrpc: clean up sparse endianness warnings in gss_krb5_seal.c
...
Pull namespace updates from Eric Biederman:
"This is a bunch of small changes built against 3.16-rc6. The most
significant change for users is the first patch which makes setns
drmatically faster by removing unneded rcu handling.
The next chunk of changes are so that "mount -o remount,.." will not
allow the user namespace root to drop flags on a mount set by the
system wide root. Aks this forces read-only mounts to stay read-only,
no-dev mounts to stay no-dev, no-suid mounts to stay no-suid, no-exec
mounts to stay no exec and it prevents unprivileged users from messing
with a mounts atime settings. I have included my test case as the
last patch in this series so people performing backports can verify
this change works correctly.
The next change fixes a bug in NFS that was discovered while auditing
nsproxy users for the first optimization. Today you can oops the
kernel by reading /proc/fs/nfsfs/{servers,volumes} if you are clever
with pid namespaces. I rebased and fixed the build of the
!CONFIG_NFS_FS case yesterday when a build bot caught my typo. Given
that no one to my knowledge bases anything on my tree fixing the typo
in place seems more responsible that requiring a typo-fix to be
backported as well.
The last change is a small semantic cleanup introducing
/proc/thread-self and pointing /proc/mounts and /proc/net at it. This
prevents several kinds of problemantic corner cases. It is a
user-visible change so it has a minute chance of causing regressions
so the change to /proc/mounts and /proc/net are individual one line
commits that can be trivially reverted. Unfortunately I lost and
could not find the email of the original reporter so he is not
credited. From at least one perspective this change to /proc/net is a
refgression fix to allow pthread /proc/net uses that were broken by
the introduction of the network namespace"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
proc: Point /proc/mounts at /proc/thread-self/mounts instead of /proc/self/mounts
proc: Point /proc/net at /proc/thread-self/net instead of /proc/self/net
proc: Implement /proc/thread-self to point at the directory of the current thread
proc: Have net show up under /proc/<tgid>/task/<tid>
NFS: Fix /proc/fs/nfsfs/servers and /proc/fs/nfsfs/volumes
mnt: Add tests for unprivileged remount cases that have found to be faulty
mnt: Change the default remount atime from relatime to the existing value
mnt: Correct permission checks in do_remount
mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount
mnt: Only change user settable mount flags in remount
namespaces: Use task_lock and not rcu to protect nsproxy
The usage of pid_ns->child_reaper->nsproxy->net_ns in
nfs_server_list_open and nfs_client_list_open is not safe.
/proc for a pid namespace can remain mounted after the all of the
process in that pid namespace have exited. There are also times
before the initial process in a pid namespace has started or after the
initial process in a pid namespace has exited where
pid_ns->child_reaper can be NULL or stale. Making the idiom
pid_ns->child_reaper->nsproxy a double whammy of problems.
Luckily all that needs to happen is to move /proc/fs/nfsfs/servers and
/proc/fs/nfsfs/volumes under /proc/net to /proc/net/nfsfs/servers and
/proc/net/nfsfs/volumes and add a symlink from the original location,
and to use seq_open_net as it has been designed.
Cc: stable@vger.kernel.org
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
It is currently not possible for various wait_on_bit functions
to implement a timeout.
While the "action" function that is called to do the waiting
could certainly use schedule_timeout(), there is no way to carry
forward the remaining timeout after a false wake-up.
As false-wakeups a clearly possible at least due to possible
hash collisions in bit_waitqueue(), this is a real problem.
The 'action' function is currently passed a pointer to the word
containing the bit being waited on. No current action functions
use this pointer. So changing it to something else will be a
little noisy but will have no immediate effect.
This patch changes the 'action' function to take a pointer to
the "struct wait_bit_key", which contains a pointer to the word
containing the bit so nothing is really lost.
It also adds a 'private' field to "struct wait_bit_key", which
is initialized to zero.
An action function can now implement a timeout with something
like
static int timed_out_waiter(struct wait_bit_key *key)
{
unsigned long waited;
if (key->private == 0) {
key->private = jiffies;
if (key->private == 0)
key->private -= 1;
}
waited = jiffies - key->private;
if (waited > 10 * HZ)
return -EAGAIN;
schedule_timeout(waited - 10 * HZ);
return 0;
}
If any other need for context in a waiter were found it would be
easy to use ->private for some other purpose, or even extend
"struct wait_bit_key".
My particular need is to support timeouts in nfs_release_page()
to avoid deadlocks with loopback mounted NFS.
While wait_on_bit_timeout() would be a cleaner interface, it
will not meet my need. I need the timeout to be sensitive to
the state of the connection with the server, which could change.
So I need to use an 'action' interface.
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steve French <sfrench@samba.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140707051604.28027.41257.stgit@notabene.brown
Signed-off-by: Ingo Molnar <mingo@kernel.org>
* bugfixes:
NFS: Don't reset pg_moreio in __nfs_pageio_add_request
NFS: Remove 2 unused variables
nfs: handle multiple reqs in nfs_wb_page_cancel
nfs: handle multiple reqs in nfs_page_async_flush
nfs: change find_request to find_head_request
nfs: nfs_page should take a ref on the head req
nfs: mark nfs_page reqs with flag for extra ref
nfs: only show Posix ACLs in listxattr if actually present
Conflicts:
fs/nfs/write.c
Change nfs_find_and_lock_request so nfs_page_async_flush can handle multiple
requests in a page. There is only one request for a page the first time
nfs_page_async_flush is called, but if a write or commit fails, async_flush
is called again and there may be multiple requests associated with the page.
The solution is to merge all the requests in a page group into a single
request before calling nfs_pageio_add_request.
Rename nfs_find_and_lock_request to nfs_lock_and_join_requests and
change it to first lock all requests for the page, then cancel and merge
all subrequests into the head request.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Remove duplicate writeverf structure from merge of nfs_pgio_header and
nfs_pgio_data and remove writeverf related flags and logic to handle
more than one RPC per nfs_pgio_header.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
struct nfs_pgio_data only exists as a member of nfs_pgio_header, but is
passed around everywhere, because there used to be multiple _data structs
per _header. Many of these functions then use the _data to find a pointer
to the _header. This patch cleans this up by merging the nfs_pgio_data
structure into nfs_pgio_header and passing nfs_pgio_header around instead.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
nfs_rw_header was used to allocate an nfs_pgio_header along with an
nfs_pgio_data, because a _header would need at least one _data.
Now there is only ever one nfs_pgio_data for each nfs_pgio_header -- move
it to nfs_pgio_header and get rid of nfs_rw_header.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Pull vfs updates from Al Viro:
"This the bunch that sat in -next + lock_parent() fix. This is the
minimal set; there's more pending stuff.
In particular, I really hope to get acct.c fixes merged this cycle -
we need that to deal sanely with delayed-mntput stuff. In the next
pile, hopefully - that series is fairly short and localized
(kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more
iov_iter work. Most of prereqs for ->splice_write with sane locking
order are there and Kent's dio rewrite would also fit nicely on top of
this pile"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
lock_parent: don't step on stale ->d_parent of all-but-freed one
kill generic_file_splice_write()
ceph: switch to iter_file_splice_write()
shmem: switch to iter_file_splice_write()
nfs: switch to iter_splice_write_file()
fs/splice.c: remove unneeded exports
ocfs2: switch to iter_file_splice_write()
->splice_write() via ->write_iter()
bio_vec-backed iov_iter
optimize copy_page_{to,from}_iter()
bury generic_file_aio_{read,write}
lustre: get rid of messing with iovecs
ceph: switch to ->write_iter()
ceph_sync_direct_write: stop poking into iov_iter guts
ceph_sync_read: stop poking into iov_iter guts
new helper: copy_page_from_iter()
fuse: switch to ->write_iter()
btrfs: switch to ->write_iter()
ocfs2: switch to ->write_iter()
xfs: switch to ->write_iter()
...
At this point the read and write structures look identical, so combine
them into something shared by both.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
What we have here is two functions that look identical. Let's share
some more code!
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Once again, these two functions look identical in the read and write
case. Time to combine them together!
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Most of this code is the same for both the read and write paths, so
combine everything and use the rw_ops when necessary.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
These functions are almost identical on both the read and write side.
FLUSH_COND_STABLE will never be set for the read path, so leaving it in
the generic code won't hurt anything.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
At this point, the read and write versions of this function look
identical so both should use the same function.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Write adds a little bit of code dealing with flush flags, but since
"how" will always be 0 when reading we can share the code.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The read and write paths set up this struct in exactly the same way, so
create a single shared struct.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Combining these functions will let me make a single nfs_rw_common_ops
struct (see the next patch).
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The read and write paths do exactly the same thing for the rpc_prepare
rpc_op. This patch combines them together into a single function.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
I create a new struct nfs_rw_ops to decide the differences between reads
and writes. This struct will be set when initializing a new
nfs_pgio_descriptor, and then passed on to the nfs_rw_header when a new
header is allocated.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
These functions are identical for the read and write paths so they can
be combined.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The only difference is the write verifier field, but we can keep that
for a little bit longer.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
At this point, the only difference between nfs_read_data and
nfs_write_data is the write verifier.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The read_pageio_init method is just a very convoluted way to grab the
right nfs_pageio_ops vector. The vector to chose is not a choice of
protocol version, but just a pNFS vs MDS I/O choice that can simply be
done inside nfs_pageio_init_read based on the presence of a layout
driver, and a new force_mds flag to the special case of falling back
to MDS I/O on a pNFS-capable volume.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The write_pageio_init method is just a very convoluted way to grab the
right nfs_pageio_ops vector. The vector to chose is not a choice of
protocol version, but just a pNFS vs MDS I/O choice that can simply be
done inside nfs_pageio_init_write based on the presence of a layout
driver, and a new force_mds flag to the special case of falling back
to MDS I/O on a pNFS-capable volume.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
...and move the prototype for nfs_sillyrename to internal.h.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Tested-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We need to use the same net namespace that was used to resolve
the hostname and sockaddr arguments.
Fixes: 32e62b7c3e (NFS: Add nfs4_update_server)
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Try to detect 'ls -l' by having nfs_getattr() look at whether or not
there is an opendir() file descriptor for the parent directory.
If so, then assume that we want to force use of readdirplus in order
to avoid the multiple GETATTR calls over the wire.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Commit aa9c266962 (NFS: Client implementation of Labeled-NFS) introduces
a performance regression. When nfs_zap_caches_locked is called, it sets
the NFS_INO_INVALID_LABEL flag irrespectively of whether or not the
NFS server supports security labels. Since that flag is never cleared,
it means that all calls to nfs_revalidate_inode() will now trigger
an on-the-wire GETATTR call.
This patch ensures that we never set the NFS_INO_INVALID_LABEL unless the
server advertises support for labeled NFS.
It also causes nfs_setsecurity() to clear NFS_INO_INVALID_LABEL when it
has successfully set the security label for the inode.
Finally it gets rid of the NFS_INO_INVALID_LABEL cruft from nfs_update_inode,
which has nothing to do with labeled NFS.
Reported-by: Neil Brown <neilb@suse.de>
Cc: stable@vger.kernel.org # 3.11+
Tested-by: Neil Brown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When CONFIG_NFS_V4_2 is toggled nfsd and lockd will be recompiled,
instead of only the nfs client. This patch moves a small amount of code
into the client directory to avoid unnecessary recompiles.
Signed-off-by: Anna Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch adds support for multiple security options which can be
specified using a colon-delimited list of security flavors (the same
syntax as nfsd's exports file).
This is useful, for instance, when NFSv4.x mounts cross SECINFO
boundaries. With this patch a user can use "sec=krb5i,krb5p"
to mount a remote filesystem using krb5i, but can still cross
into krb5p-only exports.
New mounts will try all security options before failing. NFSv4.x
SECINFO results will be compared against the sec= flavors to
find the first flavor in both lists or if no match is found will
return -EPERM.
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When filling parsed_mount_data, store the parsed sec= mount option in
the new struct nfs_auth_info and the chosen flavor in selected_flavor.
This patch lays the groundwork for supporting multiple sec= options.
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
New function nfs4_update_server() moves an nfs_server to a different
nfs_client. This is done as part of migration recovery.
Though it may be appealing to think of them as the same thing,
migration recovery is not the same as following a referral.
For a referral, the client has not descended into the file system
yet: it has no nfs_server, no super block, no inodes or open state.
It is enough to simply instantiate the nfs_server and super block,
and perform a referral mount.
For a migration, however, we have all of those things already, and
they have to be moved to a different nfs_client. No local namespace
changes are needed here.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Convert the filesystem shrinkers to use the new API, and standardise some
of the behaviours of the shrinkers at the same time. For example,
nr_to_scan means the number of objects to scan, not the number of objects
to free.
I refactored the CIFS idmap shrinker a little - it really needs to be
broken up into a shrinker per tree and keep an item count with the tree
root so that we don't need to walk the tree every time the shrinker needs
to count the number of objects in the tree (i.e. all the time under
memory pressure).
[glommer@openvz.org: fixes for ext4, ubifs, nfs, cifs and glock. Fixes are needed mainly due to new code merged in the tree]
[assorted fixes folded in]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
NFSv4 security auto-negotiation has been broken since
commit 4580a92d44 (NFS:
Use server-recommended security flavor by default (NFSv3))
because nfs4_try_mount() will automatically select AUTH_SYS
if it sees no auth flavours.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Commit 4edaa308 "NFS: Use "krb5i" to establish NFSv4 state whenever possible"
uses the nfs_client cl_rpcclient for all state management operations, and
will use krb5i or auth_sys with no regard to the mount command authflavor
choice.
The MDS, as any NFSv4.1 mount point, uses the nfs_server rpc client for all
non-state management operations with a different nfs_server for each fsid
encountered traversing the mount point, each with a potentially different
auth flavor.
pNFS data servers are not mounted in the normal sense as there is no associated
nfs_server structure. Data servers can also export multiple fsids, each with
a potentially different auth flavor.
Data servers need to use the same authflavor as the MDS server rpc client for
non-state management operations. Populate a list of rpc clients with the MDS
server rpc client auth flavor for the DS to use.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We must avoid buffering a WRITE that is using a credential key (e.g. a GSS
context key) that is about to expire or has expired. We currently will
paint ourselves into a corner by returning success to the applciation
for such a buffered WRITE, only to discover that we do not have permission when
we attempt to flush the WRITE (and potentially associated COMMIT) to disk.
Use the RPC layer credential key timeout and expire routines which use a
a watermark, gss_key_expire_timeo. We test the key in nfs_file_write.
If a WRITE is using a credential with a key that will expire within
watermark seconds, flush the inode in nfs_write_end and send only
NFS_FILE_SYNC WRITEs by adding nfs_ctx_key_to_expire to nfs_need_sync_write.
Note that this results in single page NFS_FILE_SYNC WRITEs.
Signed-off-by: Andy Adamson <andros@netapp.com>
[Trond: removed a pr_warn_ratelimited() for now]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* labeled-nfs:
NFS: Apply v4.1 capabilities to v4.2
NFS: Add in v4.2 callback operation
NFS: Make callbacks minor version generic
Kconfig: Add Kconfig entry for Labeled NFS V4 client
NFS: Extend NFS xattr handlers to accept the security namespace
NFS: Client implementation of Labeled-NFS
NFS: Add label lifecycle management
NFS:Add labels to client function prototypes
NFSv4: Extend fattr bitmaps to support all 3 words
NFSv4: Introduce new label structure
NFSv4: Add label recommended attribute and NFSv4 flags
NFSv4.2: Added NFS v4.2 support to the NFS client
SELinux: Add new labeling type native labels
LSM: Add flags field to security_sb_set_mnt_opts for in kernel mount data.
Security: Add Hook to test if the particular xattr is part of a MAC model.
Security: Add hook to calculate context based on a negative dentry.
NFS: Add NFSv4.2 protocol constants
Conflicts:
fs/nfs/nfs4proc.c
The GETDEVICEINFO gdia_maxcount represents all of the data being returned
within the GETDEVICEINFO4resok structure and includes the XDR overhead.
The CREATE_SESSION ca_maxresponsesize is the maximum reply and includes the RPC
headers (including security flavor credentials and verifiers).
Split out the struct pnfs_device field maxcount which is the gdia_maxcount
from the pglen field which is the reply (the total) buffer length.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
I found a few places that hardcode the minor version number rather than
making it dependent on the protocol the callback came in over. This
patch makes it easier to add new minor versions in the future.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This will later allow NFS locking code to wait for readahead to complete
before releasing byte range locks.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This reverts commit 324d003b0c.
The deadlock turned out to be caused by a workqueue limitation that has
now been worked around in the RPC code (see comment in rpc_free_task).
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Remove duplicate function declaration in internal.h
Signed-off-by: Yanchuan Nian <ycnian@gmail.com>
[Trond: Added nfs_pageio_init_read, which suffered from the same problem]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We shouldn't need to pass the 'cache_reply' parameter if we
initialise the sequence_args/sequence_res in the caller.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Use nfs_sb_deactive_async instead of nfs_sb_deactive when in a workqueue
context. This avoids a deadlock where rpc_shutdown_client loops forever
in a workqueue kworker context, trying to kill all RPC tasks associated with
the client, while one or more of these tasks have already been assigned to the
same kworker (and will never run rpc_exit_task).
This approach is needed because RPC tasks that have already been assigned
to a kworker by queue_work cannot be canceled, as explained in the comment
for workqueue.c:insert_wq_barrier.
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
[Trond: add module_get/put.]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Since commit c7f404b ('vfs: new superblock methods to override
/proc/*/mount{s,info}'), nfs_path() is used to generate the mounted
device name reported back to userland.
nfs_path() always generates a trailing slash when the given dentry is
the root of an NFS mount, but userland may expect the original device
name to be returned verbatim (as it used to be). Make this
canonicalisation optional and change the callers accordingly.
[jrnieder@gmail.com: use flag instead of bool argument]
Reported-and-tested-by: Chris Hiestand <chiestand@salk.edu>
Reference: http://bugs.debian.org/669314
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: <stable@vger.kernel.org> # v2.6.39+
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
For buffer write, block layout client scan inode mapping to find
next hole and use offset-to-hole as layoutget length. Object
layout client uses offset-to-isize as layoutget length.
For direct write, both block layout and object layout use dreq->bytes_left.
Signed-off-by: Peng Tao <tao.peng@emc.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
"Server trunking" is a fancy named for a multi-homed NFS server.
Trunking might occur if a client sends NFS requests for a single
workload to multiple network interfaces on the same server. There
are some implications for NFSv4 state management that make it useful
for a client to know if a single NFSv4 server instance is
multi-homed. (Note this is only a consideration for NFSv4, not for
legacy versions of NFS, which are stateless).
If a client cares about server trunking, no NFSv4 operations can
proceed until that client determines who it is talking to. Thus
server IP trunking discovery must be done when the client first
encounters an unfamiliar server IP address.
The nfs_get_client() function walks the nfs_client_list and matches
on server IP address. The outcome of that walk tells us immediately
if we have an unfamiliar server IP address. It invokes
nfs_init_client() in this case. Thus, nfs4_init_client() is a good
spot to perform trunking discovery.
Discovery requires a client to establish a fresh client ID, so our
client will now send SETCLIENTID or EXCHANGE_ID as the first NFS
operation after a successful ping, rather than waiting for an
application to perform an operation that requires NFSv4 state.
The exact process for detecting trunking is different for NFSv4.0 and
NFSv4.1, so a minorversion-specific init_client callout method is
introduced.
CLID_INUSE recovery is important for the trunking discovery process.
CLID_INUSE is a sign the server recognizes the client's nfs_client_id4
id string, but the client is using the wrong principal this time for
the SETCLIENTID operation. The SETCLIENTID must be retried with a
series of different principals until one works, and then the rest of
trunking discovery can proceed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
fs/nfs/super.c: In function ‘nfs_compare_remount_data’:
fs/nfs/super.c:2042:18: warning: comparison between signed and
unsigned integer expressions [-Wsign-compare]
fs/nfs/super.c:2043:18: warning: comparison between signed and
unsigned integer expressions [-Wsign-compare]
fs/nfs/super.c:2044:20: warning: comparison between signed and
unsigned integer expressions [-Wsign-compare]
fs/nfs/super.c:2046:21: warning: comparison between signed and
unsigned integer expressions [-Wsign-compare]
fs/nfs/super.c:2047:21: warning: comparison between signed and
unsigned integer expressions [-Wsign-compare]
fs/nfs/super.c:2048:21: warning: comparison between signed and
unsigned integer expressions [-Wsign-compare]
fs/nfs/super.c:2049:21: warning: comparison between signed and
unsigned integer expressions [-Wsign-compare]
fs/nfs/super.c:2050:18: warning: comparison between signed and
unsigned integer expressions [-Wsign-compare]
Seen with gcc (GCC) 4.6.3 20120306 (Red Hat 4.6.3-2).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Merge Andrew's second set of patches:
- MM
- a few random fixes
- a couple of RTC leftovers
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (120 commits)
rtc/rtc-88pm80x: remove unneed devm_kfree
rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails
mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
tmpfs: distribute interleave better across nodes
mm: remove redundant initialization
mm: warn if pg_data_t isn't initialized with zero
mips: zero out pg_data_t when it's allocated
memcg: gix memory accounting scalability in shrink_page_list
mm/sparse: remove index_init_lock
mm/sparse: more checks on mem_section number
mm/sparse: optimize sparse_index_alloc
memcg: add mem_cgroup_from_css() helper
memcg: further prevent OOM with too many dirty pages
memcg: prevent OOM with too many dirty pages
mm: mmu_notifier: fix freed page still mapped in secondary MMU
mm: memcg: only check anon swapin page charges for swap cache
mm: memcg: only check swap cache pages for repeated charging
mm: memcg: split swapin charge function into private and public part
mm: memcg: remove needless !mm fixup to init_mm when charging
mm: memcg: remove unneeded shmem charge type
...
Replace all relevant occurences of page->index and page->mapping in the
NFS client with the new page_file_index() and page_file_mapping()
functions.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Paris <eparis@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Neil Brown <neilb@suse.de>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch exports symbols needed by the v4 module. In addition, I also
switch over to using IS_ENABLED() to check if CONFIG_NFS_V4 or
CONFIG_NFS_V4_MODULE are set.
The module (nfs4.ko) will be created in the same directory as nfs.ko and
will be automatically loaded the first time you try to mount over NFS v4.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch exports symbols and moves over the final structures needed by
the v3 module. In addition, I also switch over to using IS_ENABLED() to
check if CONFIG_NFS_V3 or CONFIG_NFS_V3_MODULE are set.
The module (nfs3.ko) will be created in the same directory as nfs.ko and
will be automatically loaded the first time you try to mount over NFS v3.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Somehow I missed this in my previous patch series, but these functions
are only needed by the v4 code and should be moved to a v4-only file. I
wasn't exactly sure where I should put these functions, so I moved them
into nfs4super.c where I could make them static.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
I can set all variables in the nfs_fill_super() function, allowing me to
remove the nfs4_fill_super() function.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
v2 and v4 don't use it, so I create two new nfs_rpc_ops functions to
initialize the ACL client only when we are using v3.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
I'm already looking up the nfs subversion in nfs_fs_mount(), so I have
easy access to rpc_ops that used to be difficult to reach. This allows
me to set up a different mount path for NFS v2/3 and NFS v4.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch adds in the code to track multiple versions of the NFS
protocol. I created default structures for v2, v3 and v4 so that each
version can continue to work while I convert them into kernel modules.
I also removed the const parameter from the rpc_version array so that I
can change it at runtime.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This allows me to move the v4 mounting and unmounting functions out of
the generic client and into a file that is only compiled when CONFIG_NFS_V4
is enabled.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
These functions are specific to NFS v4 and can be moved to nfs4client.c
to keep them out of the generic client.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
And split these functions out of the generic client into a v4 specific
file.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch moves the NFS v4 file functions into a new file that is only
compiled when CONFIG_NFS_V4 is enabled.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch moves the NFS v2 file and directory inode functions into
files that are only compiled whet CONFIG_NFS_V2 is enabled.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
pNFS needs to select a write function based on the layout driver
currently in use, so I let each NFS version decide how to best handle
initializing writes.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
pNFS needs to select a read function based on the layout driver
currently in use, so I let each NFS version decide how to best handle
initializing reads.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This gives NFS v4 a way to set up callbacks and sessions without v2 or
v3 having to do them as well.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
NFS v4 needs a way to shut down callbacks and sessions, but v2 and v3
don't.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Use the same mechanism as the block devices are using, but move the
helper functions from fs/direct-io.c into fs/inode.c to remove the
dependency on CONFIG_BLOCK.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since the struct nfs_client gets added to the global nfs_client_list
before it is initialised, it is possible that rpc_pipefs_event can
end up trying to create idmapper entries on such a thing.
The solution is to have the mount notification wait for the
initialisation of each nfs_client to complete, and then to
skip any entries for which the it failed.
Reported-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Session initialisation is not complete until the lease manager
has run. We need to ensure that both nfs4_init_session and
nfs4_init_ds_session do so, and that they check for any resulting
errors in clp->cl_cons_state.
Only after this is done, can nfs4_ds_connect check the contents
of clp->cl_exchange_flags.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Andy Adamson <andros@netapp.com>
"noresvport" and "discrtry" can be passed to nfs_create_rpc_client()
by setting flags in the passed-in nfs_client. This change makes it
easy to add new flags.
Note that these settings are now "sticky" over the lifetime of a
struct nfs_client, and may even be copied when an nfs_client is
cloned.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: Continue to rationalize the locking in nfs_get_client() by
moving the logic that handles the case where a matching server IP
address is not found.
When we support server trunking detection, client initialization may
return a different nfs_client struct than was passed to it. Change
the synopsis of the init_client methods to return an nfs_client.
The client initialization logic in nfs_get_client() is not much more
than a wrapper around ->init_client. It's simpler to keep the little
bits of error handling in the version-specific init_client methods.
No behavior change is expected.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Replaced by filelayout_reset_write and filelayout_reset_read
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Set the recovery parameters for data servers.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
RPC_TASK_SOFTCONN returns connection errors to the caller which allows the pNFS
file layout to quickly try the MDS or perhaps another DS.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In theory, NFS v3 can have different error versions than NFS v2. v4 is
already using its own nfs4_stat_to_errno() to map error codes, so
rather than create something in the generic client for v2 and v3 to
share I instead give v3 its own function.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The v2/3 and v4 cases were very similar, with just a few parameters
changed. This makes it easy to share code.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This simplifies the code for v2 and v3 and gives v4 a chance to decide
on referrals without needing to modify the generic client.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This also has the advantage that it allows directio to use pnfs.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Factors out the code that needs to change when directio
starts using these code paths.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
It is COMMIT that is handled the most differently between
the paged and direct paths. Create a structure that encapsulates
everything either path needs to know about the commit state.
We could use void to hide some of the layout driver stuff, but
Trond suggests pulling it out to ensure type checking, given the
huge changes being made, and the fact that it doesn't interfere
with other drivers.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This also has the advantage that it allows directio to use pnfs.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Factors out the code that will need to change when directio
starts using these code paths. This will allow directio to use
the generic pagein and flush routines
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Decouple nfs_pgio_header and nfs_write_data, and have (possibly
multiple) nfs_write_datas each take a refcount on nfs_pgio_header.
For the moment keeps nfs_write_header as a way to preallocate a single
nfs_write_data with the nfs_pgio_header. The code doesn't need this,
and would be prettier without, but given the amount of churn I am
already introducing I didn't want to play with tuning new mempools.
This also fixes bug in pnfs_ld_handle_write_error. In the case of
desc->pg_bsize < PAGE_CACHE_SIZE, the pages list was empty, causing
replay attempt to do nothing.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Decouple nfs_pgio_header and nfs_read_data, and have (possibly
multiple) nfs_read_datas each take a refcount on nfs_pgio_header.
For the moment keeps nfs_read_header as a way to preallocate a single
nfs_read_data with the nfs_pgio_header. The code doesn't need this,
and would be prettier without, but given the amount of churn I am
already introducing I didn't want to play with tuning new mempools.
This also fixes bug in pnfs_ld_handle_read_error. In the case of
desc->pg_bsize < PAGE_CACHE_SIZE, the pages list was empty, causing
replay attempt to do nothing.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Both nfs_read_data and nfs_write_data devote several fields which
can be combined into a single shared struct.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In order to avoid duplicating all the data in nfs_read_data whenever we
split it up into multiple RPC calls (either due to a short read result
or due to rsize < PAGE_SIZE), we split out the bits that are the same
per RPC call into a separate "header" structure.
The goal this patch moves towards is to have a single header
refcounted by several rpc_data structures. Thus, want to always refer
from rpc_data to the header, and not the other way. This patch comes
close to that ideal, but the directio code currently needs some
special casing, isolated in the nfs_direct_[read_write]hdr_release()
functions. This will be dealt with in a future patch.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make it consistent with nfs_initiate_commit.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Commits don't need the vectors of pages, etc. that writes do. Split out
a separate structure for the commit operation.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The authflavor is set in an nfs_clone_mount structure and passed to the
xdev_mount() functions where it was promptly ignored. Instead, use it
to initialize an rpc_clnt for the cloned server.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
I create a new proc_lookup_mountpoint() to use when submounting an NFS
v4 share. This function returns an rpc_clnt to use for performing an
fs_locations() call on a referral's mountpoint.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Whenever lookup sees wrongsec do a secinfo and retry the lookup to find
attributes of the file or directory, such as "is this a referral
mountpoint?". This also allows me to remove handling -NFS4ERR_WRONSEC
as part of getattr xdr decoding.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Move more pnfs-isms out of the generic commit code.
Bugfixes:
- filelayout_scan_commit_lists doesn't need to get/put the lseg.
In fact since it is run under the inode->i_lock, the lseg_put()
can deadlock.
- Ensure that we distinguish between what needs to be done for
commit-to-data server and what needs to be done for commit-to-MDS
using the new flag PG_COMMIT_TO_DS. Otherwise we may end up calling
put_lseg() on a bucket for a struct nfs_page that got written
through the MDS.
- Fix a case where we were using list_del() on an nfs_page->wb_list
instead of list_del_init().
- filelayout_initiate_commit needs to call filelayout_commit_release
on error instead of the mds_ops->rpc_release(). Otherwise it won't
clear the commit lock.
Cleanups:
- Let the files layout manage the commit lists for the pNFS case.
Don't expose stuff like pnfs_choose_commit_list, and the fact
that the commit buckets hold references to the layout segment
in common code.
- Cast out the put_lseg() calls for the struct nfs_read/write_data->lseg
into the pNFS layer from whence they came.
- Let the pNFS layer manage the NFS_INO_PNFS_COMMIT bit.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Fred Isaman <iisaman@netapp.com>
The radix tree is only being used to compile lists of reqs needing commit.
It is simpler to just put the reqs directly into a list.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Network namespace is taken from request transport and passed as a part of
cb_process_state structure.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch makes nfs_clients_lock allocated per network namespace. All items it
protects are already network namespace aware.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch makes ID's infrastructure network namespace aware. This was done
mainly because of nfs_client_lock, which is desired to be per network
namespace, but protects NFS clients ID's.
NOTE: NFS client's net pointer have to be set prior to ID initialization,
proper assignment was moved.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch splits global list of NFS clients into per-net-ns array of lists.
This looks more strict and clearer.
BTW, this patch also makes "/proc/fs/nfsfs/servers" entry content depends on
/proc mount owner pid namespace. See below for details.
NOTE: few words about how was /proc/fs/nfsfs/ entries content show per network
namespace done. This is a little bit tricky and not the best is could be. But
it's cheap (proper fix for /proc conteinerization is a hard nut to crack).
The idea is simple: take proper network namespace from pid namespace
child reaper nsproxy of /proc/ mount creator.
This actually means, that if there are 2 containers with different net
namespace sharing pid namespace, then read of /proc/fs/nfsfs/ entries will
always return content, taken from net namespace of pid namespace creator task
(and thus second namespace set wil be unvisible).
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Looks like this function survived after some cleanup patch without a reason.
Now it's not called or referenced and I believe, that it can be simply removed.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
v2:
1) Added "nfs_idmap_init" and "nfs_idmap_quit" definitions for kernels built
without CONFIG_NFS_V4 option set.
This patch subscribes NFS clients to RPC pipefs notifications. Idmap notifier
is registering on NFS module load. This notifier callback is responsible for
creation/destruction of PipeFS idmap pipe dentry for NFS4 clients.
Since ipdmap pipe is created in rpc client pipefs directory, we have make sure,
that this directory has been created already. IOW RPC client notifier callback
has been called already. To achive this, PipeFS notifier priorities has been
introduced (RPC clients notifier priority is greater than NFS idmap one).
But this approach gives another problem: unlink for RPC client directory will
be called before NFS idmap pipe unlink on UMOUNT event and will fail, because
directory is not empty.
The solution, introduced in this patch, is to try to remove client directory
once again after idmap pipe was unlinked. This looks like ugly hack, so
probably it should be replaced in some more elegant way.
Note that no locking required in notifier callback because PipeFS superblock
pointer is passed as an argument from it's creation or destruction routine and
thus we can be sure about it's validity.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch adds new net variable to nfs_client structure. This variable is set
on NFS client creation and cheched during matching NFS client search.
Initially current->nsproxy->net_ns is used as network namespace owner for new
NFS client to create. This network namespace pointer is set during mount
options parsing and thus can be passed from user-spave utils in future if will
be necessary.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
mode that avoids writing back pages to backing storage. Async compaction
maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
used.
This avoids sync compaction stalling for an excessive length of time,
particularly when copying files to a USB stick where there might be a
large number of dirty pages backed by a filesystem that does not support
->writepages.
[aarcange@redhat.com: This patch is heavily based on Andrea's work]
[akpm@linux-foundation.org: fix fs/nfs/write.c build]
[akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Asynchronous compaction is used when allocating transparent hugepages to
avoid blocking for long periods of time. Due to reports of stalling,
there was a debate on disabling synchronous compaction but this severely
impacted allocation success rates. Part of the reason was that many dirty
pages are skipped in asynchronous compaction by the following check;
if (PageDirty(page) && !sync &&
mapping->a_ops->migratepage != migrate_page)
rc = -EBUSY;
This skips over all mapping aops using buffer_migrate_page() even though
it is possible to migrate some of these pages without blocking. This
patch updates the ->migratepage callback with a "sync" parameter. It is
the responsibility of the callback to fail gracefully if migration would
block.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have no business doing any this in the standard write release path.
Get rid of it, and put it in the pNFS layer.
Also, while we're at it, get rid of the completely bogus unlock/relock
semantics that were present in nfs_writeback_release_full(). It is
not only unnecessary, but actually dangerous to release the write lock
just in order to take it again in nfs_page_async_flush(). Better just
to open code the pgio operations in a pnfs helper.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>