When using multi-homed machines, it's nice to be able to specify
the local IP to use for outbound connections. This patch gives
cifs the ability to bind to a particular IP address.
Usage: mount -t cifs -o srcaddr=192.168.1.50,user=foo, ...
Usage: mount -t cifs -o srcaddr=2002:💯1,user=foo, ...
Acked-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Dr. David Holder <david.holder@erion.co.uk>
Signed-off-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Attribue Value (AV) pairs or Target Info (TI) pairs are part of
ntlmv2 authentication.
Structure ntlmv2_resp had only definition for two av pairs.
So removed it, and now allocation of av pairs is dynamic.
For servers like Windows 7/2008, av pairs sent by server in
challege packet (type 2 in the ntlmssp exchange/negotiation) can
vary.
Server sends them during ntlmssp negotiation. So when ntlmssp is used
as an authentication mechanism, type 2 challenge packet from server
has this information. Pluck it and use the entire blob for
authenticaiton purpose. If user has not specified, extract
(netbios) domain name from the av pairs which is used to calculate
ntlmv2 hash. Servers like Windows 7 are particular about the AV pair
blob.
Servers like Windows 2003, are not very strict about the contents
of av pair blob used during ntlmv2 authentication.
So when security mechanism such as ntlmv2 is used (not ntlmv2 in ntlmssp),
there is no negotiation and so genereate a minimal blob that gets
used in ntlmv2 authentication as well as gets sent.
Fields tilen and tilbob are session specific. AV pair values are defined.
To calculate ntlmv2 response we need ti/av pair blob.
For sec mech like ntlmssp, the blob is plucked from type 2 response from
the server. From this blob, netbios name of the domain is retrieved,
if user has not already provided, to be included in the Target String
as part of ntlmv2 hash calculations.
For sec mech like ntlmv2, create a minimal, two av pair blob.
The allocated blob is freed in case of error. In case there is no error,
this blob is used in calculating ntlmv2 response (in CalcNTLMv2_response)
and is also copied on the response to the server, and then freed.
The type 3 ntlmssp response is prepared on a buffer,
5 * sizeof of struct _AUTHENTICATE_MESSAGE, an empirical value large
enough to hold _AUTHENTICATE_MESSAGE plus a blob with max possible
10 values as part of ntlmv2 response and lmv2 keys and domain, user,
workstation names etc.
Also, kerberos gets selected as a default mechanism if server supports it,
over the other security mechanisms.
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Change name of variable mac_key to session key.
The reason mac_key was changed to session key is, this structure does not
hold message authentication code, it holds the session key (for ntlmv2,
ntlmv1 etc.). mac is generated as a signature in cifs_calc* functions.
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
cifs_new_fileinfo() does not use the 'oplock' value from the callers. Instead,
it sets it to REQ_OPLOCK which seems wrong. We should be using the oplock value
obtained from the Server to set the inode's clientCanCacheAll or
clientCanCacheRead flags. Fix this by passing oplock from the callers to
cifs_new_fileinfo().
This change dates back to commit a6ce4932 (2.6.30-rc3). So, all the affected
versions will need this fix. Please Cc stable once reviewed and accepted.
Cc: Stable <stable@kernel.org>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
... and avoid implicit casting from a signed type. Also, pass oplock by value
instead by reference as we don't intend to change the value in
cifs_open_inode_helper().
Thanks to Jeff Layton for spotting this.
Reviewed-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Recently a feature was added to GFS2 to allow journal id allocation
via sysfs. This patch builds upon that so that a negative journal id
will be treated as an error code to be passed back as the return code
from mount. This allows termination of the mount process if there is
a failure.
Also, the process has been updated so that the kernel will wait
for a journal id, even in the "spectator" case. This is required
in order to avoid mounting a filesystem in case there is an error
while joining the cluster. In the spectator case, 0 is written into
the file to indicate that all is well, and that mount should continue.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
XFS supports the "norecovery" mount option which is basically the
same as the GFS2 spectator mode. This adds support for "norecovery"
as a synonym for spectator mode, which is hopefully a more obvious
description of what it actually does.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The tests further down the recovery function relating to
unlocking the journal need to be updated to match the
intial test. Also, a test in the umount code which was
surplus to requirements has been removed. Umounting
spectator mounts now works correctly, as expected.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
I have been seeing occasional pauses in transaction throughput up to
30s long under heavy parallel workloads. The only notable thing was
that the xfsaild was trying to be active during the pauses, but
making no progress. It was running exactly 20 times a second (on the
50ms no-progress backoff), and the number of pushbuf events was
constant across this time as well. IOWs, the xfsaild appeared to be
stuck on buffers that it could not push out.
Further investigation indicated that it was trying to push out inode
buffers that were pinned and/or locked. The xfsbufd was also getting
woken at the same frequency (by the xfsaild, no doubt) to push out
delayed write buffers. The xfsbufd was not making any progress
because all the buffers in the delwri queue were pinned. This scan-
and-make-no-progress dance went one in the trace for some seconds,
before the xfssyncd came along an issued a log force, and then
things started going again.
However, I noticed something strange about the log force - there
were way too many IO's issued. 516 log buffers were written, to be
exact. That added up to 129MB of log IO, which got me very
interested because it's almost exactly 25% of the size of the log.
He delayed logging code is suppose to aggregate the minimum of 25%
of the log or 8MB worth of changes before flushing. That's what
really puzzled me - why did a log force write 129MB instead of only
8MB?
Essentially what has happened is that no CIL pushes had occurred
since the previous tail push which cleared out 25% of the log space.
That caused all the new transactions to block because there wasn't
log space for them, but they kick the xfsaild to push the tail.
However, the xfsaild was not making progress because there were
buffers it could not lock and flush, and the xfsbufd could not flush
them because they were pinned. As a result, both the xfsaild and the
xfsbufd could not move the tail of the log forward without the CIL
first committing.
The cause of the problem was that the background CIL push, which
should happen when 8MB of aggregated changes have been committed, is
being held off by the concurrent transaction commit load. The
background push does a down_write_trylock() which will fail if there
is a concurrent transaction commit holding the push lock in read
mode. With 8 CPUs all doing transactions as fast as they can, there
was enough concurrent transaction commits to hold off the background
push until tail-pushing could no longer free log space, and the halt
would occur.
It should be noted that there is no reason why it would halt at 25%
of log space used by a single CIL checkpoint. This bug could
definitely violate the "no transaction should be larger than half
the log" requirement and hence result in corruption if the system
crashed under heavy load. This sort of bug is exactly the reason why
delayed logging was tagged as experimental....
The fix is to start blocking background pushes once the threshold
has been exceeded. Rework the threshold calculations to keep the
amount of log space a CIL checkpoint can use to below that of the
AIL push threshold to avoid the problem completely.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
This shouldn't really be required, but gcc can't tell that
"al" is only accessed when initialised.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Some of the functions in GFS2 were not reserving space in the transaction for
the resource group header and the resource groups bitblocks that get added
when you do allocation. GFS2 now makes sure to reserve space for the
resource group header and either all the bitblocks in the resource group, or
one for each block that it may allocate, whichever is smaller using the new
gfs2_rg_blocks() inline function.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
.get_sb is called on mounts with automatic fs detection too, so this
function should print an error if it cannot read the superblock in
debug mode only (new behaviour conforms the other fs types)
Signed-off-by: Steffen Sledz <sledz@dresearch.de>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
When checking journals for spectator mounts, we cannot rely on the
journal being locked, whatever its jid might be. This patch
ensures that we always get the journal locks when checking
journals for a spectator mount.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
o2dlm: force free mles during dlm exit
ocfs2: Sync inode flags with ext2.
ocfs2: Move 'wanted' into parens of ocfs2_resmap_resv_bits.
ocfs2: Use cpu_to_le16 for e_leaf_clusters in ocfs2_bg_discontig_add_extent.
ocfs2: update ctime when changing the file's permission by setfacl
ocfs2/net: fix uninitialized ret in o2net_send_message_vec()
Ocfs2: Handle empty list in lockres_seq_start() for dlmdebug.c
Ocfs2: Re-access the journal after ocfs2_insert_extent() in dxdir codes.
ocfs2: Fix lockdep warning in reflink.
ocfs2/lockdep: Move ip_xattr_sem out of ocfs2_xattr_get_nolock.
This option has never done anything useful. Also at the same time
this cleans up the sb checks which are done at mount time. The
debug option will be accepted, but ignored in future. Since it
didn't do anything, there didn't seem much point in retaining it.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
While umounting, a block mle doesn't get freed if dlm is shutdown after
master request is received but before assert master. This results in unclean
shutdown of dlm domain.
This patch frees all mles that lie around after other nodes were notified about
exiting the dlm and marking dlm state as leaving. Only block mles are expected
to be around, so we log ERROR for other mles but still free them.
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
We sync our inode flags with ext2 and define them by hex
values. But actually in commit 3669567(4 years ago), all
these values are moved to include/linux/fs.h. So we'd
better also use them as what ext2 did. So sync our inode
flags with ext2 by using FS_*.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
The first time I read the function ocfs2_resmap_resv_bits, I consider
about what 'wanted' will be used and consider about the comments.
Then I find it is only used if the reservation is empty. ;)
So we'd better move it to the parens so that it make the code more
readable, what's more, ocfs2_resmap_resv_bits is used so frequently
and we should save some cpus.
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
e_leaf_clusters is a le16, so use cpu_to_le16 instead
of cpu_to_le32.
What's more, we change 'clusters' to unsigned int to
signify that the size of 'clusters' isn't important here.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
In commit 30e2bab, ext3 fixed it. So change it accordingly in ocfs2.
Steps to reproduce:
# touch aaa
# stat -c %Z aaa
1283760364
# setfacl -m 'u::x,g::x,o::x' aaa
# stat -c %Z aaa
1283760364
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This option defaulted to on for lock_nolock mounts and off
otherwise. The only function was to avoid the revalidation of
dentries. In the cluster case, that is entirely pointless and
liable to cause coherency problems.
The patch changes the revalidation to depend upon whether the
fs is a local or cluster fs (i.e. it follows the existing default
behaviour). I very much doubt anybody ever used this option as
there is no reason to. Even so we will continue to accept it
on the mount command line, but ignore it.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This is been a no-op for a very long time now. I'm pretty sure
nobody uses it, but just in case we'll still accept it on the
command line, but ignore it.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
'excpet' should be 'except'.
'ext3_get_branch' should be 'ext2_get_branch'.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
OCFS2 can return ERESTARTSYS from its write function when the process is
signalled while waiting for a cluster lock (and the filesystem is mounted
with intr mount option). Generally, it seems reasonable to allow
filesystems to return this error code from its IO functions. As we must
not leak ERESTARTSYS (and similar error codes) to userspace as a result of
an AIO operation, we have to properly convert it to EINTR inside AIO code
(restarting the syscall isn't really an option because other AIO could
have been already submitted by the same io_submit syscall).
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 73296bc611 ("procfs: Use generic_file_llseek in /proc/vmcore")
broke seeking on /proc/vmcore. This changes it back to use default_llseek
in order to restore the original behaviour.
The problem with generic_file_llseek is that it only allows seeks up to
inode->i_sb->s_maxbytes, which is zero on procfs and some other virtual
file systems. We should merge generic_file_llseek and default_llseek some
day and clean this up in a proper way, but for 2.6.35/36, reverting vmcore
is the safer solution.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Reported-by: CAI Qian <caiqian@redhat.com>
Tested-by: CAI Qian <caiqian@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In 32-bit compatibility mode, the error handling for
compat_do_readv_writev() may free an uninitialized pointer, potentially
leading to all sorts of ugly memory corruption. This is reliably
triggerable by unprivileged users by invoking the readv()/writev()
syscalls with an invalid iovec pointer. The below patch fixes this to
emulate the non-compat version.
Introduced by commit b83733639a ("compat: factor out
compat_rw_copy_check_uvector from compat_do_readv_writev")
Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com>
Cc: stable@kernel.org (2.6.35)
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
bdi: Fix warnings in __mark_inode_dirty for /dev/zero and friends
char: Mark /dev/zero and /dev/kmem as not capable of writeback
bdi: Initialize noop_backing_dev_info properly
cfq-iosched: fix a kernel OOPs when usb key is inserted
block: fix blk_rq_map_kern bio direction flag
cciss: freeing uninitialized data on error path
Inodes of devices such as /dev/zero can get dirty for example via
utime(2) syscall or due to atime update. Backing device of such inodes
(zero_bdi, etc.) is however unable to handle dirty inodes and thus
__mark_inode_dirty complains. In fact, inode should be rather dirtied
against backing device of the filesystem holding it. This is generally a
good rule except for filesystems such as 'bdev' or 'mtd_inodefs'. Inodes
in these pseudofilesystems are referenced from ordinary filesystem
inodes and carry mapping with real data of the device. Thus for these
inodes we have to use inode->i_mapping->backing_dev_info as we did so
far. We distinguish these filesystems by checking whether sb->s_bdi
points to a non-trivial backing device or not.
Example: Assume we have an ext3 filesystem on /dev/sda1 mounted on /.
There's a device inode A described by a path "/dev/sdb" on this
filesystem. This inode will be dirtied against backing device "8:0"
after this patch. bdev filesystem contains block device inode B coupled
with our inode A. When someone modifies a page of /dev/sdb, it's B that
gets dirtied and the dirtying happens against the backing device "8:16".
Thus both inodes get filed to a correct bdi list.
Cc: stable@kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
These devices don't do any writeback but their device inodes still can get
dirty so mark bdi appropriately so that bdi code does the right thing and files
inodes to lists of bdi carrying the device inodes.
Cc: stable@kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: select CRYPTO
ceph: check mapping to determine if FILE_CACHE cap is used
ceph: only send one flushsnap per cap_snap per mds session
ceph: fix cap_snap and realm split
ceph: stop sending FLUSHSNAPs when we hit a dirty capsnap
ceph: correctly set 'follows' in flushsnap messages
ceph: fix dn offset during readdir_prepopulate
ceph: fix file offset wrapping at 4GB on 32-bit archs
ceph: fix reconnect encoding for old servers
ceph: fix pagelist kunmap tail
ceph: fix null pointer deref on anon root dentry release
Fsync performance for small files achieved by cfq on high-end disks is
lower than what deadline can achieve, due to idling introduced between
the sync write happening in process context and the journal commit.
Moreover, when competing with a sequential reader, a process writing
small files and fsync-ing them is starved.
This patch fixes the two problems by:
- marking journal commits as WRITE_SYNC, so that they get the REQ_NOIDLE
flag set,
- force all queues that have REQ_NOIDLE requests to be put in the noidle
tree.
Having the queue associated to the fsync-ing process and the one associated
to journal commits in the noidle tree allows:
- switching between them without idling,
- fairness vs. competing idling queues, since they will be serviced only
after the noidle tree expires its slice.
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Tested-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Rather than calculating the qstrs for . and .. each time
we need them, its better to keep a constant version of
these and just refer to them when required.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
The recovery workqueue can be freezable since
we want it to finish what it is doing if the system is to
be frozen (although why you'd want to freeze a cluster node
is beyond me since it will result in it being ejected from
the cluster). It does still make sense for single node
GFS2 filesystems though.
The glock workqueue will benefit from being able to run more
work items concurrently. A test running postmark shows
improved performance and multi-threaded workloads are likely
to benefit even more. It needs to be high priority because
the latency directly affects the latency of filesystem glock
operations.
The delete workqueue is similar to the recovery workqueue in
that it must not get blocked by memory allocations, and may
run for a long time.
Potentially other GFS2 threads might also be converted to
workqueues, but I'll leave that for a later patch.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
GFS2's idea of which return codes it needs to handle was based
upon those listed in dlm.h. Those didn't cover all the possible
codes and listed some which never happen. This updates GFS2 to
handle all the codes which can actually be returned from the
DLM under various circumstances.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Due to the design of the VFS, it is quite usual for operations on GFS2
to consist of a lookup (requiring a shared lock) followed by an
operation requiring an exclusive lock. If a remote node has cached an
exclusive lock, then it will receive two demote events in rapid succession
firstly for a shared lock and then to unlocked. The existing min hold time
code was triggering in this case, even if the node was otherwise idle
since the state change time was being updated by the initial demote.
This patch introduces logic to skip the min hold timer in the case that
a "double demote" of this kind has occurred. The min hold timer will
still be used in all other cases.
A new glock flag is introduced which is used to keep track of whether
there have been any newly queued holders since the last glock state
change. The min hold time is only applied if the flag is set.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Tested-by: Abhijith Das <adas@redhat.com>
This patch adds support for fallocate to gfs2. Since the gfs2 does not support
uninitialized data blocks, it must write out zeros to all the blocks. However,
since it does not need to lock any pages to read from, gfs2 can write out the
zero blocks much more efficiently. On a moderately full filesystem, fallocate
works around 5 times faster on average. The fallocate call also allows gfs2 to
add blocks to the file without changing the filesize, which will make it
possible for gfs2 to preallocate space for the rindex file, so that gfs2 can
grow a completely full filesystem.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This adds a check to ensure that if we reach the block allocator
that we don't try and proceed if there is no alloc structure
hanging off the inode. This should only happen if there is a bug
in GFS2. The error return code is distinctive in order that it
will be easily spotted.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
With the update of the truncate code, ip->i_disksize and
inode->i_size are merely copies of each other. This means
we can remove ip->i_disksize and use inode->i_size exclusively
reducing the size of a GFS2 inode by 8 bytes.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This updates GFS2's truncate code to use the new truncate
sequence correctly. This is a stepping stone to being
able to remove ip->i_disksize in favour of using i_size
everywhere now that the two sizes are always identical.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Commit 2fde99cb55 "UBIFS: mark VFS SB RO too"
introduced regression. This commit made UBIFS set the 'MS_RDONLY' flag in the
VFS superblock when it switches to R/O mode due to an error. This was done
to make VFS show the R/O UBIFS flag in /proc/mounts.
However, several places in UBIFS relied on the 'MS_RDONLY' flag and assume this
flag can only change when we re-mount. For example, 'ubifs_put_super()'.
This patch introduces new UBIFS flag - 'c->ro_mount' which changes only when
we re-mount, and preserves the way UBIFS was originally mounted (R/W or R/O).
This allows us to de-initialize UBIFS cleanly in 'ubifs_put_super()'.
This patch also changes all 'ubifs_assert(!c->ro_media)' assertions to
'ubifs_assert(!c->ro_media && !c->ro_mount)', because we never should write
anything if the FS was mounter R/O.
All the places where we test for 'MS_RDONLY' flag in the VFS SB were changed
and now we test the 'c->ro_mount' flag instead, because it preserves the
original UBIFS mount type, unlike the 'MS_RDONLY' flag.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Coda's REQ_* defines were renamed to avoid clashes with the block layer
(commit 4aeefdc69f7b: "coda: fixup clash with block layer REQ_*
defines").
However one was missed and response messages are no longer matched with
requests and waiting threads are no longer woken up. This patch fixes
this.
Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
[ Also fixed up whitespace while at it -Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mmotm/fs/ocfs2/cluster/tcp.c: In function ‘o2net_send_message_vec’:
mmotm/fs/ocfs2/cluster/tcp.c:980:6: warning: ‘ret’ may be used uninitialized in this function
It seems a real bug introduced by commit 9af0b38ff3 (ocfs2/net:
Use wait_event() in o2net_send_message_vec()).
cc: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
See if the i_data mapping has any pages to determine if the FILE_CACHE
capability is currently in use, instead of assuming it is any time the
rdcache_gen value is set (i.e., issued -> used).
This allows the MDS RECALL_STATE process work for inodes that have cached
pages.
Signed-off-by: Sage Weil <sage@newdream.net>
Sending multiple flushsnap messages is problematic because we ignore
the response if the tid doesn't match, and the server may only respond to
each one once. It's also a waste.
So, skip cap_snaps that are already on the flushing list, unless the caller
tells us to resend (because we are reconnecting).
Signed-off-by: Sage Weil <sage@newdream.net>
The R/O state may have various reasons:
1. The UBI volume is R/O
2. The FS is mounted R/O
3. The FS switched to R/O mode because of an error
However, in UBIFS we have only one variable which represents cases
1 and 3 - 'c->ro_media'. Indeed, we set this to 1 if we switch to
R/O mode due to an error, and then we test it in many places to
make sure that we stop writing as soon as the error happens.
But this is very unclean. One consequence of this, for example, is
that in 'ubifs_remount_fs()' we use 'c->ro_media' to check whether
we are in R/O mode because on an error, and we print a message
in this case. However, if we are in R/O mode because the media
is R/O, our message is bogus.
This patch introduces new flag - 'c->ro_error' which is set when
we switch to R/O mode because of an error. It also changes all
"if (c->ro_media)" checks to "if (c->ro_error)" checks, because
this is what the checks actually mean. We do not need to check
for 'c->ro_media' because if the UBI volume is in R/O mode, we
do not allow R/W mounting, and now writes can happen. This is
guaranteed by VFS. But it is good to double-check this, so this
patch also adds many "ubifs_assert(!c->ro_media)" checks.
In the 'ubifs_remount_fs()' function this patch makes a bit more
changes - it fixes the error messages as well.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Looks like this crept in, in a recent update.
Reported-by: Krzysztof Urbaniak <urban@bash.org.pl>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The cap_snap creation/queueing relies on both the current i_head_snapc
_and_ the i_snap_realm pointers being correct, so that the new cap_snap
can properly reference the old context and the new i_head_snapc can be
updated to reference the new snaprealm's context. To fix this, we:
- move inodes completely to the new (split) realm so that i_snap_realm
is correct, and
- generate the new snapc's _before_ queueing the cap_snaps in
ceph_update_snap_trace().
Signed-off-by: Sage Weil <sage@newdream.net>
All the blkdev_issue_* helpers can only sanely be used for synchronous
caller. To issue cache flushes or barriers asynchronously the caller needs
to set up a bio by itself with a completion callback to move the asynchronous
state machine ahead. So drop the BLKDEV_IFL_WAIT flag that is always
specified when calling blkdev_issue_* and also remove the now unused flags
argument to blkdev_issue_flush and blkdev_issue_zeroout. For
blkdev_issue_discard we need to keep it for the secure discard flag, which
gains a more descriptive name and loses the bitops vs flag confusion.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
When CONFIG_OCFS2_DEBUG_MASKLOG is undefined, we don't use the dentry
variable in ocfs2_sync_file(). Let's just move all access to the dentry
inside the logging call.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This change extends the partition_meta_info structure to
support EFI GPT-specific metadata and ensures that data
is copied in on partition scanning.
Signed-off-by: Will Drewry <wad@chromium.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
I'm reposting this patch series as v4 since there have been no additional
comments, and I cleaned up one extra bit of unneeded code (in 3/3). The patches
are against Linus's tree: 2bfc96a127
(2.6.36-rc3).
Would this patchset be suitable for inclusion in an mm branch?
This changes adds a partition_meta_info struct which itself contains a
union of structures that provide partition table specific metadata.
This change leaves the union empty. The subsequent patch includes an
implementation for CONFIG_EFI_PARTITION-based metadata.
Signed-off-by: Will Drewry <wad@chromium.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
SUNRPC: Fix the NFSv4 and RPCSEC_GSS Kconfig dependencies
statfs() gives ESTALE error
NFS: Fix a typo in nfs_sockaddr_match_ipaddr6
sunrpc: increase MAX_HASHTABLE_BITS to 14
gss:spkm3 miss returning error to caller when import security context
gss:krb5 miss returning error to caller when import security context
Remove incorrect do_vfs_lock message
SUNRPC: cleanup state-machine ordering
SUNRPC: Fix a race in rpc_info_open
SUNRPC: Fix race corrupting rpc upcall
Fix null dereference in call_allocate
Tavis Ormandy pointed out that do_io_submit does not do proper bounds
checking on the passed-in iocb array:
if (unlikely(nr < 0))
return -EINVAL;
if (unlikely(!access_ok(VERIFY_READ, iocbpp, (nr*sizeof(iocbpp)))))
return -EFAULT; ^^^^^^^^^^^^^^^^^^
The attached patch checks for overflow, and if it is detected, the
number of iocbs submitted is scaled down to a number that will fit in
the long. This is an ok thing to do, as sys_io_submit is documented as
returning the number of iocbs submitted, so callers should handle a
return value of less than the 'nr' argument passed in.
Reported-by: Tavis Ormandy <taviso@cmpxchg8b.com>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cifs_get_smb_ses must be called on a server pointer on which it holds an
active reference. It first does a search for an existing SMB session. If
it finds one, it'll put the server reference and then try to ensure that
the negprot is done, etc.
If it encounters an error at that point then it'll return an error.
There's a potential problem here though. When cifs_get_smb_ses returns
an error, the caller will also put the TCP server reference leading to a
double-put.
Fix this by having cifs_get_smb_ses only put the server reference if
it found an existing session that it could use and isn't returning an
error.
Cc: stable@kernel.org
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Stop sending FLUSHSNAP messages when we hit a capsnap that has dirty_pages
or is still writing. We'll send the newer capsnaps only after the older
ones complete.
Signed-off-by: Sage Weil <sage@newdream.net>
The 'follows' should match the seq for the snap context for the given snap
cap, which is the context under which we have been dirtying and writing
data and metadata. The snapshot that _contains_ those updates thus
_follows_ that context's seq #.
Signed-off-by: Sage Weil <sage@newdream.net>
When adding the readdir results to the cache, ceph_set_dentry_offset was
clobbered our just-set offset. This can cause the readdir result offsets
to get out of sync with the server. Add an argument to the helper so
that it does not.
This bug was introduced by 1cd3935bed.
Signed-off-by: Sage Weil <sage@newdream.net>
We should not use dotlversion for the dotu inode operations
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
We should use the cached dentry operation only if caching mode is enabled
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
NULL fid should be handled in cases where we endup calling v9fs_dir_release()
before even we instantiate the fid in filp.
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
This was introduced by 7cadb63d58a932041afa3f957d5cbb6ce69dcee5
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
The NFSv4 client's callback server calls svc_gss_principal(), which
is defined in the auth_rpcgss.ko
The NFSv4 server has the same dependency, and in addition calls
svcauth_gss_flavor(), gss_mech_get_by_pseudoflavor(),
gss_pseudoflavor_to_service() and gss_mech_put() from the same module.
The module auth_rpcgss itself has no dependencies aside from sunrpc,
so we only need to select RPCSEC_GSS.
Reported-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Hi,
An NFS client executes a statfs("file", &buff) call.
"file" exists / existed, the client has read / written it,
but it has already closed it.
user_path(pathname, &path) looks up "file" successfully in the
directory-cache and restarts the aging timer of the directory-entry.
Even if "file" has already been removed from the server, because the
lookupcache=positive option I use, keeps the entries valid for a while.
nfs_statfs() returns ESTALE if "file" has already been removed from the
server.
If the user application repeats the statfs("file", &buff) call, we
are stuck: "file" remains young forever in the directory-cache.
Signed-off-by: Zoltan Menyhart <Zoltan.Menyhart@bull.net>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
The do_vfs_lock function on fs/nfs/file.c is only called if NLM is
not being used, via the -onolock mount option. Therefore it cannot
really be "out of sync with lock manager" when the local locking
function called returns an error, as there will be no corresponding
call to the NLM. For details, simply check the if/else on do_setlk
and do_unlk on fs/nfs/file.c.
Signed-Off-By: Fabio Olive Leite <fleite@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cast the value before shifting so that we don't run out of bits with a
32-bit unsigned long. This fixes wrapping of high file offsets into the
low 4GB of a file on disk, and the subsequent data corruption for large
files.
Signed-off-by: Sage Weil <sage@newdream.net>
Fix the reconnect encoding to encode the cap record when the MDS does not
have the FLOCK capability (i.e., pre v0.22).
Signed-off-by: Sage Weil <sage@newdream.net>
When we release a root dentry, particularly after a splice, the parent
(actually our) inode was evaluating to NULL and was getting dereferenced
by ceph_snap(). This is reproduced by something as simple as
mount -t ceph monhost:/a/b mnt
mount -t ceph monhost:/a mnt2
ls mnt2
A splice_dentry() would kill the old 'b' inode's root dentry, and we'd
crash while releasing it.
Fix by checking for both the ROOT and NULL cases explicitly. We only need
to invalidate the parent dir when we have a correct parent to invalidate.
Signed-off-by: Sage Weil <sage@newdream.net>
This patch tries to handle the case in which list 'dlm->tracking_list' is
empty, to avoid accessing an invalid pointer. It fixes the following oops:
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1287
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
In ocfs2_dx_dir_rebalance(), we need to rejournal_acess the blocks after
calling ocfs2_insert_extent() since growing an extent tree may trigger
ocfs2_extend_trans(), which makes previous journal_access meaningless.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Track negative dentries by recording the generation number of the parent
directory in d_fsdata. The generation number for the parent directory is
recorded in the inode_info, which increments every time the lock on the
directory is dropped.
If the generation number of the parent directory and the negative dentry
matches, there is no need to perform the revalidate, else a revalidate
is forced. This improves performance in situations where nodes look for
the same non-existent file multiple times.
Thanks Mark for explaining the DLM sequence.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Durring orphan scan, if we are slot 0, and we are replaying
orphan_dir:0001, the general process is that for every file
in this dir:
1. we will iget orphan_dir:0001, since there is no inode for it.
we will have to create an inode and read it from the disk.
2. do the normal work, such as delete_inode and remove it from
the dir if it is allowed.
3. call iput orphan_dir:0001 when we are done. In this case,
since we have no dcache for this inode, i_count will
reach 0, and VFS will have to call clear_inode and in
ocfs2_clear_inode we will checkpoint the inode which will let
ocfs2_cmt and journald begin to work.
4. We loop back to 1 for the next file.
So you see, actually for every deleted file, we have to read the
orphan dir from the disk and checkpoint the journal. It is very
time consuming and cause a lot of journal checkpoint I/O.
A better solution is that we can have another reference for these
inodes in ocfs2_super. So if there is no other race among
nodes(which will let dlmglue to checkpoint the inode), for step 3,
clear_inode won't be called and for step 1, we may only need to
read the inode for the 1st time. This is a big win for us.
So this patch will try to cache system inodes of other slots so
that we will have one more reference for these inodes and avoid
the extra inode read and journal checkpoint.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
generic_check_addressable() erroneously shifts pages down by a block
factor when it should be shifting up. To prevent overflow, we shift
blocks down to pages.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
The OCFS2 developers have already done all of the hard work to allow
volumes larger than 16 TiB. But there is still a "sanity check" in
fs/ocfs2/super.c that prevents the mounting of such volumes, even when
the cluster size and journal options would allow it.
This patch replaces that sanity check with a more sophisticated one to
mount a huge volume provided that (a) it is addressable by the raw
word/address size of the system (borrowing a test from ext4); (b) the
volume is using JBD2; and (c) the JBD2_FEATURE_INCOMPAT_64BIT flag is
set on the journal.
I factored out the sanity check into its own function. I also moved it
from ocfs2_initialize_super() down to ocfs2_check_volume(); any earlier,
and the journal will not have been initialized yet.
This patch is one of a pair, and it depends on the other ("JBD2: Allow
feature checks before journal recovery").
I have tested this patch on small volumes, huge volumes, and huge
volumes without 64-bit block support in the journal. All of them appear
to work or to fail gracefully, as appropriate.
Signed-off-by: Patrick LoPresti <lopresti@gmail.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Before we start accessing a huge (> 16 TiB) OCFS2 volume, we need to
confirm that its journal supports 64-bit offsets. In particular, we
need to check the journal's feature bits before recovering the journal.
This is not possible with JBD2 at present, because the journal
superblock (where the feature bits reside) is not loaded from disk until
the journal is recovered.
This patch loads the journal superblock in
jbd2_journal_check_used_features() if it has not already been loaded,
allowing us to check the feature bits before journal recovery.
Signed-off-by: Patrick LoPresti <lopresti@gmail.com>
Cc: linux-ext4@vger.kernel.org
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
As part of adding support for OCFS2 to mount huge volumes, we need to
check that the sector_t and page cache of the system are capable of
addressing the entire volume.
An identical check already appears in ext3 and ext4. This patch moves
the addressability check into its own function in fs/libfs.c and
modifies ext3 and ext4 to invoke it.
[Edited to -EINVAL instead of BUG_ON() for bad blocksize_bits -- Joel]
Signed-off-by: Patrick LoPresti <lopresti@gmail.com>
Cc: linux-ext4@vger.kernel.org
Acked-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
ocfs2_sync_inode() is used only from ocfs2_sync_file(). But all data has
already been written before calling ocfs2_sync_file() and ocfs2 doesn't use
inode's private_list for tracking metadata buffers thus sync_mapping_buffers()
is superfluous as well.
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Thanks for the comments. I have incorportated them all.
CONFIG_OCFS2_FS_STATS is enabled and CONFIG_DEBUG_LOCK_ALLOC is disabled.
Statistics now look like -
ocfs2_write_ctxt: 2144 - 2136 = 8
ocfs2_inode_info: 1960 - 1848 = 112
ocfs2_journal: 168 - 160 = 8
ocfs2_lock_res: 336 - 304 = 32
ocfs2_refcount_tree: 512 - 472 = 40
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
In ocfs2, actually we don't allow any direct write pass i_size,
see the function ocfs2_prepare_inode_for_write. So we don't
need the bogus simple_setsize.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Now orphan scan worker has no trace log, so it is
very hard to tell whether it is finished or blocked.
So add 2 mlog trace log so that we can tell whether
the current orphan scan worker is blocked or not.
It does help when I analyzed a orphan scan bug.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
The reason why we need this ioctl is to offer the none-privileged
end-user a possibility to get filesys info gathering.
We use OCFS2_IOC_INFO to manipulate the new ioctl, userspace passes a
structure to kernel containing an array of request pointers and request
count, such as,
* From userspace:
struct ocfs2_info_blocksize oib = {
.ib_req = {
.ir_magic = OCFS2_INFO_MAGIC,
.ir_code = OCFS2_INFO_BLOCKSIZE,
...
}
...
}
struct ocfs2_info_clustersize oic = {
...
}
uint64_t reqs[2] = {(unsigned long)&oib,
(unsigned long)&oic};
struct ocfs2_info info = {
.oi_requests = reqs,
.oi_count = 2,
}
ret = ioctl(fd, OCFS2_IOC_INFO, &info);
* In kernel:
Get the request pointers from *info*, then handle each request one bye one.
Idea here is to make the spearated request small enough to guarantee
a better backward&forward compatibility since a small piece of request
would be less likely to be broken if filesys on raw disk get changed.
Currently, the following 7 requests are supported per the requirement from
userspace tool o2info, and I believe it will grow over time:-)
OCFS2_INFO_CLUSTERSIZE
OCFS2_INFO_BLOCKSIZE
OCFS2_INFO_MAXSLOTS
OCFS2_INFO_LABEL
OCFS2_INFO_UUID
OCFS2_INFO_FS_FEATURES
OCFS2_INFO_JOURNAL_SIZE
This ioctl is only specific to OCFS2.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
The workqueue implementation in 2.6.36-rcX has changed, resulting
in the workqueues no longer having dedicated threads for work
processing. This has caused severe livelocks under heavy parallel
create workloads because the log IO completions have been getting
held up behind metadata IO completions. Hence log commits would
stall, memory allocation would stall because pages could not be
cleaned, and lock contention on the AIL during inode IO completion
processing was being seen to slow everything down even further.
By making the log Io completion workqueue a high priority workqueue,
they are queued ahead of all data/metadata IO completions and
processed before the data/metadata completions. Hence the log never
gets stalled, and operations needed to clean memory can continue as
quickly as possible. This avoids the livelock conditions and allos
the system to keep running under heavy load as per normal.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
An execve with a very large total of argument/environment strings
can take a really long time in the execve system call. It runs
uninterruptibly to count and copy all the strings. This change
makes it abort the exec quickly if sent a SIGKILL.
Note that this is the conservative change, to interrupt only for
SIGKILL, by using fatal_signal_pending(). It would be perfectly
correct semantics to let any signal interrupt the string-copying in
execve, i.e. use signal_pending() instead of fatal_signal_pending().
We'll save that change for later, since it could have user-visible
consequences, such as having a timer set too quickly make it so that
an execve can never complete, though it always happened to work before.
Signed-off-by: Roland McGrath <roland@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds a preemption point during the copying of the argument and
environment strings for execve, in copy_strings(). There is already
a preemption point in the count() loop, so this doesn't add any new
points in the abstract sense.
When the total argument+environment strings are very large, the time
spent copying them can be much more than a normal user time slice.
So this change improves the interactivity of the rest of the system
when one process is doing an execve with very large arguments.
Signed-off-by: Roland McGrath <roland@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The CONFIG_STACK_GROWSDOWN variant of setup_arg_pages() does not
check the size of the argument/environment area on the stack.
When it is unworkably large, shift_arg_pages() hits its BUG_ON.
This is exploitable with a very large RLIMIT_STACK limit, to
create a crash pretty easily.
Check that the initial stack is not too large to make it possible
to map in any executable. We're not checking that the actual
executable (or intepreter, for binfmt_elf) will fit. So those
mappings might clobber part of the initial stack mapping. But
that is just userland lossage that userland made happen, not a
kernel problem.
Signed-off-by: Roland McGrath <roland@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
block: Range check cpu in blk_cpu_to_group
scatterlist: prevent invalid free when alloc fails
writeback: Fix lost wake-up shutting down writeback thread
writeback: do not lose wakeup events when forking bdi threads
cciss: fix reporting of max queue depth since init
block: switch s390 tape_block and mg_disk to elevator_change()
block: add function call to switch the IO scheduler from a driver
fs/bio-integrity.c: return -ENOMEM on kmalloc failure
bio-integrity.c: remove dependency on __GFP_NOFAIL
BLOCK: fix bio.bi_rw handling
block: put dev->kobj in blk_register_queue fail path
cciss: handle allocation failure
cfq-iosched: Documentation help for new tunables
cfq-iosched: blktrace print per slice sector stats
cfq-iosched: Implement tunable group_idle
cfq-iosched: Do group share accounting in IOPS when slice_idle=0
cfq-iosched: Do not idle if slice_idle=0
cciss: disable doorbell reset on reset_devices
blkio: Fix return code for mkdir calls
The XFS_IOC_FSGETXATTR ioctl allows unprivileged users to read 12
bytes of uninitialized stack memory, because the fsxattr struct
declared on the stack in xfs_ioc_fsgetxattr() does not alter (or zero)
the 12-byte fsx_pad member before copying it back to the user. This
patch takes care of it.
Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
This flag was only set for barrier buffers, which we don't submit
anymore.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Switch to the WRITE_FLUSH_FUA flag for journal commits and remove the
EOPNOTSUPP detection for barriers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Currently JBD2 relies blkdev_issue_flush() draining the queue when ASYNC_COMMIT
feature is set. This property is going away so make JBD2 wait for buffers it
needs on its own before submitting the cache flush.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Switch to the WRITE_FLUSH_FUA flag for journal commits and remove the
EOPNOTSUPP detection for barriers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Switch to the WRITE_FLUSH_FUA flag for log writes, remove the EOPNOTSUPP
detection for barriers and stop setting the barrier flag for discards.
tj: nilfs is now fixed to wait for discard completion. Updated this
patch accordingly and dropped warning about it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Switch to the WRITE_FLUSH_FUA flag for log writes and remove the EOPNOTSUPP
detection for barriers. Note that reiserfs had a fairly different code
path for barriers before as it wa the only filesystem actually making use
of them. The new code always uses the old non-barrier codepath and just
sets the WRITE_FLUSH_FUA explicitly for the journal commits.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Switch to the WRITE_FLUSH_FUA flag for log writes, remove the EOPNOTSUPP
detection for barriers and stop setting the barrier flag for discards.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Switch to the WRITE_FLUSH_FUA flag for log writes, remove the EOPNOTSUPP
detection for barriers and stop setting the barrier flag for discards.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Switch to the WRITE_FLUSH_FUA flag for log writes and remove the EOPNOTSUPP
detection for barriers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
We'll need to get rid of the BLKDEV_IFL_BARRIER flag, and to facilitate
that and to make the interface less confusing pass all flags explicitly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
O_NONBLOCK on parisc has a dual value:
#define O_NONBLOCK 000200004 /* HPUX has separate NDELAY & NONBLOCK */
It is caught by the O_* bits uniqueness check and leads to a parisc
compile error. The fix would be to take O_NONBLOCK out.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: James Bottomley <James.Bottomley@suse.de>
Cc: Jamie Lokier <jamie@shareable.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 74641f584d ("alpha: binfmt_aout fix") (May 2009) introduced a
regression - binfmt_misc is now consulted after binfmt_elf, which will
unfortunately break ia32el. ia32 ELF binaries on ia64 used to be matched
using binfmt_misc and executed using wrapper. As 32bit binaries are now
matched by binfmt_elf before bindmt_misc kicks in, the wrapper is ignored.
The fix increases precedence of binfmt_misc to the original state.
Signed-off-by: Jan Sembera <jsembera@suse.cz>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Richard Henderson <rth@twiddle.net
Cc: <stable@kernel.org> [2.6.everything.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the left-over old ifdef for PG_uncached in /proc/kpageflags. Now it's
used by x86, too.
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
commit c2c6ca4 (direct-io: do not merge logically non-contiguous requests)
introduced a bug whereby all O_DIRECT I/Os were submitted a page at a time
to the block layer. The problem is that the code expected
dio->block_in_file to correspond to the current page in the dio. In fact,
it corresponds to the previous page submitted via submit_page_section.
This was purely an oversight, as the dio->cur_page_fs_offset field was
introduced for just this purpose. This patch simply uses the correct
variable when calculating whether there is a mismatch between contiguous
logical blocks and contiguous physical blocks (as described in the
comments).
I also switched the if conditional following this check to an else if, to
ensure that we never call dio_bio_submit twice for the same dio (in
theory, this should not happen, anyway).
I've tested this by running blktrace and verifying that a 64KB I/O was
submitted as a single I/O. I also ran the patched kernel through
xfstests' aio tests using xfs, ext4 (with 1k and 4k block sizes) and btrfs
and verified that there were no regressions as compared to an unpatched
kernel.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Josef Bacik <jbacik@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: <stable@kernel.org> [2.6.35.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
So it can be used by all that need to check for that.
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'fixes' of git://oss.oracle.com/git/tma/linux-2.6:
ocfs2: Fix orphan add in ocfs2_create_inode_in_orphan
ocfs2: split out ocfs2_prepare_orphan_dir() into locking and prep functions
ocfs2: allow return of new inode block location before allocation of the inode
ocfs2: use ocfs2_alloc_dinode_update_counts() instead of open coding
ocfs2: split out inode alloc code from ocfs2_mknod_locked
Ocfs2: Fix a regression bug from mainline commit(6b933c8e6f).
ocfs2: Fix deadlock when allocating page
ocfs2: properly set and use inode group alloc hint
ocfs2: Use the right group in nfs sync check.
ocfs2: Flush drive's caches on fdatasync
ocfs2: make __ocfs2_page_mkwrite handle file end properly.
ocfs2: Fix incorrect checksum validation error
ocfs2: Fix metaecc error messages
cifs_demultiplex_thread sets the addr.sockAddr.sin_port without any
regard for the socket family. While it may be that the error in question
here never occurs on an IPv6 socket, it's probably best to be safe and
set the port properly if it ever does.
Break the port setting code out of cifs_fill_sockaddr and into a new
function, and call that from cifs_demultiplex_thread.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
If the tcpStatus is still CifsNew, the main cifs_demultiplex_loop can
break out prematurely in some cases. This is wrong as we will almost
always have other structures with pointers to the TCP_Server_Info. If
the main loop breaks under any other condition other than tcpStatus ==
CifsExiting, then it'll face a use-after-free situation.
I don't see any reason to treat a CifsNew tcpStatus differently than
CifsGood. I believe we'll still want to attempt to reconnect in either
case. What should happen in those situations is that the MIDs get marked
as MID_RETRY_NEEDED. This will make CIFSSMBNegotiate return -EAGAIN, and
then the caller can retry the whole thing on a newly reconnected socket.
If that fails again in the same way, the caller of cifs_get_smb_ses
should tear down the TCP_Server_Info struct.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
When cifs_demultiplex_thread exits, it does a number of cleanup tasks
including freeing the TCP_Server_Info struct. Much of the existing code
in cifs assumes that when there is a cisfSesInfo struct, that it holds a
reference to a valid TCP_Server_Info struct.
We can never allow cifsd to exit when a cifsSesInfo struct is still
holding a reference to the server. The server pointers will then point
to freed memory.
This patch eliminates a couple of questionable conditions where it does
this. The idea here is to make an -EINTR return from kernel_recvmsg
behave the same way as -ERESTARTSYS or -EAGAIN. If the task was
signalled from cifs_put_tcp_session, then tcpStatus will be CifsExiting,
and the kernel_recvmsg call will return quickly.
There's also another condition where this can occur too -- if the
tcpStatus is still in CifsNew, then it will also exit if the server
closes the socket prematurely. I think we'll probably also need to fix
that situation, but that requires a bit more consideration.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
This function is not used, so remove the definition and declaration.
Reviewed-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The VFS always checks that the source and target of a rename are on the
same vfsmount, and hence have the same superblock. So, this check is
redundant. Remove it and simplify the error handling.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
This reverts commit 9fbc590860.
The change to kernel crypto and fixes to ntlvm2 and ntlmssp
series, introduced a regression. Deferring this patch series
to 2.6.37 after Shirish fixes it.
Signed-off-by: Steve French <sfrench@us.ibm.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
CC: Shirish Pargaonkar <shirishp@us.ibm.com>
This reverts commit 3ec6bbcdb4.
The change to kernel crypto and fixes to ntlvm2 and ntlmssp
series, introduced a regression. Deferring this patch series
to 2.6.37 after Shirish fixes it.
Signed-off-by: Steve French <sfrench@us.ibm.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
CC: Shirish Pargaonkar <shirishp@us.ibm.com>
This reverts commit 2d20ca8358.
The change to kernel crypto and fixes to ntlvm2 and ntlmssp
series, introduced a regression. Deferring this patch series
to 2.6.37 after Shirish fixes it.
Signed-off-by: Steve French <sfrench@us.ibm.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
CC: Shirish Pargaonkar <shirishp@us.ibm.com>
The change to kernel crypto and fixes to ntlvm2 and ntlmssp
series, introduced a regression. Deferring this patch series
to 2.6.37 after Shirish fixes it.
This reverts commit c89e5198b2.
Signed-off-by: Steve French <sfrench@us.ibm.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
CC: Shirish Pargaonkar <shirishp@us.ibm.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: fix lock annotations
fuse: flush background queue on connection close
ocfs2_create_inode_in_orphan() is used by reflink to create the newly
reflinked inode simultaneously in the orphan dir. This allows us to easily
handle partially-reflinked files during recovery cleanup.
We have a problem though - the orphan dir stringifies inode # to determine
a unique name under which the orphan entry dirent can be created. Since
ocfs2_create_inode_in_orphan() needs the space allocated in the orphan dir
before it can allocate the inode, we currently call into the orphan code:
/*
* We give the orphan dir the root blkno to fake an orphan name,
* and allocate enough space for our insertion.
*/
status = ocfs2_prepare_orphan_dir(osb, &orphan_dir,
osb->root_blkno,
orphan_name, &orphan_insert);
Using osb->root_blkno might work fine on unindexed directories, but the
orphan dir can have an index. When it has that index, the above code fails
to allocate the proper index entry. Later, when we try to remove the file
from the orphan dir (using the actual inode #), the reflink operation will
fail.
To fix this, I created a function ocfs2_alloc_orphaned_file() which uses the
newly split out orphan and inode alloc code to figure out what the inode
block number will be (once allocated) and then prepare the orphan dir from
that data.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
We do this because ocfs2_create_inode_in_orphan() wants to order locking of
the orphan dir with respect to locking of the inode allocator *before*
making any changes to the directory.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
This allows code which needs to know the eventual block number of an inode
but can't allocate it yet due to transaction or lock ordering. For example,
ocfs2_create_inode_in_orphan() currently gives a junk blkno for preparation
of the orphan dir because it can't yet know where the actual inode is placed
- that code is actually in ocfs2_mknod_locked. This is a problem when the
orphan dirs are indexed as the junk inode number will create an index entry
which goes unused (and fails the later removal from the orphan dir). Now
with these interfaces, ocfs2_create_inode_in_orphan() can run the block
group search (and get back the inode block number) *before* any actual
allocation occurs.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
ocfs2_search_chain() makes the same updates as
ocfs2_alloc_dinode_update_counts to the alloc inode. Instead of open coding
the bitmap update, use our helper function.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Do this by splitting the bulk of the function away from the inode allocation
code at the very tom of ocfs2_mknod_locked(). Existing callers don't need to
change and won't see any difference. The new function created,
__ocfs2_mknod_locked() will be used shortly.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
The patch is to fix the regression bug brought from commit 6b933c8...( 'ocfs2:
Avoid direct write if we fall back to buffered I/O'):
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1285
The commit 6b933c8e6f changed __generic_file_aio_write
to generic_file_buffered_write, which didn't call filemap_{write,wait}_range to flush
the pagecaches when we were falling O_DIRECT writes back to buffered ones. it did hurt
the O_DIRECT semantics somehow in extented odirect writes.
This patch tries to guarantee O_DIRECT writes of 'fall back to buffered' to be correctly
flushed.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
We cannot call grab_cache_page() when holding filesystem locks or with
a transaction started as grab_cache_page() calls page allocation with
GFP_KERNEL flag and thus page reclaim can recurse back into the filesystem
causing deadlocks or various assertion failures. We have to use
find_or_create_page() instead and pass it GFP_NOFS as we do with other
allocations.
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
We were setting ac->ac_last_group in ocfs2_claim_suballoc_bits from
res->sr_bg_blkno. Unfortunately, res->sr_bg_blkno is going to be zero under
normal (non-fragmented) circumstances. The discontig block group patches
effectively turned off that feature. Fix this by correctly calculating what
the next group hint should be.
Acked-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
We have added discontig block group now, and now an inode
can be allocated in an discontig block group. So get
it in ocfs2_get_suballoc_slot_bit.
The old ocfs2_test_suballoc_bit gets group block no
from the allocation inode which is wrong. Fix it by
passing the right group.
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
When 'barrier' mount option is specified, we have to issue a cache flush
during fdatasync(2). We have to do this even if inode doesn't have
I_DIRTY_DATASYNC set because we still have to get written *data* to disk so
that they are not lost in case of crash.
Acked-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Singed-off-by: Tao Ma <tao.ma@oracle.com>
__ocfs2_page_mkwrite now is broken in handling file end.
1. the last page should be the page contains i_size - 1.
2. the len in the last page is also calculated wrong.
So change them accordingly.
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
For local mounts, ocfs2_read_locked_inode() calls ocfs2_read_blocks_sync() to
read the inode off the disk. The latter first checks to see if that block is
cached in the journal, and, if so, returns that block. That is ok.
But ocfs2_read_locked_inode() goes wrong when it tries to validate the checksum
of such blocks. Blocks that are cached in the journal may not have had their
checksum computed as yet. We should not validate the checksums of such blocks.
Fixes ossbz#1282
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1282
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Cc: stable@kernel.org
Singed-off-by: Tao Ma <tao.ma@oracle.com>
Like tools, the checksum validate function now prints the values in hex.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Singed-off-by: Tao Ma <tao.ma@oracle.com>
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: Make fiemap work with sparse files
xfs: prevent 32bit overflow in space reservation
xfs: Disallow 32bit project quota id
xfs: improve buffer cache hash scalability
Sanity check the flags passed to change_mnt_propagation(). Exactly
one flag should be set. Return EINVAL otherwise.
Userspace can pass in arbitrary combinations of MS_* flags to mount().
do_change_type() is called if any of MS_SHARED, MS_PRIVATE, MS_SLAVE,
or MS_UNBINDABLE is set. do_change_type() clears MS_REC and then
calls change_mnt_propagation() with the rest of the user-supplied
flags. change_mnt_propagation() clearly assumes only one flag is set
but do_change_type() does not check that this is true. For example,
mount() with flags MS_SHARED | MS_RDONLY does not actually make the
mount shared or read-only but does clear MNT_UNBINDABLE.
Signed-off-by: Valerie Aurora <vaurora@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sparse doesn't understand lock annotations of the form
__releases(&foo->lock). Change them to __releases(foo->lock). Same
for __acquires().
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
David Bartly reported that fuse can hang in fuse_get_req_nofail() when
the connection to the filesystem server is no longer active.
If bg_queue is not empty then flush_bg_queue() called from
request_end() can put more requests on to the pending queue. If this
happens while ending requests on the processing queue then those
background requests will be queued to the pending list and never
ended.
Another problem is that fuse_dev_release() didn't wake up processes
sleeping on blocked_waitq.
Solve this by:
a) flushing the background queue before calling end_requests() on the
pending and processing queues
b) setting blocked = 0 and waking up processes waiting on
blocked_waitq()
Thanks to David for an excellent bug report.
Reported-by: David Bartley <andareed@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
Function pnode_lookup may return ERR_PTR(...). Check for it.
Signed-off-by: Vasiliy Kulikov <segooon@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Function ubifs_lpt_lookup may return ERR_PTR(...). Check for it.
[Tweaked by Artem Bityutskiy]
Signed-off-by: Vasiliy Kulikov <segooon@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
d_path() returns an ERR_PTR and it doesn't return NULL.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Cc: stable <stable@kernel.org>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When converting a lock, an lkb is in the granted state and also being used
to request a new state. In the case that the conversion was a "try 1cb"
type which has failed, and if the new state was incompatible with the old
state, a callback was being generated to the requesting node. This is
incorrect as callbacks should only be sent to all the other nodes holding
blocking locks. The requesting node should receive the normal (failed)
response to its "try 1cb" conversion request only.
This was discovered while debugging a performance problem on GFS2, however
this fix also speeds up GFS as well. In the GFS2 case the performance gain
is over 10x for cases of write activity to an inode whose glock is cached
on another, idle (wrt that glock) node.
(comment added, dct)
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Tested-by: Abhijith Das <adas@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
In xfs_vn_fiemap, we set bvm_count to fi_extent_max + 1 and want
to return fi_extent_max extents, but actually it won't work for
a sparse file. The reason is that in xfs_getbmap we will
calculate holes and set it in 'out', while out is malloced by
bmv_count(fi_extent_max+1) which didn't consider holes. So in the
worst case, if 'out' vector looks like
[hole, extent, hole, extent, hole, ... hole, extent, hole],
we will only return half of fi_extent_max extents.
This patch add a new parameter BMV_IF_NO_HOLES for bvm_iflags.
So with this flags, we don't use our 'out' in xfs_getbmap for
a hole. The solution is a bit ugly by just don't increasing
index of 'out' vector. I felt that it is not easy to skip it
at the very beginning since we have the complicated check and
some function like xfs_getbmapx_fix_eof_hole to adjust 'out'.
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
If we attempt to preallocate more than 2^32 blocks of space in a
single syscall, the transaction block reservation will overflow
leading to a hangs in the superblock block accounting code. This
is trivially reproduced with xfs_io. Fix the problem by capping the
allocation reservation to the maximum number of blocks a single
xfs_bmapi() call can allocate (2^21 blocks).
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Currently on-disk structure is able to keep only 16bit project quota
id, so disallow 32bit ones. This fixes a problem where parts of
kernel structures holding project quota id are 32bit while parts
(on-disk) are 16bit variables which causes project quota member
files to be inaccessible for some operations (like mv/rm).
Signed-off-by: Arkadiusz Mi?kiewicz <arekm@maven.pl>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
When doing large parallel file creates on a 16p machines, large amounts of
time is being spent in _xfs_buf_find(). A system wide profile with perf top
shows this:
1134740.00 19.3% _xfs_buf_find
733142.00 12.5% __ticket_spin_lock
The problem is that the hash contains 45,000 buffers, and the hash table width
is only 256 buffers. That means we've got around 200 buffers per chain, and
searching it is quite expensive. The hash table size needs to increase.
Secondly, every time we do a lookup, we promote the buffer we find to the head
of the hash chain. This is causing cachelines to be dirtied and causes
invalidation of cachelines across all CPUs that may have walked the hash chain
recently. hence every walk of the hash chain is effectively a cold cache walk.
Remove the promotion to avoid this invalidation.
The results are:
1045043.00 21.2% __ticket_spin_lock
326184.00 6.6% _xfs_buf_find
A 70% drop in the CPU usage when looking up buffers. Unfortunately that does
not result in an increase in performance underthis workload as contention on
the inode_lock soaks up most of the reduction in CPU usage.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
p9_client_walk() can return error values if we run out of space or there
is a problem with the network.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
When an error happens during validation of read node, the typical situation is that
the LEB we read is unmapped (due to some bug). It is handy to include the mapping
status into the error message.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
The UBIFS bug in the GC list sorting comparison functions inspired
me to write internal debugging check functions which verify that
the list of nodes is sorted properly.
So, this patch implements 2 new debugging functions:
o 'dbg_check_data_nodes_order()' - check order of data nodes list
o 'dbg_check_nondata_nodes_order()' - check order of non-data nodes list
The debugging functions are executed only if general UBIFS debugging checks are
enabled. And they are compiled out if UBIFS debugging is disabled.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
When running the integrity test ('integck' from mtd-utils) on current
UBIFS on 2.6.35, I see that assertions in UBIFS 'list_sort()' comparison
functions trigger sometimes, e.g.:
UBIFS assert failed in data_nodes_cmp at 132 (pid 28311)
My investigation showed that this happens when 'list_sort()' calls the 'cmp()'
function with equivalent arguments. In this case, the 'struct list_head'
parameter, passed to 'cmp()' is bogus, and it does not belong to any element in
the original list.
And this issue seems to be introduced by commit:
commit 835cc0c847
Author: Don Mullis <don.mullis@gmail.com>
Date: Fri Mar 5 13:43:15 2010 -0800
It is easy to work around the issue by doing:
if (a == b)
return 0;
in UBIFS. It works, but 'lib_sort()' should nevertheless be fixed. Although it
is harmless to have this piece of code in UBIFS.
This patch adds that code to both UBIFS 'cmp()' functions:
'data_nodes_cmp()' and 'nondata_nodes_cmp()'.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
When scanning the flash, UBIFS builds a list of flash nodes of type
'struct ubifs_scan_node'. Each scanned node has a 'snod->key' field. This field
is valid for most of the nodes, but invalid for some node type, e.g., truncation
nodes. It is safer to explicitly initialize such keys to something invalid,
rather than leaving them initialized to all zeros, which has key type of
UBIFS_INO_KEY.
This patch introduces new "fake" key type UBIFS_INVALID_KEY and initializes
unused 'snod->key' objects to this type. It also adds debugging assertions in
the TNC code to make sure no one ever tries to look these nodes up in the TNC.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
In the scanning code, in 'ubifs_add_snod()', we write rubbish into
'snod->key', because we assume that on-flash truncation nodes have a key, but
they do not. If the other parts of UBIFS then mistakenly try to look-up
the truncation node key (they should not do this, but may do because of a bug),
we can succeed and corrupt TNC. It looks like we did have such a situation in
'sort_nodes()' in gc.c.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Improve assertions in gc.c in the comparison functions for 'list_sort()': check
key types _and_ node types.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
In comparison function for 'list_sort()' we use key type to distinguish between
node types. However, we have a bit simper way to detect node type -
'snod->type'. This more logical to use, comparing to decoding key types. Also
allows to get rid of 2 local variables.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
When moving nodes in GC, do not try to look up truncation nodes in TNC,
because they do not exist there. This would be harmless, because the TNC
look-up would fail, if we did not have bug 'ubifs_add_snod()' which reads
garbage into 'snod->key'. But in any case, it is less error prone to
explicitly ignore everything but inode, data, dentry and xentry nodes.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
This patch fixes the following false assertion warning:
UBIFS assert failed in data_nodes_cmp at 130 (pid 15107)
The assertion was wrong because it did not take into account that the
node can be an xentry.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
'ubifs_garbage_collect_leb()' should never return '-ENOSPC', and if it
does, this is an error. Thus, do not treat this error code specially.
'-EAGAIN' is a special error code, but not '-ENOSPC'.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
In 'ubifs_garbage_collect()' on error path, we first switch to R/O mode, and
then synchronize write-buffers (to make sure no data are lost). But the GC
write-buffer synchronization will fail, because we are already in R/O mode.
This patch re-orders this and makes sure we first synchronize the write-buffer,
and then switch to R/O mode.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
If load_nilfs() gets an error while doing recovery, it will fail to
free the shadow inode of dat (nilfs->ns_gc_dat).
This fixes the leak issue.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
* 'for-linus' of git://git.infradead.org/users/eparis/notify:
fsnotify: drop two useless bools in the fnsotify main loop
fsnotify: fix list walk order
fanotify: Return EPERM when a process is not privileged
fanotify: resize pid and reorder structure
fanotify: drop duplicate pr_debug statement
fanotify: flush outstanding perm requests on group destroy
fsnotify: fix ignored mask handling between inode and vfsmount marks
fanotify: add MAINTAINERS entry
fsnotify: reset used_inode and used_vfsmount on each pass
fanotify: do not dereference inode_mark when it is unset
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: fix get_ticket_handler() error handling
ceph: don't BUG on ENOMEM during mds reconnect
ceph: ceph_mdsc_build_path() returns an ERR_PTR
ceph: Fix warnings
ceph: ceph_get_inode() returns an ERR_PTR
ceph: initialize fields on new dentry_infos
ceph: maintain i_head_snapc when any caps are dirty, not just for data
ceph: fix osd request lru adjustment when sending request
ceph: don't improperly set dir complete when holding EXCL cap
mm: exporting account_page_dirty
ceph: direct requests in snapped namespace based on nonsnap parent
ceph: queue cap snap writeback for realm children on snap update
ceph: include dirty xattrs state in snapped caps
ceph: fix xattr cap writeback
ceph: fix multiple mds session shutdown
* 'for-2.6.36' of git://linux-nfs.org/~bfields/linux:
nfsd: fix NULL dereference in nfsd_statfs()
nfsd4: fix downgrade/lock logic
nfsd4: typo fix in find_any_file
nfsd4: bad BUG() in preprocess_stateid_op
Setting the task state here may cause us to miss the wake up from
kthread_stop(), so we need to recheck kthread_should_stop() or risk
sleeping forever in the following schedule().
Symptom was an indefinite hang on an NFSv4 mount. (NFSv4 may create
multiple mounts in a temporary namespace while traversing the mount
path, and since the temporary namespace is immediately destroyed, it may
end up destroying a mount very soon after it was created, possibly
making this race more likely.)
INFO: task mount.nfs4:4314 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mount.nfs4 D 0000000000000000 2880 4314 4313 0x00000000
ffff88001ed6da28 0000000000000046 ffff88001ed6dfd8 ffff88001ed6dfd8
ffff88001ed6c000 ffff88001ed6c000 ffff88001ed6c000 ffff88001e5003a0
ffff88001ed6dfd8 ffff88001e5003a8 ffff88001ed6c000 ffff88001ed6dfd8
Call Trace:
[<ffffffff8196090d>] schedule_timeout+0x1cd/0x2e0
[<ffffffff8106a31c>] ? mark_held_locks+0x6c/0xa0
[<ffffffff819639a0>] ? _raw_spin_unlock_irq+0x30/0x60
[<ffffffff8106a5fd>] ? trace_hardirqs_on_caller+0x14d/0x190
[<ffffffff819671fe>] ? sub_preempt_count+0xe/0xd0
[<ffffffff8195fc80>] wait_for_common+0x120/0x190
[<ffffffff81033c70>] ? default_wake_function+0x0/0x20
[<ffffffff8195fdcd>] wait_for_completion+0x1d/0x20
[<ffffffff810595fa>] kthread_stop+0x4a/0x150
[<ffffffff81061a60>] ? thaw_process+0x70/0x80
[<ffffffff810cc68a>] bdi_unregister+0x10a/0x1a0
[<ffffffff81229dc9>] nfs_put_super+0x19/0x20
[<ffffffff810ee8c4>] generic_shutdown_super+0x54/0xe0
[<ffffffff810ee9b6>] kill_anon_super+0x16/0x60
[<ffffffff8122d3b9>] nfs4_kill_super+0x39/0x90
[<ffffffff810eda45>] deactivate_locked_super+0x45/0x60
[<ffffffff810edfb9>] deactivate_super+0x49/0x70
[<ffffffff81108294>] mntput_no_expire+0x84/0xe0
[<ffffffff811084ef>] release_mounts+0x9f/0xc0
[<ffffffff81108575>] put_mnt_ns+0x65/0x80
[<ffffffff8122cc56>] nfs_follow_remote_path+0x1e6/0x420
[<ffffffff8122cfbf>] nfs4_try_mount+0x6f/0xd0
[<ffffffff8122d0c2>] nfs4_get_sb+0xa2/0x360
[<ffffffff810edcb8>] vfs_kern_mount+0x88/0x1f0
[<ffffffff810ede92>] do_kern_mount+0x52/0x130
[<ffffffff81963d9a>] ? _lock_kernel+0x6a/0x170
[<ffffffff81108e9e>] do_mount+0x26e/0x7f0
[<ffffffff81106b3a>] ? copy_mount_options+0xea/0x190
[<ffffffff811094b8>] sys_mount+0x98/0xf0
[<ffffffff810024d8>] system_call_fastpath+0x16/0x1b
1 lock held by mount.nfs4/4314:
#0: (&type->s_umount_key#24){+.+...}, at: [<ffffffff810edfb1>] deactivate_super+0x41/0x70
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Acked-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
The fsnotify main loop has 2 bools which indicated if we processed the
inode or vfsmount mark in that particular pass through the loop. These
bool can we replaced with the inode_group and vfsmount_group variables
and actually make the code a little easier to understand.
Signed-off-by: Eric Paris <eparis@redhat.com>
Marks were stored on the inode and vfsmonut mark list in order from
highest memory address to lowest memory address. The code to walk those
lists thought they were in order from lowest to highest with
unpredictable results when trying to match up marks from each. It was
possible that extra events would be sent to userspace when inode
marks ignoring events wouldn't get matched with the vfsmount marks.
This problem only affected fanotify when using both vfsmount and inode
marks simultaneously.
Signed-off-by: Eric Paris <eparis@redhat.com>
The appropriate error code when privileged operations are denied is
EPERM, not EACCES.
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Eric Paris <paris@paris.rdu.redhat.com>
Fixes a regression caused by 21edad3220
When file name encryption was enabled, ecryptfs_lookup() failed to use
the encrypted and encoded version of the upper, plaintext, file name
when performing a lookup in the lower file system. This made it
impossible to lookup existing encrypted file names and any newly created
files would have plaintext file names in the lower file system.
https://bugs.launchpad.net/ecryptfs/+bug/623087
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Some ecryptfs init functions are not prefixed by __init and thus not
freed after initialization. This patch saved about 1kB in ecryptfs
module.
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
In this code, 0 is returned on memory allocation failure, even though other
failures return -ENOMEM or other similar values.
A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)
// <smpl>
@@
expression ret;
expression x,e1,e2,e3;
@@
ret = 0
... when != ret = e1
*x = \(kmalloc\|kcalloc\|kzalloc\)(...)
... when != ret = e2
if (x == NULL) { ... when != ret = e3
return ret;
}
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
The commit ebabe9a900
pass a struct path to vfs_statfs
introduced the struct path initialization, and this seems to trigger
an Oops on my machine.
fh_dentry field may be NULL and set later in fh_verify(), thus the
initialization of path must be after fh_verify().
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If we already had a RW open for a file, and get a readonly open, we were
piggybacking on the existing RW open. That's inconsistent with the
downgrade logic which blows away the RW open assuming you'll still have
a readonly open.
Also, make sure there is a readonly or writeonly open available for
locking, again to prevent bad behavior in downgrade cases when any RW
open may be lost.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
It's OK for this function to return without setting filp--we do it in
the special-stateid case.
And there's a legitimate case where we can hit this, since we do permit
reads on write-only stateid's.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
On 08/26/2010 01:56 AM, joe hefner wrote:
> On a recent Fedora (13), I am seeing a mount failure message that I can not explain. I have a Windows Server 2003ýa with a share set up for access only for a specific username (say userfoo). If I try to mount it from Linux,ýusing userfoo and the correct password all is well. If I try with a bad password or with some other username (userbar), it fails with "Permission denied" as expected. If I try to mount as username = administrator, and give the correct administrator password, I would also expect "Permission denied", but I see "Cannot allocate memory" instead.
> ýfs/cifs/netmisc.c: Mapping smb error code 5 to POSIX err -13
> ýfs/cifs/cifssmb.c: Send error in QPathInfo = -13
> ýCIFS VFS: cifs_read_super: get root inode failed
Looks like the commit 0b8f18e3 assumed that cifs_get_inode_info() and
friends fail only due to memory allocation error when the inode is NULL
which is not the case if CIFSSMBQPathInfo() fails and returns an error.
Fix this by propagating the actual error code back.
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
get_ticket_handler() returns a valid pointer or it returns
ERR_PTR(-ENOMEM) if kzalloc() fails.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
ceph_mdsc_build_path() returns an ERR_PTR but this code is set up to
handle NULL returns.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Just scrubbing some warnings so I can see real problem ones in the build
noise. For 32bit we need to coax gcc politely into believing we really
honestly intend to the casts. Using (u64)(unsigned long) means we cast from
a pointer to a type of the right size and then extend it. This stops the
warning spew.
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Sage Weil <sage@newdream.net>
ceph_get_inode() returns an ERR_PTR and it doesn't return a NULL.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
Eliminate sparse warning - bad constant expression
cifs: check for NULL session password
missing changes during ntlmv2/ntlmssp auth and sign
[CIFS] Fix ntlmv2 auth with ntlmssp
cifs: correction of unicode header files
cifs: fix NULL pointer dereference in cifs_find_smb_ses
cifs: consolidate error handling in several functions
cifs: clean up error handling in cifs_mknod
We used to use i_head_snapc to keep track of which snapc the current epoch
of dirty data was dirtied under. It is used by queue_cap_snap to set up
the cap_snap. However, since we queue cap snaps for any dirty caps, not
just for dirty file data, we need to keep a valid i_head_snapc anytime
we have dirty|flushing caps. This fixes a NULL pointer deref in
queue_cap_snap when writing back dirty caps without data (e.g.,
snaptest-authwb.sh).
Signed-off-by: Sage Weil <sage@newdream.net>
Eliminiate sparse warning during usage of crypto_shash_* APIs
error: bad constant expression
Allocate memory for shash descriptors once, so that we do not kmalloc/kfree it
for every signature generation (shash descriptor for md5 hash).
From ed7538619817777decc44b5660b52268077b74f3 Mon Sep 17 00:00:00 2001
From: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Date: Tue, 24 Aug 2010 11:47:43 -0500
Subject: [PATCH] eliminate sparse warnings during crypto_shash_* APis usage
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
If xfs_map_blocks returns EAGAIN because of lock contention we must redirty the
page and not disard the pagecache content and return an error from writepage.
We used to do this correctly, but the logic got lost during the recent
reshuffle of the writepage code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Mike Gao <ygao.linux@gmail.com>
Tested-by: Mike Gao <ygao.linux@gmail.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Formatting items requires memory allocation when using delayed
logging. Currently that memory allocation is done while holding the
CIL context lock in read mode. This means that if memory allocation
takes some time (e.g. enters reclaim), we cannot push on the CIL
until the allocation(s) required by formatting complete. This can
stall CIL pushes for some time, and once a push is stalled so are
all new transaction commits.
Fix this splitting the item formatting into two steps. The first
step which does the allocation and memcpy() into the allocated
buffer is now done outside the CIL context lock, and only the CIL
insert is done inside the CIL context lock. This avoids the stall
issue.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Delayed logging adds some serialisation to the log force process to
ensure that it does not deference a bad commit context structure
when determining if a CIL push is necessary or not. It does this by
grabing the CIL context lock exclusively, then dropping it before
pushing the CIL if necessary. This causes serialisation of all log
forces and pushes regardless of whether a force is necessary or not.
As a result fsync heavy workloads (like dbench) can be significantly
slower with delayed logging than without.
To avoid this penalty, copy the current sequence from the context to
the CIL structure when they are swapped. This allows us to do
unlocked checks on the current sequence without having to worry
about dereferencing context structures that may have already been
freed. Hence we can remove the CIL context locking in the forcing
code and only call into the push code if the current context matches
the sequence we need to force.
By passing the sequence into the push code, we can check the
sequence again once we have the CIL lock held exclusive and abort if
the sequence has already been pushed. This avoids a lock round-trip
and unnecessary CIL pushes when we have racing push calls.
The result is that the regression in dbench performance goes away -
this change improves dbench performance on a ramdisk from ~2100MB/s
to ~2500MB/s. This compares favourably to not using delayed logging
which retuns ~2500MB/s for the same workload.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
When we need to cover the log, we issue dummy transactions to ensure
the current log tail is on disk. Unfortunately we currently use the
root inode in the dummy transaction, and the act of committing the
transaction dirties the inode at the VFS level.
As a result, the VFS writeback of the dirty inode will prevent the
filesystem from idling long enough for the log covering state
machine to complete. The state machine gets stuck in a loop issuing
new dummy transactions to cover the log and never makes progress.
To avoid this problem, the dummy transactions should not cause
externally visible state changes. To ensure this occurs, make sure
that dummy transactions log an unchanging field in the superblock as
it's state is never propagated outside the filesystem. This allows
the log covering state machine to complete successfully and the
filesystem now correctly enters a fully idle state about 90s after
the last modification was made.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Because of delayed updates to sb_icount field in the super block, it
is possible to allocate over maxicount number of inodes. This
causes the arithmetic to calculate a negative number of free inodes
in user commands like df or stat -f.
Since maxicount is a somewhat arbitrary number, a slight over
allocation is not critical but user commands should be displayed as
0 or greater and never go negative. To do this the value in the
stats buffer f_ffree is capped to never go negative.
[ Modified to use max_t as per Christoph's comment. ]
Signed-off-by: Stu Brodsky <sbrodsky@sgi.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
During data integrity (WB_SYNC_ALL) writeback, wbc->nr_to_write will
go negative on inodes with more than 1024 dirty pages due to
implementation details of write_cache_pages(). Currently XFS will
abort page clustering in writeback once nr_to_write drops below
zero, and so for data integrity writeback we will do very
inefficient page at a time allocation and IO submission for inodes
with large numbers of dirty pages.
Fix this by only aborting the page clustering code when
wbc->nr_to_write is negative and the sync mode is WB_SYNC_NONE.
Cc: <stable@kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Commit 7124fe0a5b ("xfs: validate untrusted inode
numbers during lookup") changes the inode lookup code to do btree lookups for
untrusted inode numbers. This change made an invalid assumption about the
alignment of inodes and hence incorrectly calculated the first inode in the
cluster. As a result, some inode numbers were being incorrectly considered
invalid when they were actually valid.
The issue was not picked up by the xfstests suite because it always runs fsr
and dump (the two utilities that utilise the bulkstat interface) on cache hot
inodes and hence the lookup code in the cold cache path was not sufficiently
exercised to uncover this intermittent problem.
Fix the issue by relaxing the btree lookup criteria and then checking if the
record returned contains the inode number we are lookup for. If it we get an
incorrect record, then the inode number is invalid.
Cc: <stable@kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Under heavy load parallel metadata loads (e.g. dbench), we can fail
to mark all the inodes in a cluster being freed as XFS_ISTALE as we
skip inodes we cannot get the XFS_ILOCK_EXCL or the flush lock on.
When this happens and the inode cluster buffer has already been
marked stale and freed, inode reclaim can try to write the inode out
as it is dirty and not marked stale. This can result in writing th
metadata to an freed extent, or in the case it has already
been overwritten trigger a magic number check failure and return an
EUCLEAN error such as:
Filesystem "ram0": inode 0x442ba1 background reclaim flush failed with 117
Fix this by ensuring that we hoover up all in memory inodes in the
cluster and mark them XFS_ISTALE when freeing the cluster.
Cc: <stable@kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
When we commit a transaction using delayed logging, we need to
unlock the items in the transaciton before we unlock the CIL context
and allow it to be checkpointed. If we unlock them after we release
the CIl context lock, the CIL can checkpoint and complete before
we free the log items. This breaks stale buffer item unlock and
unpin processing as there is an implicit assumption that the unlock
will occur before the unpin.
Also, some log items need to store the LSN of the transaction commit
in the item (inodes and EFIs) and so can race with other transaction
completions if we don't prevent the CIL from checkpointing before
the unlock occurs.
Cc: <stable@kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
It's possible for a cifsSesInfo struct to have a NULL password, so we
need to check for that prior to running strncmp on it.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The kmalloc() in bio_integrity_prep() is failable, so remove __GFP_NOFAIL
from its mask.
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Fix argument order. We want to move the item to the end of the list, not
change the position of the head.
Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
If we hold the EXCL cap, we cannot trust the dir stats from the MDS (num
files, subdirs) and must not incorrectly conclude that the directory is
empty. If we do, we get can bad results from lookup (bad ENOENT) and
bad readdir results.
Signed-off-by: Sage Weil <sage@newdream.net>
This reminded me... you have two pr_debugs in fanotify_should_send_event
which output redundant information. Maybe you intended it like that so
it is selectable how much log spam you want, or if not you may want to
apply this patch.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@sophos.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
When an fanotify listener is closing it may cause a deadlock between the
listener and the original task doing an fs operation. If the original task
is waiting for a permissions response it will be holding the srcu lock. The
listener cannot clean up and exit until after that srcu lock is syncronized.
Thus deadlock. The fix introduced here is to stop accepting new permissions
events when a listener is shutting down and to grant permission for all
outstanding events. Thus the original task will eventually release the srcu
lock and the listener can complete shutdown.
Reported-by: Andreas Gruenbacher <agruen@suse.de>
Cc: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
The interesting 2 list lockstep walking didn't quite work out if the inode
marks only had ignores and the vfsmount list requested events. The code to
shortcut list traversal would not run the inode list since it didn't have real
event requests. This code forces inode list traversal when a vfsmount mark
matches the event type. Maybe we could add an i_fsnotify_ignored_mask field
to struct inode to get the shortcut back, but it doesn't seem worth it to grow
struct inode again.
I bet with the recent changes to lock the way we do now it would actually not
be a major perf hit to just drop i_fsnotify_mark_mask altogether. But that is
for another day.
Signed-off-by: Eric Paris <eparis@redhat.com>
The fsnotify main loop has 2 booleans which tell if a particular mark was
sent to the listeners or if it should be processed in the next pass. The
problem is that the booleans were not reset on each traversal of the loop.
So marks could get skipped even when they were not sent to the notifiers.
Reported-by: Tvrtko Ursulin <tvrtko.ursulin@sophos.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
The fanotify code is supposed to get the group from the mark. It accidentally
only used the inode_mark. If the vfsmount_mark was set but not the inode_mark
it would deref the NULL inode_mark. Get the group from the correct place.
Reported-by: Tvrtko Ursulin <tvrtko.ursulin@sophos.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
This allows code outside of the mm core to safely manipulate page state
and not worry about the other accounting. Not using these routines means
that some code will lose track of the accounting and we get bugs. This
has happened once already.
Signed-off-by: Michael Rubin <mrubin@google.com>
Signed-off-by: Sage Weil <sage@newdream.net>
When making a request in the virtual snapdir or a snapped portion of the
namespace, we should choose the MDS based on the first nonsnap parent (and
its caps). If that is not the best place, we will get forward hints to
find the right MDS in the cluster. This fixes ESTALE errors when using
the .snap directory and namespace with multiple MDSs.
Signed-off-by: Sage Weil <sage@newdream.net>
When a realm is updated, we need to queue writeback on inodes in that
realm _and_ its children. Otherwise, if the inode gets cowed on the
server, we can get a hang later due to out-of-sync cap/snap state.
Signed-off-by: Sage Weil <sage@newdream.net>
When we snapshot dirty metadata that needs to be written back to the MDS,
include dirty xattr metadata. Make the capsnap reference the encoded
xattr blob so that it will be written back in the FLUSHSNAP op.
Also fix the capsnap creation guard to include dirty auth or file bits,
not just tests specific to dirty file data or file writes in progress
(this fixes auth metadata writeback).
Signed-off-by: Sage Weil <sage@newdream.net>
We should include the xattr metadata blob in the cap update message any
time we are flushing dirty state, NOT just when we are also dropping the
cap. This fixes async xattr writeback.
Also, clean up the code slightly to avoid duplicating the bit test.
Signed-off-by: Sage Weil <sage@newdream.net>
The use of a completion when waiting for session shutdown during umount is
inappropriate, given the complexity of the condition. For multiple MDS's,
this resulted in the umount thread spinning, often preventing the session
close message from being processed in some cases.
Switch to a waitqueue and defined a condition helper. This cleans things
up nicely.
Signed-off-by: Sage Weil <sage@newdream.net>
Make ntlmv2 as an authentication mechanism within ntlmssp
instead of ntlmv1.
Parse type 2 response in ntlmssp negotiation to pluck
AV pairs and use them to calculate ntlmv2 response token.
Also, assign domain name from the sever response in type 2
packet of ntlmssp and use that (netbios) domain name in
calculation of response.
Enable cifs/smb signing using rc4 and md5.
Changed name of the structure mac_key to session_key to reflect
the type of key it holds.
Use kernel crypto_shash_* APIs instead of the equivalent cifs functions.
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steve French <sfrench@us.ibm.com>
This patch corrects a problem of compilation errors at removal of
UNIUPR_NOLOWER definition and adds include guards to cifs_unicode.h.
Signed-off-by: Igor Druzhinin <jaxbrigs@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
NFS: Fix an Oops in the NFSv4 atomic open code
NFS: Fix the selection of security flavours in Kconfig
NFS: fix the return value of nfs_file_fsync()
rpcrdma: Fix SQ size calculation when memreg is FRMR
xprtrdma: Do not truncate iova_start values in frmr registrations.
nfs: Remove redundant NULL check upon kfree()
nfs: Add "lookupcache" to displayed mount options
NFS: allow close-to-open cache semantics to apply to root of NFS filesystem
SUNRPC: fix NFS client over TCP hangs due to packet loss (Bug 16494)
cifs_find_smb_ses assumes that the vol->password field is a valid
pointer, but that's only the case if a password was passed in via
the options string. It's possible that one won't be if there is
no mount helper on the box.
Reported-by: diabel <gacek-2004@wp.pl>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
fs: brlock vfsmount_lock
fs: scale files_lock
lglock: introduce special lglock and brlock spin locks
tty: fix fu_list abuse
fs: cleanup files_lock locking
fs: remove extra lookup in __lookup_hash
fs: fs_struct rwlock to spinlock
apparmor: use task path helpers
fs: dentry allocation consolidation
fs: fix do_lookup false negative
mbcache: Limit the maximum number of cache entries
hostfs ->follow_link() braino
hostfs: dumb (and usually harmless) tpyo - strncpy instead of strlcpy
remove SWRITE* I/O types
kill BH_Ordered flag
vfs: update ctime when changing the file's permission by setfacl
cramfs: only unlock new inodes
fix reiserfs_evict_inode end_writeback second call
nilfs_discard_segment() doesn't wait for completion of discard
requests. This specifies BLKDEV_IFL_WAIT flag when calling
blkdev_issue_discard() in order to fix the sync failure.
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Christoph Hellwig <hch@lst.de>
fs: brlock vfsmount_lock
Use a brlock for the vfsmount lock. It must be taken for write whenever
modifying the mount hash or associated fields, and may be taken for read when
performing mount hash lookups.
A new lock is added for the mnt-id allocator, so it doesn't need to take
the heavy vfsmount write-lock.
The number of atomics should remain the same for fastpath rlock cases, though
code would be slightly slower due to per-cpu access. Scalability is not not be
much improved in common cases yet, due to other locks (ie. dcache_lock) getting
in the way. However path lookups crossing mountpoints should be one case where
scalability is improved (currently requiring the global lock).
The slowpath is slower due to use of brlock. On a 64 core, 64 socket, 32 node
Altix system (high latency to remote nodes), a simple umount microbenchmark
(mount --bind mnt mnt2 ; umount mnt2 loop 1000 times), before this patch it
took 6.8s, afterwards took 7.1s, about 5% slower.
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
fs: scale files_lock
Improve scalability of files_lock by adding per-cpu, per-sb files lists,
protected with an lglock. The lglock provides fast access to the per-cpu lists
to add and remove files. It also provides a snapshot of all the per-cpu lists
(although this is very slow).
One difficulty with this approach is that a file can be removed from the list
by another CPU. We must track which per-cpu list the file is on with a new
variale in the file struct (packed into a hole on 64-bit archs). Scalability
could suffer if files are frequently removed from different cpu's list.
However loads with frequent removal of files imply short interval between
adding and removing the files, and the scheduler attempts to avoid moving
processes too far away. Also, even in the case of cross-CPU removal, the
hardware has much more opportunity to parallelise cacheline transfers with N
cachelines than with 1.
A worst-case test of 1 CPU allocating files subsequently being freed by N CPUs
degenerates to contending on a single lock, which is no worse than before. When
more than one CPU are allocating files, even if they are always freed by
different CPUs, there will be more parallelism than the single-lock case.
Testing results:
On a 2 socket, 8 core opteron, I measure the number of times the lock is taken
to remove the file, the number of times it is removed by the same CPU that
added it, and the number of times it is removed by the same node that added it.
Booting: locks= 25049 cpu-hits= 23174 (92.5%) node-hits= 23945 (95.6%)
kbuild -j16 locks=2281913 cpu-hits=2208126 (96.8%) node-hits=2252674 (98.7%)
dbench 64 locks=4306582 cpu-hits=4287247 (99.6%) node-hits=4299527 (99.8%)
So a file is removed from the same CPU it was added by over 90% of the time.
It remains within the same node 95% of the time.
Tim Chen ran some numbers for a 64 thread Nehalem system performing a compile.
throughput
2.6.34-rc2 24.5
+patch 24.9
us sys idle IO wait (in %)
2.6.34-rc2 51.25 28.25 17.25 3.25
+patch 53.75 18.5 19 8.75
So significantly less CPU time spent in kernel code, higher idle time and
slightly higher throughput.
Single threaded performance difference was within the noise of microbenchmarks.
That is not to say penalty does not exist, the code is larger and more memory
accesses required so it will be slightly slower.
Cc: linux-kernel@vger.kernel.org
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
tty: fix fu_list abuse
tty code abuses fu_list, which causes a bug in remount,ro handling.
If a tty device node is opened on a filesystem, then the last link to the inode
removed, the filesystem will be allowed to be remounted readonly. This is
because fs_may_remount_ro does not find the 0 link tty inode on the file sb
list (because the tty code incorrectly removed it to use for its own purpose).
This can result in a filesystem with errors after it is marked "clean".
Taking idea from Christoph's initial patch, allocate a tty private struct
at file->private_data and put our required list fields in there, linking
file and tty. This makes tty nodes behave the same way as other device nodes
and avoid meddling with the vfs, and avoids this bug.
The error handling is not trivial in the tty code, so for this bugfix, I take
the simple approach of using __GFP_NOFAIL and don't worry about memory errors.
This is not a problem because our allocator doesn't fail small allocs as a rule
anyway. So proper error handling is left as an exercise for tty hackers.
[ Arguably filesystem's device inode would ideally be divorced from the
driver's pseudo inode when it is opened, but in practice it's not clear whether
that will ever be worth implementing. ]
Cc: linux-kernel@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
fs: cleanup files_lock locking
Lock tty_files with a new spinlock, tty_files_lock; provide helpers to
manipulate the per-sb files list; unexport the files_lock spinlock.
Cc: linux-kernel@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
fs: remove extra lookup in __lookup_hash
Optimize lookup for create operations, where no dentry should often be
common-case. In cases where it is not, such as unlink, the added overhead
is much smaller than the removed.
Also, move comments about __d_lookup racyness to the __d_lookup call site.
d_lookup is intuitive; __d_lookup is what needs commenting. So in that same
vein, add kerneldoc comments to __d_lookup and clean up some of the comments:
- We are interested in how the RCU lookup works here, particularly with
renames. Make that explicit, and point to the document where it is explained
in more detail.
- RCU is pretty standard now, and macros make implementations pretty mindless.
If we want to know about RCU barrier details, we look in RCU code.
- Delete some boring legacy comments because we don't care much about how the
code used to work, more about the interesting parts of how it works now. So
comments about lazy LRU may be interesting, but would better be done in the
LRU or refcount management code.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
fs: fs_struct rwlock to spinlock
struct fs_struct.lock is an rwlock with the read-side used to protect root and
pwd members while taking references to them. Taking a reference to a path
typically requires just 2 atomic ops, so the critical section is very small.
Parallel read-side operations would have cacheline contention on the lock, the
dentry, and the vfsmount cachelines, so the rwlock is unlikely to ever give a
real parallelism increase.
Replace it with a spinlock to avoid one or two atomic operations in typical
path lookup fastpath.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
fs: dentry allocation consolidation
There are 2 duplicate copies of code in dentry allocation in path lookup.
Consolidate them into a single function.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
fs: fix do_lookup false negative
In do_lookup, if we initially find no dentry, we take the directory i_mutex and
re-check the lookup. If we find a dentry there, then we revalidate it if
needed. However if that revalidate asks for the dentry to be invalidated, we
return -ENOENT from do_lookup. What should happen instead is an attempt to
allocate and lookup a new dentry.
This is probably not noticed because it is rare. It is only reached if a
concurrent create races in first (in which case, the dentry probably won't be
invalidated anyway), or if the racy __d_lookup has failed due to a
false-negative (which is very rare).
Fix this by removing code and have it use the normal reval path.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Limit the maximum number of mb_cache entries depending on the number of
hash buckets: if the only limit to the number of cache entries is the
available memory the hash chains can grow very long, taking a long time
to search.
At least partially solves https://bugzilla.lustre.org/show_bug.cgi?id=22771.
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
we want the assignment to err done inside the if () to be
visible after it, so (re)declaring err inside if () body
is wrong.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
These flags aren't real I/O types, but tell ll_rw_block to always
lock the buffer instead of giving up on a failed trylock.
Instead add a new write_dirty_buffer helper that implements this semantic
and use it from the existing SWRITE* callers. Note that the ll_rw_block
code had a bug where it didn't promote WRITE_SYNC_PLUG properly, which
this patch fixes.
In the ufs code clean up the helper that used to call ll_rw_block
to mirror sync_dirty_buffer, which is the function it implements for
compound buffers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of abusing a buffer_head flag just add a variant of
sync_dirty_buffer which allows passing the exact type of write
flag required.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
generic_acl_set didn't update the ctime of the file when its permission was
changed.
Steps to reproduce:
# touch aaa
# stat -c %Z aaa
1275289822
# setfacl -m 'u::x,g::x,o::x' aaa
# stat -c %Z aaa
1275289822 <- unchanged
But, according to the spec of the ctime, vfs must update it.
Port of ext3 patch by Miao Xie <miaox@cn.fujitsu.com>.
CC: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Commit 77b8a75f5b introduced a warning at fs/inode.c:692 unlock_new_inode(),
caused by unlock_new_inode() being called on existing inodes as well.
This patch changes setup_inode() to only call unlock_new_inode() for I_NEW
inodes.
Signed-off-by: Alexander Shishkin <virtuoso@slind.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
reiserfs_evict_inode calls end_writeback two times hitting
kernel BUG at fs/inode.c:298 becase inode->i_state is I_CLEAR already.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
nilfs2: fix false warning saying one of two super blocks is broken
nilfs2: fix list corruption after ifile creation failure
Make do_execve() take a const filename pointer so that kernel_execve() compiles
correctly on ARM:
arch/arm/kernel/sys_arm.c:88: warning: passing argument 1 of 'do_execve' discards qualifiers from pointer target type
This also requires the argv and envp arguments to be consted twice, once for
the pointer array and once for the strings the array points to. This is
because do_execve() passes a pointer to the filename (now const) to
copy_strings_kernel(). A simpler alternative would be to cast the filename
pointer in do_execve() when it's passed to copy_strings_kernel().
do_execve() may not change any of the strings it is passed as part of the argv
or envp lists as they are some of them in .rodata, so marking these strings as
const should be fine.
Further kernel_execve() and sys_execve() need to be changed to match.
This has been test built on x86_64, frv, arm and mips.
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Randy Dunlap reports:
ERROR: "svc_gss_principal" [fs/nfs/nfs.ko] undefined!
because in fs/nfs/Kconfig, NFS_V4 selects RPCSEC_GSS_KRB5
and/or in fs/nfsd/Kconfig, NFSD_V4 selects RPCSEC_GSS_KRB5.
RPCSEC_GSS_KRB5 does 5 selects, but none of these is enforced/followed
by the fs/nfs[d]/Kconfig configs:
select SUNRPC_GSS
select CRYPTO
select CRYPTO_MD5
select CRYPTO_DES
select CRYPTO_CBC
Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: J. Bruce Fields <bfields@fieldses.org>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
cifs has a lot of complicated functions that have to clean up things on
error, but some of them don't have all of the cleanup code
well-consolidated. Clean up and consolidate error handling in several
functions.
This is in preparation of later patches that will need to put references
to the tcon link container.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Get rid of some nesting and add a label we can goto on error.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
After applying commit b2ac86e1, the following message got appeared
after unclean shutdown:
> NILFS warning: broken superblock. using spare superblock.
This turns out to be a false message due to the change which updates
two super blocks alternately. The secondary super block now can be
selected if it's newer than the primary one.
This kills the false warning by suppressing it if another super block
is not actually broken.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This commit makes the stack guard page somewhat less visible to user
space. It does this by:
- not showing the guard page in /proc/<pid>/maps
It looks like lvm-tools will actually read /proc/self/maps to figure
out where all its mappings are, and effectively do a specialized
"mlockall()" in user space. By not showing the guard page as part of
the mapping (by just adding PAGE_SIZE to the start for grows-up
pages), lvm-tools ends up not being aware of it.
- by also teaching the _real_ mlock() functionality not to try to lock
the guard page.
That would just expand the mapping down to create a new guard page,
so there really is no point in trying to lock it in place.
It would perhaps be nice to show the guard page specially in
/proc/<pid>/maps (or at least mark grow-down segments some way), but
let's not open ourselves up to more breakage by user space from programs
that depends on the exact deails of the 'maps' file.
Special thanks to Henrique de Moraes Holschuh for diving into lvm-tools
source code to see what was going on with the whole new warning.
Reported-and-tested-by: François Valenduc <francois.valenduc@tvcablenet.be
Reported-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix parameter name in kernel-doc notation (causes a warning).
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mark arguments to certain system calls as being const where they should be but
aren't. The list includes:
(*) The filename arguments of various stat syscalls, execve(), various utimes
syscalls and some mount syscalls.
(*) The filename arguments of some syscall helpers relating to the above.
(*) The buffer argument of various write syscalls.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The last user is gone, so we can safely remove this
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: John Kacur <jkacur@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
logfs does not need the BKL, so use ->unlocked_ioctl instead
of ->ioctl in file operations.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Joern Engel <joern@logfs.org>
[ fixed trivial conflict ]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
O2net: Disallow o2net accept connection request from itself.
ocfs2/dlm: remove potential deadlock -V3
ocfs2/dlm: avoid incorrect bit set in refmap on recovery master
Fix the nested PR lock calling issue in ACL
ocfs2: Count more refcount records in file system fragmentation.
ocfs2 fix o2dlm dlm run purgelist (rev 3)
ocfs2/dlm: fix a dead lock
ocfs2: do not overwrite error codes in ocfs2_init_acl
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
[NFS] Set CONFIG_KEYS when CONFIG_NFS_USE_KERNEL_DNS is set
AFS: Implement an autocell mount capability [ver #2]
DNS: If the DNS server returns an error, allow that to be cached [ver #2]
NFS: Use kernel DNS resolver [ver #2]
cifs: update README to include details about 'fsc' option
9c867fbe "partitions: fix sometimes unreadable partition strings" coverted
one line within the ibm partition code incorrectly. Fix this to get rid of
a build error.
fs/partitions/ibm.c: In function 'ibm_partition':
[...]
fs/partitions/ibm.c:185: error: too many arguments to function 'strlcat'
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
This reverts commit 3bcf3860a4 (and the
accompanying commit c1e5c95402 "vfs/fsnotify: fsnotify_close can delay
the final work in fput" that was a horribly ugly hack to make it work at
all).
The 'struct file' approach not only causes that disgusting hack, it
somehow breaks pulseaudio, probably due to some other subtlety with
f_count handling.
Fix up various conflicts due to later fsnotify work.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Previous patch relied on DNS_RESOLVER setting CONFIG_KEYS
but needs to be selected in NFS config when using the new
DNS resolver
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
CC: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* 'params' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: (22 commits)
param: don't deref arg in __same_type() checks
param: update drivers/acpi/debug.c to new scheme
param: use module_param in drivers/message/fusion/mptbase.c
ide: use module_param_named rather than module_param_call
param: update drivers/char/ipmi/ipmi_watchdog.c to new scheme
param: lock if_sdio's lbs_helper_name and lbs_fw_name against sysfs changes.
param: lock myri10ge_fw_name against sysfs changes.
param: simple locking for sysfs-writable charp parameters
param: remove unnecessary writable charp
param: add kerneldoc to moduleparam.h
param: locking for kernel parameters
param: make param sections const.
param: use free hook for charp (fix leak of charp parameters)
param: add a free hook to kernel_param_ops.
param: silence .init.text references from param ops
Add param ops struct for hvc_iucv driver.
nfs: update for module_param_named API change
AppArmor: update for module_param_named API change
param: use ops in struct kernel_param, rather than get and set fns directly
param: move the EXPORT_SYMBOL to after the definitions.
...
Add a dummy printk function for the maintenance of unused printks through gcc
format checking, and also so that side-effect checking is maintained too.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 83ba7b071f ("writeback: simplify the write back thread queue")
broke writeback_in_progress() as in that commit we started to remove work
items from the list at the moment we start working on them and not at the
moment they are finished. Thus if the flusher thread was doing some work
but there was no other work queued, writeback_in_progress() returned
false. This could in particular cause unnecessary queueing of background
writeback from balance_dirty_pages() or writeout work from
writeback_sb_if_idle().
This patch fixes the problem by introducing a bit in the bdi state which
indicates that the flusher thread is processing some work and uses this
bit for writeback_in_progress() test.
NOTE: Both callsites of writeback_in_progress() (namely,
writeback_inodes_sb_if_idle() and balance_dirty_pages()) would actually
need a different information than what writeback_in_progress() provides.
They would need to know whether *the kind of writeback they are going to
submit* is already queued. But this information isn't that simple to
provide so let's fix writeback_in_progress() for the time being.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Acked-by: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Unify the logic for kupdate and non-kupdate cases. There won't be
starvation because the inodes requeued into b_more_io will later be
spliced _after_ the remaining inodes in b_io, hence won't stand in the way
of other inodes in the next run.
It avoids unnecessary redirty_tail() calls, hence the update of
i_dirtied_when. The timestamp update is undesirable because it could
later delay the inode's periodic writeback, or may exclude the inode from
the data integrity sync operation (which checks timestamp to avoid extra
work and livelock).
===
How the redirty_tail() comes about:
It was a long story.. This redirty_tail() was introduced with
wbc.more_io. The initial patch for more_io actually does not have the
redirty_tail(), and when it's merged, several 100% iowait bug reports
arised:
reiserfs:
http://lkml.org/lkml/2007/10/23/93
jfs:
commit 29a424f283
JFS: clear PAGECACHE_TAG_DIRTY for no-write pages
ext2:
http://www.spinics.net/linux/lists/linux-ext4/msg04762.html
They are all old bugs hidden in various filesystems that become "visible"
with the more_io patch. At the time, the ext2 bug is thought to be
"trivial", so not fixed. Instead the following updated more_io patch with
redirty_tail() is merged:
http://www.spinics.net/linux/lists/linux-ext4/msg04507.html
This will in general prevent 100% on ext2 and possibly other unknown FS bugs.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This was not a bug, since b_io is empty for kupdate writeback. The next
patch will do requeue_io() for non-kupdate writeback, so let's fix it.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Avoid delaying writeback for an expire inode with lots of dirty pages, but
no active dirtier at the moment. Previously we only do that for the
kupdate case.
Any filesystem that does delayed allocation or unwritten extent conversion
after IO completion will cause this - for example, XFS.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Split get_dirty_limits() into global_dirty_limits()+bdi_dirty_limit(), so
that the latter can be avoided when under global dirty background
threshold (which is the normal state for most systems).
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In CoW, when we meet with a readahead page, we know
it is time to move the readahead window. So carry
out a new readahead.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
struct file * has file_ra_state to store the readahead state
and data. So pass this to ocfs2_prepare_inode_for_write. so
that it can be used in ocfs2_refcount_cow.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
struct file * has file_ra_state to store the readahead state
and data. So pass this to ocfs2_write_begin_nolock so that
it can be used in ocfs2_refcount_cow.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Implement the ability for the root directory of a mounted AFS filesystem to
accept lookups of arbitrary directory names, to interpet the names as the names
of cells, to look the cell names up in the DNS for AFSDB records and to mount
the root.cell volume of the nominated cell on the pseudo-directory created by
lookup.
This facility is requested by passing:
-o autocell
to the mountpoint for which this is desired, usually the /afs mount.
To use this facility, a DNS upcall program is required for AFSDB records. This
can be obtained from:
http://people.redhat.com/~dhowells/afs/dns.afsdb.c
It should be compiled with -lresolv and -lkeyutils and installed as, say:
/usr/sbin/dns.afsdb
Then the following line needs to be added to /sbin/request-key.conf:
create dns_resolver afsdb:* * /usr/sbin/dns.afsdb %k
This can be tested by mounting AFS, say:
insmod dns_resolver.ko
insmod af-rxrpc.ko
insmod kafs.ko rootcell=grand.central.org
mount -t afs "#grand.central.org:root.cell." /afs -o autocell
and doing:
ls /afs/grand.central.org/
which should show:
archive/ cvs/ doc/ local/ project/ service/ software/ user/ www/
if it works.
Signed-off-by: Wang Lei <wang840925@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
If the DNS server returns an error, allow that to be cached in the DNS resolver
key in lieu of a value. Userspace passes the desired error number as an option
in the payload:
"#dnserror=<number>"
Userspace must map h_errno from the name resolution routines to an appropriate
Linux error before passing it up. Something like the following mapping is
recommended:
[HOST_NOT_FOUND] = ENODATA,
[TRY_AGAIN] = EAGAIN,
[NO_RECOVERY] = ECONNREFUSED,
[NO_DATA] = ENODATA,
in lieu of Linux errors specifically for representing name service errors. The
filesystem must map these errors appropropriately before passing them to
userspace. AFS is made to map ENODATA and EAGAIN to EDESTADDRREQ for the
return to userspace; ECONNREFUSED is allowed to stand as is.
The error can be seen in /proc/keys as a negative number after the description
of the key. Compare, for example, the following key entries:
2f97238c I--Q-- 1 53s 3f010000 0 0 dns_resol afsdb:grand.centrall.org: -61
338bfbbe I--Q-- 1 59m 3f010000 0 0 dns_resol afsdb:grand.central.org: 37
If the error option is supplied in the payload, the main part of the payload is
discarded. The key should have an expiry time set by userspace.
Signed-off-by: Wang Lei <wang840925@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Use the kernel DNS resolver to translate hostnames to IP addresses. Create a
new config option to choose between the legacy DNS resolver and the new
resolver.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
By the commit af7fa16 2010-08-03 NFS: Fix up the fsync code
close(2) became returning the non-zero value even if it went well.
nfs_file_fsync() should return 0 when "status" is positive.
Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
isofs: Fix lseek() to position beyond 4 GB
vfs: remove unused MNT_STRICTATIME
vfs: show unreachable paths in getcwd and proc
vfs: only add " (deleted)" where necessary
vfs: add prepend_path() helper
vfs: __d_path: dont prepend the name of the root dentry
ia64: perfmon: add d_dname method
vfs: add helpers to get root and pwd
cachefiles: use path_get instead of lone dget
fs/sysv/super.c: add support for non-PDP11 v7 filesystems
V7: Adjust sanity checks for some volumes
Add v7 alias
v9fs: fixup for inode_setattr being removed
Manual merge to take Al's version of the fs/sysv/super.c file: it merged
cleanly, but Al had removed an unnecessary header include, so his side
was better.
* git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus:
Squashfs: fix checkpatch.pl warnings
Squashfs: fix filename typo
Squashfs: update Kconfig and documentation for LZO
Squashfs: fix block size use in LZO decompressor
Squashfs: Add LZO compression support
squashfs: fix filename in header comment
Squashfs: Make XATTR config name consistent with other file systems
squashfs: fix compiler inline warning
* 'for-linus' of git://git.open-osd.org/linux-open-osd:
exofs: Fix groups code when num_devices is not divisible by group_width
exofs: Remove useless optimization
exofs: exofs_file_fsync and exofs_file_flush correctness
exofs: Remove superfluous dependency on buffer_head and writeback
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (39 commits)
ceph: generalize mon requests, add pool op support
ceph: only queue async writeback on cap revocation if there is dirty data
ceph: do not ignore osd_idle_ttl mount option
ceph: constify dentry_operations
ceph: whitespace cleanup
ceph: add flock/fcntl lock support
ceph: define on-wire types, constants for file locking support
ceph: add CEPH_FEATURE_FLOCK to the supported feature bits
ceph: support v2 reconnect encoding
ceph: support v2 client_caps encoding
ceph: move AES iv definition to shared header
ceph: fix decoding of pool snap info
ceph: make ->sync_fs not wait if wait==0
ceph: warn on missing snap realm
ceph: print useful error message when crush rule not found
ceph: use %pU to print uuid (fsid)
ceph: sync header defs with server code
ceph: clean up header guards
ceph: strip misleading/obsolete version, feature info
ceph: specify supported features in super.h
...
This adds byte order autodetection (of PDP-11 and LE filesystems). No
attempt is made to detect big-endian filesystems -- were there any?
Tested with PDP-11 v7 filesystems and PC-IX maintenance floppy.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Newly mkfs-ed filesystems from Seventh Edition have last modification time
set to zero, but are otherwise perfectly valid.
Also, tighten up other sanity checks to filter out most filesystems with
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
So that the module gets autoloaded when a v7 filesystem is mounted.
Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can clean up the work queue on this error path. This function is
called from afs_init().
Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since the entire fs/proc directory is conditionally included based on
CONFIG_PROC_FS, it's redundant to check that same variable within that
directory.
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If signalfd is used to consume a signal generated by a POSIX interval
timer or POSIX message queue, the ssi_int field does not reflect the data
(sigevent->sigev_value) supplied to timer_create(2) or mq_notify(3). (The
ssi_ptr field, however, is filled in.)
This behavior differs from signalfd's treatment of sigqueue-generated
signals -- see the default case in signalfd_copyinfo. It also gives
results that differ from the case when a signal is handled conventionally
via a sigaction-registered handler.
So, set signalfd_siginfo->ssi_int in the remaining cases (__SI_TIMER,
__SI_MESGQ) where ssi_ptr is set.
akpm: a non-back-compatible change. Merge into -stable to minimise the
number of kernels which are in the field and which miss this feature.
Signed-off-by: Nathan Lynch <ntl@pobox.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The cgroup device whitelist code gets confused when trying to grant
permission to a disk partition that is not currently open. Part of
blkdev_open() includes __blkdev_get() on the whole disk.
Basically, the only ways to reliably allow a cgroup access to a partition
on a block device when using the whitelist are to 1) also give it access
to the whole block device or 2) make sure the partition is already open in
a different context.
The patch avoids the cgroup check for the whole disk case when opening a
partition.
Addresses https://bugzilla.redhat.com/show_bug.cgi?id=589662
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
Reported-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: "Daniel P. Berrange" <berrange@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After 97e7449a7ad: "autofs4: fix indirect mount pending expire race" we no
longer assumed that "ino" can be null. The other null checks got removed
but this was one was missed.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Cc: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use kmalloc() to allocate fdmem if possible.
vmalloc() is used as a fallback solution for fdmem allocation. A new
helper function __free_fdtable() is introduced to reduce the lines of
code.
A potential bug, vfree() a memory allocated by kmalloc(), is fixed.
[akpm@linux-foundation.org: use __GFP_NOWARN, uninline alloc_fdmem() and free_fdmem()]
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The O_* bit numbers are defined in 20+ arch/*, and can silently overlap.
Add a compile time check to ensure the uniqueness as suggested by David
Miller.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: David Miller <davem@davemloft.net>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: Jamie Lokier <jamie@shareable.org>
Cc: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Improve the description of fget_light(), which is currently incorrect
about needing a prior refcnt (judging by the way it is actually used).
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is more kernel-ish, saves some space, and also allows us to
expand the ops without breaking all the callers who are happy for the
new members to be NULL.
The few places which defined their own param types are changed to the
new scheme (more which crept in recently fixed in following patches).
Since we're touching them anyway, we change get() and set() to take a
const struct kernel_param (which they really are). This causes some
harmless warnings until we fix them (in following patches).
To reduce churn, module_param_call creates the ops struct so the callers
don't have to change (and casts the functions to reduce warnings).
The modern version which takes an ops struct is called module_param_cb.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Takashi Iwai <tiwai@suse.de>
Tested-by: Phil Carmody <ext-phil.2.carmody@nokia.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ville Syrjala <syrjala@sci.fi>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Alessandro Rubini <rubini@ipvvis.unipv.it>
Cc: Michal Januszewski <spock@gentoo.org>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Neil Brown <neilb@suse.de>
Cc: linux-kernel@vger.kernel.org
Cc: linux-input@vger.kernel.org
Cc: linux-fbdev-devel@lists.sourceforge.net
Cc: linux-nfs@vger.kernel.org
Cc: netdev@vger.kernel.org
isofs supports files larger than 4 GB by using multi-extent files.
However an lseek() to a position beyond 4 GB in such a file will
fail with EINVAL, because s_maxbytes in the isofs superblock is
initialized to 2^32-1, and generic_file_llseek() checks against
that value.
I therefore suggest increasing the value of s_maxbytes to have
full support for large files in isofs. With multi-extent files, file
size is only limited by the maximum size of the file system (8 TB),
so this seems a reasonable value for s_maxbytes.
Signed-off-by: Jan Andres <jandres@gmx.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Commit d0adde574b added MNT_STRICTATIME
but it isn't actually used (MS_STRICTATIME clears MNT_RELATIME and
MNT_NOATIME rather than setting any mount flag).
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Prepend "(unreachable)" to path strings if the path is not reachable
from the current root.
Two places updated are
- the return string from getcwd()
- and symlinks under /proc/$PID.
Other uses of d_path() are left unchanged (we know that some old
software crashes if /proc/mounts is changed).
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
__d_path() has 4 callers:
d_path()
sys_getcwd()
seq_path_root()
tomoyo_realpath_from_path2()
Of these the only one which needs the " (deleted)" ending is d_path().
sys_getcwd() checks for existence before calling __d_path().
seq_path_root() is used to show the mountpoint path in
/proc/PID/mountinfo, which is always a positive.
And tomoyo doesn't want the deleted ending.
Create a helper "path_with_deleted()" as subsequent patches will need
this in multiple places.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Split off prepend_path() from __d_path(). This new helper takes an
end-of-buffer pointer and buffer-length pointer just like the other
prepend_* functions. Move the " (deleted)" postfix out to __d_path().
This patch doesn't change any functionality but paves the way for the
following patches.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In the old times pseudo-filesystems set the name of theroot dentry to
some prefix like "pipe:" and the name of the child dentry to "[123]"
and relied on a hack in __d_path() to replace the preceding slash with
the root's name to get "pipe:[123]".
Then the d_dname() dentry operation was introduced which solved the
same problem without having to pre-fill the name in each dentry.
Currently the following pseudo filesystems exist in the kernel:
perfmon
mtd
anon_inode
bdev
pipe
socket
Of these only perfmon, anon_inode, pipe and socket create
sub-dentries, all of which have now been switched to using d_dname().
bdev and mtd only create inodes.
This means that now the hack to overwrite the slash can be removed, so
for unreachable paths (e.g. within a detached mount) the path string
won't be polluted with garbage. For these cases a subsequent patch
will add a prefix, indicating that the path is unreachable.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add three helpers that retrieve a refcounted copy of the root and cwd
from the supplied fs_struct.
get_fs_root()
get_fs_pwd()
get_fs_root_and_pwd()
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Dentry references should not be acquired without a corresponding
vfsmount ref.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This adds byte order autodetection (of PDP-11 and LE filesystems). No
attempt is made to detect big-endian filesystems -- were there any?
Tested with PDP-11 v7 filesystems and PC-IX maintenance floppy.
[akpm@linux-foundation.org: coding-style fixes]
[AV: parser.h inclusion was a rudiment of discarded stuff]
Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Newly mkfs-ed filesystems from Seventh Edition have last modification
time set to zero, but are otherwise perfectly valid.
Also, tighten up other sanity checks to filter out most filesystems with
different bytesex than we're using.
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
So that the module gets autoloaded when a v7 filesystem is mounted.
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It's currently possible to bypass xattr namespace access rules by
prefixing valid xattr names with "os2.", since the os2 namespace stores
extended attributes in a legacy format with no prefix.
This patch adds checking to deny access to any valid namespace prefix
following "os2.".
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Reported-by: Sergey Vlasov <vsu@altlinux.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block: (149 commits)
block: make sure that REQ_* types are seen even with CONFIG_BLOCK=n
xen-blkfront: fix missing out label
blkdev: fix blkdev_issue_zeroout return value
block: update request stacking methods to support discards
block: fix missing export of blk_types.h
writeback: fix bad _bh spinlock nesting
drbd: revert "delay probes", feature is being re-implemented differently
drbd: Initialize all members of sync_conf to their defaults [Bugz 315]
drbd: Disable delay probes for the upcomming release
writeback: cleanup bdi_register
writeback: add new tracepoints
writeback: remove unnecessary init_timer call
writeback: optimize periodic bdi thread wakeups
writeback: prevent unnecessary bdi threads wakeups
writeback: move bdi threads exiting logic to the forker thread
writeback: restructure bdi forker loop a little
writeback: move last_active to bdi
writeback: do not remove bdi from bdi_list
writeback: simplify bdi code a little
writeback: do not lose wake-ups in bdi threads
...
Fixed up pretty trivial conflicts in drivers/block/virtio_blk.c and
drivers/scsi/scsi_error.c as per Jens.
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6: (68 commits)
U6715 16550A serial driver support
Char: nozomi, set tty->driver_data appropriately
Char: nozomi, fix tty->count counting
serial: max3107: Fix gpiolib support
hsu: call PCI pm hooks in suspend/resume function
hsu: some code cleanup
hsu: add a periodic timer to check dma rx channel
hsu: driver for Medfield High Speed UART device
mxser: remove unnesesary NULL check
serial: add support for OX16PCI958 card
serial: 68328serial.c: remove dead (ALMA_ANS | DRAGONIXVZ | M68EZ328ADS)
timbuart: use __devinit and __devexit macros for probe and remove
serial: MMIO32 support for 8250_early.c
serial: mcf: don't take spinlocks in already protected functions
serial: general fixes in the serial_rs485 structure
serial: fix missing bit coverage of ASYNC_FLAGS
serial: "altera_uart: simplify altera_uart_console_putc()" checkpatch fixes
serial: crisv10: formatting of pointers in printk()
vt: Fix warning: statement with no effect due to vt_kern.h
tty_io: remove casts from void*
...
Generalize the current statfs synchronous requests, and support pool_ops.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Running "cat /proc/mounts" fails to display the "lookupcache" option.
This oversight cost me a bunch of wasted time recently.
The following simple patch fixes it.
CC: stable <stable@kernel.org>
Signed-off-by: Patrick LoPresti <lopresti@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch is against the 2.6.34 source.
Paraphrased from the 1989 BSD patch by David Borman @ cray.com:
These are the changes needed for the kernel to support
LINEMODE in the server.
There is a new bit in the termios local flag word, EXTPROC.
When this bit is set, several aspects of the terminal driver
are disabled. Input line editing, character echo, and mapping
of signals are all disabled. This allows the telnetd to turn
off these functions when in linemode, but still keep track of
what state the user wants the terminal to be in.
New ioctl:
TIOCSIG Generate a signal to processes in the
current process group of the pty.
There is a new mode for packet driver, the TIOCPKT_IOCTL bit.
When packet mode is turned on in the pty, and the EXTPROC bit
is set, then whenever the state of the pty is changed, the
next read on the master side of the pty will have the TIOCPKT_IOCTL
bit set. This allows the process on the server side of the pty
to know when the state of the terminal has changed; it can then
issue the appropriate ioctl to retrieve the new state.
Since the original BSD patches accompanied the source code for telnet
I've left that reference here, but obviously the feature is useful for
any remote terminal protocol, including ssh.
The corresponding feature has existed in the BSD tty driver since 1989.
For historical reference, a good copy of the relevant files can be found
here:
http://anonsvn.mit.edu/viewvc/krb5/trunk/src/appl/telnet/?pathrev=17741
Signed-off-by: Howard Chu <hyc@symas.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6:
ecryptfs: dont call lookup_one_len to avoid NULL nameidata
fs/ecryptfs/file.c: introduce missing free
ecryptfs: release reference to lower mount if interpose fails
eCryptfs: Handle ioctl calls with unlocked and compat functions
ecryptfs: Fix warning in ecryptfs_process_response()
* git://git.infradead.org/mtd-2.6: (79 commits)
mtd: Remove obsolete <mtd/compatmac.h> include
mtd: Update copyright notices
jffs2: Update copyright notices
mtd-physmap: add support users can assign the probe type in board files
mtd: remove redwood map driver
mxc_nand: Add v3 (i.MX51) Support
mxc_nand: support 8bit ecc
mxc_nand: fix correct_data function
mxc_nand: add V1_V2 namespace to registers
mxc_nand: factor out a check_int function
mxc_nand: make some internally used functions overwriteable
mxc_nand: rework get_dev_status
mxc_nand: remove 0xe00 offset from registers
mtd: denali: Add multi connected NAND support
mtd: denali: Remove set_ecc_config function
mtd: denali: Remove unuseful code in get_xx_nand_para functions
mtd: denali: Remove device_info_tag structure
mtd: m25p80: add support for the Winbond W25Q32 SPI flash chip
mtd: m25p80: add support for the Intel/Numonyx {16,32,64}0S33B SPI flash chips
mtd: m25p80: add support for the EON EN25P{32, 64} SPI flash chips
...
Fix up trivial conflicts in drivers/mtd/maps/{Kconfig,redwood.c} due to
redwood driver removal.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/bcopeland/omfs:
omfs: fix uninitialized variable warning
omfs: sanity check cluster size
omfs: refuse to mount if bitmap pointer is obviously wrong
omfs: check bounds on block numbers before passing to sb_bread
omfs: fix memory leak
* 'for-linus' of git://git.infradead.org/users/eparis/notify: (132 commits)
fanotify: use both marks when possible
fsnotify: pass both the vfsmount mark and inode mark
fsnotify: walk the inode and vfsmount lists simultaneously
fsnotify: rework ignored mark flushing
fsnotify: remove global fsnotify groups lists
fsnotify: remove group->mask
fsnotify: remove the global masks
fsnotify: cleanup should_send_event
fanotify: use the mark in handler functions
audit: use the mark in handler functions
dnotify: use the mark in handler functions
inotify: use the mark in handler functions
fsnotify: send fsnotify_mark to groups in event handling functions
fsnotify: Exchange list heads instead of moving elements
fsnotify: srcu to protect read side of inode and vfsmount locks
fsnotify: use an explicit flag to indicate fsnotify_destroy_mark has been called
fsnotify: use _rcu functions for mark list traversal
fsnotify: place marks on object in order of group memory address
vfs/fsnotify: fsnotify_close can delay the final work in fput
fsnotify: store struct file not struct path
...
Fix up trivial delete/modify conflict in fs/notify/inotify/inotify.c.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits)
no need for list_for_each_entry_safe()/resetting with superblock list
Fix sget() race with failing mount
vfs: don't hold s_umount over close_bdev_exclusive() call
sysv: do not mark superblock dirty on remount
sysv: do not mark superblock dirty on mount
btrfs: remove junk sb_dirt change
BFS: clean up the superblock usage
AFFS: wait for sb synchronization when needed
AFFS: clean up dirty flag usage
cifs: truncate fallout
mbcache: fix shrinker function return value
mbcache: Remove unused features
add f_flags to struct statfs(64)
pass a struct path to vfs_statfs
update VFS documentation for method changes.
All filesystems that need invalidate_inode_buffers() are doing that explicitly
convert remaining ->clear_inode() to ->evict_inode()
Make ->drop_inode() just return whether inode needs to be dropped
fs/inode.c:clear_inode() is gone
fs/inode.c:evict() doesn't care about delete vs. non-delete paths now
...
Fix up trivial conflicts in fs/nilfs2/super.c
To obey NFS cache semantics, the client must verify the cached
attributes when a file is opened. In most cases this is done by a call to
d_validate as one of the last steps in path_walk.
However for the root of a filesystem, d_validate is only ever called
on the mounted-on filesystem (except when the path ends '.' or '..').
So NFS has no chance to validate the attributes.
So, in nfs_opendir, we revalidate the attributes if the opened
directory is the mountpoint. This may cause double-validation for "."
and ".." lookups, but that is better than missing regular /path/name
lookups completely.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Using:
gcc (GCC) 4.5.0 20100610 (prerelease)
The following warnings appear:
fs/readdir.c: In function `filldir64':
fs/readdir.c:240:15: warning: `dirent' is used uninitialized in this function
fs/readdir.c: In function `filldir':
fs/readdir.c:155:15: warning: `dirent' is used uninitialized in this function
fs/compat.c: In function `compat_filldir64':
fs/compat.c:1071:11: warning: `dirent' is used uninitialized in this function
fs/compat.c: In function `compat_filldir':
fs/compat.c:984:15: warning: `dirent' is used uninitialized in this function
The warnings are related to the use of the NAME_OFFSET() macro. Luckily,
it appears as though the standard offsetof() macro is what is being
implemented by NAME_OFFSET(), thus we can fix the warning and use a more
standard code construct at the same time.
Signed-off-by: Kevin Winchester <kjwinchester@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
WB_SYNC_NONE writeback is done in rounds of 1024 pages so that we don't
write out some huge inode for too long while starving writeout of other
inodes. To avoid livelocks, we record time we started writeback in
wbc->wb_start and do not write out inodes which were dirtied after this
time. But currently, writeback_inodes_wb() resets wb_start each time it
is called thus effectively invalidating this logic and making any
WB_SYNC_NONE writeback prone to livelocks.
This patch makes sure wb_start is set only once when we start writeback.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Acked-by: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/proc/pid/oom_adj is now deprecated so that that it may eventually be
removed. The target date for removal is August 2012.
A warning will be printed to the kernel log if a task attempts to use this
interface. Future warning will be suppressed until the kernel is rebooted
to prevent spamming the kernel log.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This a complete rewrite of the oom killer's badness() heuristic which is
used to determine which task to kill in oom conditions. The goal is to
make it as simple and predictable as possible so the results are better
understood and we end up killing the task which will lead to the most
memory freeing while still respecting the fine-tuning from userspace.
Instead of basing the heuristic on mm->total_vm for each task, the task's
rss and swap space is used instead. This is a better indication of the
amount of memory that will be freeable if the oom killed task is chosen
and subsequently exits. This helps specifically in cases where KDE or
GNOME is chosen for oom kill on desktop systems instead of a memory
hogging task.
The baseline for the heuristic is a proportion of memory that each task is
currently using in memory plus swap compared to the amount of "allowable"
memory. "Allowable," in this sense, means the system-wide resources for
unconstrained oom conditions, the set of mempolicy nodes, the mems
attached to current's cpuset, or a memory controller's limit. The
proportion is given on a scale of 0 (never kill) to 1000 (always kill),
roughly meaning that if a task has a badness() score of 500 that the task
consumes approximately 50% of allowable memory resident in RAM or in swap
space.
The proportion is always relative to the amount of "allowable" memory and
not the total amount of RAM systemwide so that mempolicies and cpusets may
operate in isolation; they shall not need to know the true size of the
machine on which they are running if they are bound to a specific set of
nodes or mems, respectively.
Root tasks are given 3% extra memory just like __vm_enough_memory()
provides in LSMs. In the event of two tasks consuming similar amounts of
memory, it is generally better to save root's task.
Because of the change in the badness() heuristic's baseline, it is also
necessary to introduce a new user interface to tune it. It's not possible
to redefine the meaning of /proc/pid/oom_adj with a new scale since the
ABI cannot be changed for backward compatability. Instead, a new tunable,
/proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may
be used to polarize the heuristic such that certain tasks are never
considered for oom kill while others may always be considered. The value
is added directly into the badness() score so a value of -500, for
example, means to discount 50% of its memory consumption in comparison to
other tasks either on the system, bound to the mempolicy, in the cpuset,
or sharing the same memory controller.
/proc/pid/oom_adj is changed so that its meaning is rescaled into the
units used by /proc/pid/oom_score_adj, and vice versa. Changing one of
these per-task tunables will rescale the value of the other to an
equivalent meaning. Although /proc/pid/oom_adj was originally defined as
a bitshift on the badness score, it now shares the same linear growth as
/proc/pid/oom_score_adj but with different granularity. This is required
so the ABI is not broken with userspace applications and allows oom_adj to
be deprecated for future removal.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a kernel thread is using use_mm(), badness() returns a positive value.
This is not a big issue because caller take care of it correctly. But
there is one exception, /proc/<pid>/oom_score calls badness() directly and
doesn't care that the task is a regular process.
Another example, /proc/1/oom_score return !0 value. But it's unkillable.
This incorrectness makes administration a little confusing.
This patch fixes it.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>