cgroup->name handling became quite complicated over time involving
dedicated struct cgroup_name for RCU protection. Now that cgroup is
on kernfs, we can drop all of it and simply use kernfs_name/path() and
friends. Replace cgroup->name and all related code with kernfs
name/path constructs.
* Reimplement cgroup_name() and cgroup_path() as thin wrappers on top
of kernfs counterparts, which involves semantic changes.
pr_cont_cgroup_name() and pr_cont_cgroup_path() added.
* cgroup->name handling dropped from cgroup_rename().
* All users of cgroup_name/path() updated to the new semantics. Users
which were formatting the string just to printk them are converted
to use pr_cont_cgroup_name/path() instead, which simplifies things
quite a bit. As cgroup_name() no longer requires RCU read lock
around it, RCU lockings which were protecting only cgroup_name() are
removed.
v2: Comment above oom_info_lock updated as suggested by Michal.
v3: dummy_top doesn't have a kn associated and
pr_cont_cgroup_name/path() ended up calling the matching kernfs
functions with NULL kn leading to oops. Test for NULL kn and
print "/" if so. This issue was reported by Fengguang Wu.
v4: Rebased on top of 0ab02ca8f8 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
cgroup_subsys is a bit messier than it needs to be.
* The name of a subsys can be different from its internal identifier
defined in cgroup_subsys.h. Most subsystems use the matching name
but three - cpu, memory and perf_event - use different ones.
* cgroup_subsys_id enums are postfixed with _subsys_id and each
cgroup_subsys is postfixed with _subsys. cgroup.h is widely
included throughout various subsystems, it doesn't and shouldn't
have claim on such generic names which don't have any qualifier
indicating that they belong to cgroup.
* cgroup_subsys->subsys_id should always equal the matching
cgroup_subsys_id enum; however, we require each controller to
initialize it and then BUG if they don't match, which is a bit
silly.
This patch cleans up cgroup_subsys names and initialization by doing
the followings.
* cgroup_subsys_id enums are now postfixed with _cgrp_id, and each
cgroup_subsys with _cgrp_subsys.
* With the above, renaming subsys identifiers to match the userland
visible names doesn't cause any naming conflicts. All non-matching
identifiers are renamed to match the official names.
cpu_cgroup -> cpu
mem_cgroup -> memory
perf -> perf_event
* controllers no longer need to initialize ->subsys_id and ->name.
They're generated in cgroup core and set automatically during boot.
* Redundant cgroup_subsys declarations removed.
* While updating BUG_ON()s in cgroup_init_early(), convert them to
WARN()s. BUGging that early during boot is stupid - the kernel
can't print anything, even through serial console and the trap
handler doesn't even link stack frame properly for back-tracing.
This patch doesn't introduce any behavior changes.
v2: Rebased on top of fe1217c4f3 ("net: net_cls: move cgroupfs
classid handling into core").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: "David S. Miller" <davem@davemloft.net>
Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Acked-by: Ingo Molnar <mingo@redhat.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Thomas Graf <tgraf@suug.ch>
As sysfs was kernfs's only user, kernfs has been piggybacking on
CONFIG_SYSFS; however, kernfs is scheduled to grow a new user very
soon. Introduce a separate config option CONFIG_KERNFS which is to be
selected by kernfs users.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
kernfs_node->parent and ->name are currently marked as "published"
indicating that kernfs users may access them directly; however, those
fields may get updated by kernfs_rename[_ns]() and unrestricted access
may lead to erroneous values or oops.
Protect ->parent and ->name updates with a irq-safe spinlock
kernfs_rename_lock and implement the following accessors for these
fields.
* kernfs_name() - format the node's name into the specified buffer
* kernfs_path() - format the node's path into the specified buffer
* pr_cont_kernfs_name() - pr_cont a node's name (doesn't need buffer)
* pr_cont_kernfs_path() - pr_cont a node's path (doesn't need buffer)
* kernfs_get_parent() - pin and return a node's parent
All can be called under any context. The recursive sysfs_pathname()
in fs/sysfs/dir.c is replaced with kernfs_path() and
sysfs_rename_dir_ns() is updated to use kernfs_get_parent() instead of
dereferencing parent directly.
v2: Dummy definition of kernfs_path() for !CONFIG_KERNFS was missing
static inline making it cause a lot of build warnings. Add it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Implement helpers to determine node from dentry and root from
super_block. Also add a kernfs_rename_ns() wrapper which assumes NULL
namespace. These generally make sense and will be used by cgroup.
v2: Some dummy implementations for !CONFIG_SYSFS was missing. Fixed.
Reported by kbuild test robot.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
A write to a kernfs_node is buffered through a kernel buffer. Writes
<= PAGE_SIZE are performed atomically, while larger ones are executed
in PAGE_SIZE chunks. While this is enough for sysfs, cgroup which is
scheduled to be converted to use kernfs needs a bit more control over
it.
This patch adds kernfs_ops->atomic_write_len. If not set (zero), the
behavior stays the same. If set, writes upto the size are executed
atomically and larger writes are rejected with -E2BIG.
A different implementation strategy would be allowing configuring
chunking size while making the original write size available to the
write method; however, such strategy, while being more complicated,
doesn't really buy anything. If the write implementation has to
handle chunking, the specific chunk size shouldn't matter all that
much.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently, kernfs_nodes are made visible to userland on creation,
which makes it difficult for kernfs users to atomically succeed or
fail creation of multiple nodes. In addition, if something fails
after creating some nodes, the created nodes might already be in use
and their active refs need to be drained for removal, which has the
potential to introduce tricky reverse locking dependency on active_ref
depending on how the error path is synchronized.
This patch introduces per-root flag KERNFS_ROOT_CREATE_DEACTIVATED.
If set, all nodes under the root are created in the deactivated state
and stay invisible to userland until explicitly enabled by the new
kernfs_activate() API. Also, nodes which have never been activated
are guaranteed to bypass draining on removal thus allowing error paths
to not worry about lockding dependency on active_ref draining.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
kernfs_iop_lookup(), kernfs_dir_pos() and kernfs_dir_next_pos() were
missing kernfs_active() tests before using the found kernfs_node. As
deactivated state is currently visible only while a node is being
removed, this doesn't pose an actual problem. e.g. lookup succeeding
on a deactivated node doesn't harm anything as the eventual file
operations are gonna fail and those failures are indistinguishible
from the cases in which the lookups had happened before the node was
deactivated.
However, we're gonna allow new nodes to be created deactivated and
then activated explicitly by the kernfs user when it sees fit. This
is to support atomically making multiple nodes visible to userland and
thus those nodes must not be visible to userland before activated.
Let's plug the lookup and readdir holes so that deactivated nodes are
invisible to userland.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add two super_block related syscall callbacks ->remount_fs() and
->show_options() to kernfs_syscall_ops. These simply forward the
matching super_operations.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We're gonna need non-dir syscall callbacks, which will make dir_ops a
misnomer. Let's rename kernfs_dir_ops to kernfs_syscall_ops.
This is pure rename.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
kernfs_dir_ops are currently being invoked without any active
reference, which makes it tricky for the invoked operations to
determine whether the objects associated those nodes are safe to
access and will remain that way for the duration of such operations.
kernfs already has active_ref mechanism to deal with this which makes
the removal of a given node the synchronization point for gating the
file operations. There's no reason for dir_ops to be any different.
Update the dir_ops handling so that active_ref is held while the
dir_ops are executing. This guarantees that while a dir_ops is
executing the target nodes stay alive.
As kernfs_dir_ops doesn't have any in-kernel user at this point, this
doesn't affect anybody.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
All device_schedule_callback_owner() users are converted to use
device_remove_file_self(). Remove now unused
{sysfs|device}_schedule_callback_owner().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sometimes it's necessary to implement a node which wants to delete
nodes including itself. This isn't straightforward because of kernfs
active reference. While a file operation is in progress, an active
reference is held and kernfs_remove() waits for all such references to
drain before completing. For a self-deleting node, this is a deadlock
as kernfs_remove() ends up waiting for an active reference that itself
is sitting on top of.
This currently is worked around in the sysfs layer using
sysfs_schedule_callback() which makes such removals asynchronous.
While it works, it's rather cumbersome and inherently breaks
synchronicity of the operation - the file operation which triggered
the operation may complete before the removal is finished (or even
started) and the removal may fail asynchronously. If a removal
operation is immmediately followed by another operation which expects
the specific name to be available (e.g. removal followed by rename
onto the same name), there's no way to make the latter operation
reliable.
The thing is there's no inherent reason for this to be asynchrnous.
All that's necessary to do this synchronous is a dedicated operation
which drops its own active ref and deactivates self. This patch
implements kernfs_remove_self() and its wrappers in sysfs and driver
core. kernfs_remove_self() is to be called from one of the file
operations, drops the active ref the task is holding, removes the self
node, and restores active ref to the dead node so that the ref is
balanced afterwards. __kernfs_remove() is updated so that it takes an
early exit if the target node is already fully removed so that the
active ref restored by kernfs_remove_self() after removal doesn't
confuse the deactivation path.
This makes implementing self-deleting nodes very easy. The normal
removal path doesn't even need to be changed to use
kernfs_remove_self() for the self-deleting node. The method can
invoke kernfs_remove_self() on itself before proceeding the normal
removal path. kernfs_remove() invoked on the node by the normal
deletion path will simply be ignored.
This will replace sysfs_schedule_callback(). A subtle feature of
sysfs_schedule_callback() is that it collapses multiple invocations -
even if multiple removals are triggered, the removal callback is run
only once. An equivalent effect can be achieved by testing the return
value of kernfs_remove_self() - only the one which gets %true return
value should proceed with actual deletion. All other instances of
kernfs_remove_self() will wait till the enclosing kernfs operation
which invoked the winning instance of kernfs_remove_self() finishes
and then return %false. This trivially makes all users of
kernfs_remove_self() automatically show correct synchronous behavior
even when there are multiple concurrent operations - all "echo 1 >
delete" instances will finish only after the whole operation is
completed by one of the instances.
Note that manipulation of active ref is implemented in separate public
functions - kernfs_[un]break_active_protection().
kernfs_remove_self() is the only user at the moment but this will be
used to cater to more complex cases.
v2: For !CONFIG_SYSFS, dummy version kernfs_remove_self() was missing
and sysfs_remove_file_self() had incorrect return type. Fix it.
Reported by kbuild test bot.
v3: kernfs_[un]break_active_protection() separated out from
kernfs_remove_self() and exposed as public API.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
KERNFS_REMOVED is used to mark half-initialized and dying nodes so
that they don't show up in lookups and deny adding new nodes under or
renaming it; however, its role overlaps that of deactivation.
It's necessary to deny addition of new children while removal is in
progress; however, this role considerably intersects with deactivation
- KERNFS_REMOVED prevents new children while deactivation prevents new
file operations. There's no reason to have them separate making
things more complex than necessary.
This patch removes KERNFS_REMOVED.
* Instead of KERNFS_REMOVED, each node now starts its life
deactivated. This means that we now use both atomic_add() and
atomic_sub() on KN_DEACTIVATED_BIAS, which is INT_MIN. The compiler
generates an overflow warnings when negating INT_MIN as the negation
can't be represented as a positive number. Nothing is actually
broken but let's bump BIAS by one to avoid the warnings for archs
which negates the subtrahend..
* A new helper kernfs_active() which tests whether kn->active >= 0 is
added for convenience and lockdep annotation. All KERNFS_REMOVED
tests are replaced with negated kernfs_active() tests.
* __kernfs_remove() is updated to deactivate, but not drain, all nodes
in the subtree instead of setting KERNFS_REMOVED. This removes
deactivation from kernfs_deactivate(), which is now renamed to
kernfs_drain().
* Sanity check on KERNFS_REMOVED in kernfs_put() is replaced with
checks on the active ref.
* Some comment style updates in the affected area.
v2: Reordered before removal path restructuring. kernfs_active()
dropped and kernfs_get/put_active() used instead. RB_EMPTY_NODE()
used in the lookup paths.
v3: Reverted most of v2 except for creating a new node with
KN_DEACTIVATED_BIAS.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There currently are two mechanisms gating active ref lockdep
annotations - KERNFS_LOCKDEP flag and KERNFS_ACTIVE_REF type mask.
The former disables lockdep annotations in kernfs_get/put_active()
while the latter disables all of kernfs_deactivate().
While KERNFS_ACTIVE_REF also behaves as an optimization to skip the
deactivation step for non-file nodes, the benefit is marginal and it
needlessly diverges code paths. Let's drop KERNFS_ACTIVE_REF.
While at it, add a test helper kernfs_lockdep() to test KERNFS_LOCKDEP
flag so that it's more convenient and the related code can be compiled
out when not enabled.
v2: Refreshed on top of ("kernfs: make kernfs_deactivate() honor
KERNFS_LOCKDEP flag"). As the earlier patch already added
KERNFS_LOCKDEP tests to kernfs_deactivate(), those additions are
dropped from this patch and the existing ones are simply converted
to kernfs_lockdep().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
kernfs_addrm_cxt and the accompanying kernfs_addrm_start/finish() were
added because there were operations which should be performed outside
kernfs_mutex after adding and removing kernfs_nodes. The necessary
operations were recorded in kernfs_addrm_cxt and performed by
kernfs_addrm_finish(); however, after the recent changes which
relocated deactivation and unmapping so that they're performed
directly during removal, the only operation kernfs_addrm_finish()
performs is kernfs_put(), which can be moved inside the removal path
too.
This patch moves the kernfs_put() of the base ref to __kernfs_remove()
and remove kernfs_addrm_cxt and kernfs_addrm_start/finish().
* kernfs_add_one() is updated to grab and release kernfs_mutex itself.
sysfs_addrm_start/finish() invocations around it are removed from
all users.
* __kernfs_remove() puts an unlinked node directly instead of chaining
it to kernfs_addrm_cxt. Its callers are updated to grab and release
kernfs_mutex instead of calling kernfs_addrm_start/finish() around
it.
v2: Rebased on top of "kernfs: associate a new kernfs_node with its
parent on creation" which dropped @parent from kernfs_add_one().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
kernfs_unmap_bin_file() is supposed to unmap all memory mappings of
the target file before kernfs_remove() finishes; however, it currently
is being called from kernfs_addrm_finish() and has the same race
problem as the original implementation of deactivation when there are
multiple removers - only the remover which snatches the node to its
addrm_cxt->removed list is guaranteed to wait for its completion
before returning.
It can be easily fixed by moving kernfs_unmap_bin_file() invocation
from kernfs_addrm_finish() to kernfs_deactivated(). The function may
be called multiple times but that shouldn't do any harm.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The recursive nature of kernfs_remove() means that, even if
kernfs_remove() is not allowed to be called multiple times on the same
node, there may be race conditions between removal of parent and its
descendants. While we can claim that kernfs_remove() shouldn't be
called on one of the descendants while the removal of an ancestor is
in progress, such rule is unnecessarily restrictive and very difficult
to enforce. It's better to simply allow invoking kernfs_remove() as
the caller sees fit as long as the caller ensures that the node is
accessible.
The current behavior in such situations is broken. Whoever enters
removal path first takes the node off the hierarchy and then
deactivates. Following removers either return as soon as it notices
that it's not the first one or can't even find the target node as it
has already been removed from the hierarchy. In both cases, the
following removers may finish prematurely while the nodes which should
be removed and drained are still being processed by the first one.
This patch restructures so that multiple removers, whether through
recursion or direction invocation, always follow the following rules.
* When there are multiple concurrent removers, only one puts the base
ref.
* Regardless of which one puts the base ref, all removers are blocked
until the target node is fully deactivated and removed.
To achieve the above, removal path now first marks all descendants
including self REMOVED and then deactivates and unlinks leftmost
descendant one-by-one. kernfs_deactivate() is called directly from
__kernfs_removal() and drops and regrabs kernfs_mutex for each
descendant to drain active refs. As this means that multiple removers
can enter kernfs_deactivate() for the same node, the function is
updated so that it can handle multiple deactivators of the same node -
only one actually deactivates but all wait till drain completion.
The restructured removal path guarantees that a removed node gets
unlinked only after the node is deactivated and drained. Combined
with proper multiple deactivator handling, this guarantees that any
invocation of kernfs_remove() returns only after the node itself and
all its descendants are deactivated, drained and removed.
v2: Draining separated into a separate loop (used to be in the same
loop as unlink) and done from __kernfs_deactivate(). This is to
allow exposing deactivation as a separate interface later.
Root node removal was broken in v1 patch. Fixed.
v3: Revert most of v2 except for root node removal fix and
simplification of KERNFS_REMOVED setting loop.
v4: Refreshed on top of ("kernfs: make kernfs_deactivate() honor
KERNFS_LOCKDEP flag").
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
kernfs_node->u.completion is used to notify deactivation completion
from kernfs_put_active() to kernfs_deactivate(). We now allow
multiple racing removals of the same node and the current removal
scheme is no longer correct - kernfs_remove() invocation may return
before the node is properly deactivated if it races against another
removal. The removal path will be restructured to address the issue.
To help such restructure which requires supporting multiple waiters,
this patch replaces kernfs_node->u.completion with
kernfs_root->deactivate_waitq. This makes deactivation event
notifications share a per-root waitqueue_head; however, the wait path
is quite cold and this will also allow shaving one pointer off
kernfs_node.
v2: Refreshed on top of ("kernfs: make kernfs_deactivate() honor
KERNFS_LOCKDEP flag").
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
kernfs_deactivate() forgot to check whether KERNFS_LOCKDEP is set
before performing lockdep annotations and ends up feeding
uninitialized lockdep_map to lockdep triggering warning like the
following on USB stick hotunplug.
usb 1-2: USB disconnect, device number 2
INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 1 PID: 62 Comm: khubd Not tainted 3.13.0-work+ #82
Hardware name: empty empty/S3992, BIOS 080011 10/26/2007
ffff880065ca7f60 ffff88013a4ffa08 ffffffff81cfb6bd 0000000000000002
ffff88013a4ffac8 ffffffff810f8530 ffff88013a4fc710 0000000000000002
ffff880100000000 ffffffff82a3db50 0000000000000001 ffff88013a4fc710
Call Trace:
[<ffffffff81cfb6bd>] dump_stack+0x4e/0x7a
[<ffffffff810f8530>] __lock_acquire+0x1910/0x1e70
[<ffffffff810f931a>] lock_acquire+0x9a/0x1d0
[<ffffffff8127c75e>] kernfs_deactivate+0xee/0x130
[<ffffffff8127d4c8>] kernfs_addrm_finish+0x38/0x60
[<ffffffff8127d701>] kernfs_remove_by_name_ns+0x51/0xa0
[<ffffffff8127b4f1>] remove_files.isra.1+0x41/0x80
[<ffffffff8127b7e7>] sysfs_remove_group+0x47/0xa0
[<ffffffff8127b873>] sysfs_remove_groups+0x33/0x50
[<ffffffff8177d66d>] device_remove_attrs+0x4d/0x80
[<ffffffff8177e25e>] device_del+0x12e/0x1d0
[<ffffffff819722c2>] usb_disconnect+0x122/0x1a0
[<ffffffff819749b5>] hub_thread+0x3c5/0x1290
[<ffffffff810c6a6d>] kthread+0xed/0x110
[<ffffffff81d0a56c>] ret_from_fork+0x7c/0xb0
Fix it by making kernfs_deactivate() perform lockdep annotations only
if KERNFS_LOCKDEP is set.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fabio Estevam <festevam@gmail.com>
Reported-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
fs/nfs/nfs3proc.c is making use of xattr but was getting linux/xattr.h
indirectly through linux/cgroup.h, which will soon drop the inclusion
of xattr.h. Explicitly include linux/xattr.h from nfs3proc.c so that
compilation doesn't fail when linux/cgroup.h drops linux/xattr.h.
As the following cgroup changes will depend on these changes, it
probably would be easier to route this through cgroup branch. Would
that be okay?
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: linux-nfs@vger.kernel.org
HPFS needs to load 4 consecutive 512-byte sectors when accessing the
directory nodes or bitmaps. We can't switch to 2048-byte block size
because files are allocated in the units of 512-byte sectors.
Previously, the driver would allocate a 2048-byte area using kmalloc,
copy the data from four buffers to this area and eventually copy them
back if they were modified.
In the current implementation of the buffer cache, buffers are allocated
in the pagecache. That means that 4 consecutive 512-byte buffers are
stored in consecutive areas in the kernel address space. So, we don't
need to allocate extra memory and copy the content of the buffers there.
This patch optimizes the code to avoid copying the buffers. It checks
if the four buffers are stored in contiguous memory - if they are not,
it falls back to allocating a 2048-byte area and copying data there.
Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Previously, hpfs scanned all bitmaps each time the user asked for free
space using statfs. This patch changes it so that hpfs scans the
bitmaps only once, remembes the free space and on next invocation of
statfs it returns the value instantly.
New versions of wine are hammering on the statfs syscall very heavily,
making some games unplayable when they're stored on hpfs, with load
times in minutes.
This should be backported to the stable kernels because it fixes
user-visible problem (excessive level load times in wine).
Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both proc files are writeable and used for configuring cells. But
there is missing correct mode flag for writeable files. Without
this patch both proc files are read only.
[ It turns out they aren't really read-only, since root can write to
them even if the write bit isn't set due to CAP_DAC_OVERRIDE ]
Signed-off-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull cifs fixes from Steve French:
"A set of cifs fixes (mostly for symlinks, and SMB2 xattrs) and
cleanups"
* 'for-linus' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Fix check for regular file in couldbe_mf_symlink()
[CIFS] Fix SMB2 mounts so they don't try to set or get xattrs via cifs
CIFS: Cleanup cifs open codepath
CIFS: Remove extra indentation in cifs_sfu_type
CIFS: Cleanup cifs_mknod
CIFS: Cleanup CIFSSMBOpen
cifs: Add support for follow_link on dfs shares under posix extensions
cifs: move unix extension call to cifs_query_symlink()
cifs: Re-order M-F Symlink code
cifs: Add create MFSymlinks to protocol ops struct
cifs: use protocol specific call for query_mf_symlink()
cifs: Rename MF symlink function names
cifs: Rename and cleanup open_query_close_cifs_symlink()
cifs: Fix memory leak in cifs_hardlink()
Pull vfs fixes from Al Viro:
"Several obvious fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
Fix mountpoint reference leakage in linkat
hfsplus: use xattr handlers for removexattr
Typo in compat_sys_lseek() declaration
fs/super.c: sync ro remount after blocking writers
vfs: unexport the getname() symbol
Highlights:
- Fix several races in nfs_revalidate_mapping
- NFSv4.1 slot leakage in the pNFS files driver
- Stable fix for a slot leak in nfs40_sequence_done
- Don't reject NFSv4 servers that support ACLs with only ALLOW aces
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJS7Bb+AAoJEGcL54qWCgDyDuQP/17nKR5e6MLhixcAbvlcH+pN
8CGolAM3HmRXDWUW/PkBH3UguG8Tzx1Ex26vIxipPeTSwZabf6194Twj6L97DEGZ
2SouD158BW1TkAbhEN/alKB/4ZCPos05iXjZkrL7MRff+8FD0UvWR2pBT1F2jQdY
ZftG76Q72qhZHfH07ZMxM/v4Oy2Ge98RDD35gfuuqMSjHpmN9tiB55PeheW33LVY
fu6I/JEwmlJpgy2qUcDv7v0V4mDpjC7XbcjjHpMHL8zp/C5Rx/rdgt9OQPlwmjdV
FD8MWNXLc5TWxIouLDFPVUv3WZPjyu449QHS9Wc95fSqsHcdl4j4SwLAoSvUIdHt
vDI5PtWhw3WAezbtiuCQnT0xcoIOn5bLjOVP13taDcV9vlZLcFlyOpZ5gVE4/Yju
zm4sCW2+imDc74ERGa4fukF6QhzzAVmD8RlCJwuNzwCfXiZ36+xSanMYiPoUiwLL
OVNgymrm0fe7GVFQKWN2D+Vr68OQEmARO+KfA3UzP5rQV+9CU8zSHjbcoRWZ59QG
VahOS5WDLQSrMp8W37yAHH9IiAWveAAKJJTHlOniRqH90QYPgyW18fTo7YcpW313
AQGFgr/1n4t27MWRLu5rdoN5v8+kwNi0UV6oboNIPoP1v15NkEMvc7HKFj5M883R
qEYfe5wqN/eRNj68NT/+
=B7f0
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.14-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights:
- Fix several races in nfs_revalidate_mapping
- NFSv4.1 slot leakage in the pNFS files driver
- Stable fix for a slot leak in nfs40_sequence_done
- Don't reject NFSv4 servers that support ACLs with only ALLOW aces"
* tag 'nfs-for-3.14-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
nfs: initialize the ACL support bits to zero.
NFSv4.1: Cleanup
NFSv4.1: Clean up nfs41_sequence_done
NFSv4: Fix a slot leak in nfs40_sequence_done
NFSv4.1 free slot before resending I/O to MDS
nfs: add memory barriers around NFS_INO_INVALID_DATA and NFS_INO_INVALIDATING
NFS: Fix races in nfs_revalidate_mapping
sunrpc: turn warn_gssd() log message into a dprintk()
NFS: fix the handling of NFS_INO_INVALID_DATA flag in nfs_revalidate_mapping
nfs: handle servers that support only ALLOW ACE type.
Recent changes to retry on ESTALE in linkat
(commit 442e31ca5a)
introduced a mountpoint reference leak and a small memory
leak in case a filesystem link operation returns ESTALE
which is pretty normal for distributed filesystems like
lustre, nfs and so on.
Free old_path in such a case.
[AV: there was another missing path_put() nearby - on the previous
goto retry]
Signed-off-by: Oleg Drokin: <green@linuxhacker.ru>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
hfsplus was already using the handlers for get and set operations,
and with the removal of can_set_xattr we've now allow operations that
wouldn't otherwise be allowed.
With this we can also centralize the special-casing of the osx.
attrs that don't have prefixes on disk in the osx xattr handlers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Move sync_filesystem() after sb_prepare_remount_readonly(). If writers
sneak in anywhere from sync_filesystem() to sb_prepare_remount_readonly()
it can cause inodes to be dirtied and writeback to occur well after
sys_mount() has completely successfully.
This was spotted by corrupted ubifs filesystems on reboot, but appears
that it can cause issues with any filesystem using writeback.
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
CC: Richard Weinberger <richard@nod.at>
Co-authored-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Andrew Ruder <andrew.ruder@elecsyscorp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Leaving getname() exported when putname() isn't is a bad idea.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
MF Symlinks are regular files containing content in a specified format.
The function couldbe_mf_symlink() checks the mode for a set S_IFREG bit
as a test to confirm that it is a regular file. This bit is also set for
other filetypes and simply checking for this bit being set may return
false positives.
We ensure that we are actually checking for a regular file by using the
S_ISREG macro to test instead.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reported-by: Neil Brown <neilb@suse.de>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Avoid returning incorrect acl mask attributes when the server doesn't
support ACLs.
Signed-off-by: Malahal Naineni <malahal@us.ibm.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Pull btrfs updates from Chris Mason:
"This is a pretty big pull, and most of these changes have been
floating in btrfs-next for a long time. Filipe's properties work is a
cool building block for inheriting attributes like compression down on
a per inode basis.
Jeff Mahoney kicked in code to export filesystem info into sysfs.
Otherwise, lots of performance improvements, cleanups and bug fixes.
Looks like there are still a few other small pending incrementals, but
I wanted to get the bulk of this in first"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (149 commits)
Btrfs: fix spin_unlock in check_ref_cleanup
Btrfs: setup inode location during btrfs_init_inode_locked
Btrfs: don't use ram_bytes for uncompressed inline items
Btrfs: fix btrfs_search_slot_for_read backwards iteration
Btrfs: do not export ulist functions
Btrfs: rework ulist with list+rb_tree
Btrfs: fix memory leaks on walking backrefs failure
Btrfs: fix send file hole detection leading to data corruption
Btrfs: add a reschedule point in btrfs_find_all_roots()
Btrfs: make send's file extent item search more efficient
Btrfs: fix to catch all errors when resolving indirect ref
Btrfs: fix protection between walking backrefs and root deletion
btrfs: fix warning while merging two adjacent extents
Btrfs: fix infinite path build loops in incremental send
btrfs: undo sysfs when open_ctree() fails
Btrfs: fix snprintf usage by send's gen_unique_name
btrfs: fix defrag 32-bit integer overflow
btrfs: sysfs: list the NO_HOLES feature
btrfs: sysfs: don't show reserved incompat feature
btrfs: call permission checks earlier in ioctls and return EPERM
...
Pull some further ceph acl cleanups from Sage Weil:
"I do have a couple patches on top of what's in your tree, though, that
clean up a couple duplicated lines in your fix and apply Christoph's
cleanup"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: simplify ceph_{get,init}_acl
ceph: remove duplicate declaration of ceph_setattr
- ->get_acl only gets called after we checked for a cached ACL, so no
need to call get_cached_acl again.
- no need to check IS_POSIXACL in ->get_acl, without that it should
never get set as all the callers that set it already have the check.
- you should be able to use the full posix_acl_create in CEPH
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Sage Weil <sage@inktank.com>
Pull block IO driver changes from Jens Axboe:
- bcache update from Kent Overstreet.
- two bcache fixes from Nicholas Swenson.
- cciss pci init error fix from Andrew.
- underflow fix in the parallel IDE pg_write code from Dan Carpenter.
I'm sure the 1 (or 0) users of that are now happy.
- two PCI related fixes for sx8 from Jingoo Han.
- floppy init fix for first block read from Jiri Kosina.
- pktcdvd error return miss fix from Julia Lawall.
- removal of IRQF_SHARED from the SEGA Dreamcast CD-ROM code from
Michael Opdenacker.
- comment typo fix for the loop driver from Olaf Hering.
- potential oops fix for null_blk from Raghavendra K T.
- two fixes from Sam Bradshaw (Micron) for the mtip32xx driver, fixing
an OOM problem and a problem with handling security locked conditions
* 'for-3.14/drivers' of git://git.kernel.dk/linux-block: (47 commits)
mg_disk: Spelling s/finised/finished/
null_blk: Null pointer deference problem in alloc_page_buffers
mtip32xx: Correctly handle security locked condition
mtip32xx: Make SGL container per-command to eliminate high order dma allocation
drivers/block/loop.c: fix comment typo in loop_config_discard
drivers/block/cciss.c:cciss_init_one(): use proper errnos
drivers/block/paride/pg.c: underflow bug in pg_write()
drivers/block/sx8.c: remove unnecessary pci_set_drvdata()
drivers/block/sx8.c: use module_pci_driver()
floppy: bail out in open() if drive is not responding to block0 read
bcache: Fix auxiliary search trees for key size > cacheline size
bcache: Don't return -EINTR when insert finished
bcache: Improve bucket_prio() calculation
bcache: Add bch_bkey_equal_header()
bcache: update bch_bkey_try_merge
bcache: Move insert_fixup() to btree_keys_ops
bcache: Convert sorting to btree_keys
bcache: Convert debug code to btree_keys
bcache: Convert btree_iter to struct btree_keys
bcache: Refactor bset_tree sysfs stats
...
Pull core block IO changes from Jens Axboe:
"The major piece in here is the immutable bio_ve series from Kent, the
rest is fairly minor. It was supposed to go in last round, but
various issues pushed it to this release instead. The pull request
contains:
- Various smaller blk-mq fixes from different folks. Nothing major
here, just minor fixes and cleanups.
- Fix for a memory leak in the error path in the block ioctl code
from Christian Engelmayer.
- Header export fix from CaiZhiyong.
- Finally the immutable biovec changes from Kent Overstreet. This
enables some nice future work on making arbitrarily sized bios
possible, and splitting more efficient. Related fixes to immutable
bio_vecs:
- dm-cache immutable fixup from Mike Snitzer.
- btrfs immutable fixup from Muthu Kumar.
- bio-integrity fix from Nic Bellinger, which is also going to stable"
* 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
xtensa: fixup simdisk driver to work with immutable bio_vecs
block/blk-mq-cpu.c: use hotcpu_notifier()
blk-mq: for_each_* macro correctness
block: Fix memory leak in rw_copy_check_uvector() handling
bio-integrity: Fix bio_integrity_verify segment start bug
block: remove unrelated header files and export symbol
blk-mq: uses page->list incorrectly
blk-mq: use __smp_call_function_single directly
btrfs: fix missing increment of bi_remaining
Revert "block: Warn and free bio if bi_end_io is not set"
block: Warn and free bio if bi_end_io is not set
blk-mq: fix initializing request's start time
block: blk-mq: don't export blk_mq_free_queue()
block: blk-mq: make blk_sync_queue support mq
block: blk-mq: support draining mq queue
dm cache: increment bi_remaining when bi_end_io is restored
block: fixup for generic bio chaining
block: Really silence spurious compiler warnings
block: Silence spurious compiler warnings
block: Kill bio_pair_split()
...
Pull nfsd updates from Bruce Fields:
- Handle some loose ends from the vfs read delegation support.
(For example nfsd can stop breaking leases on its own in a
fewer places where it can now depend on the vfs to.)
- Make life a little easier for NFSv4-only configurations
(thanks to Kinglong Mee).
- Fix some gss-proxy problems (thanks Jeff Layton).
- miscellaneous bug fixes and cleanup
* 'for-3.14' of git://linux-nfs.org/~bfields/linux: (38 commits)
nfsd: consider CLAIM_FH when handing out delegation
nfsd4: fix delegation-unlink/rename race
nfsd4: delay setting current_fh in open
nfsd4: minor nfs4_setlease cleanup
gss_krb5: use lcm from kernel lib
nfsd4: decrease nfsd4_encode_fattr stack usage
nfsd: fix encode_entryplus_baggage stack usage
nfsd4: simplify xdr encoding of nfsv4 names
nfsd4: encode_rdattr_error cleanup
nfsd4: nfsd4_encode_fattr cleanup
minor svcauth_gss.c cleanup
nfsd4: better VERIFY comment
nfsd4: break only delegations when appropriate
NFSD: Fix a memory leak in nfsd4_create_session
sunrpc: get rid of use_gssp_lock
sunrpc: fix potential race between setting use_gss_proxy and the upcall rpc_clnt
sunrpc: don't wait for write before allowing reads from use-gss-proxy file
nfsd: get rid of unused function definition
Define op_iattr for nfsd4_open instead using macro
NFSD: fix compile warning without CONFIG_NFSD_V3
...
Chris Mason reported a NULL pointer derefernence in generic_getxattr()
that was due to sb->s_xattr being NULL.
The reason is that the nfs #ifdef's for ACL support were misplaced, and
the nfs3 inode operations had the xattr operation pointers set up, even
though xattrs were not actually supported. As a result, the xattr code
was being called without the infrastructure having been set up.
Move the #ifdef's appropriately.
Reported-and-tested-by: Chris Mason <clm@fb.com>
Acked-by: Al Viro viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge random fixes from Andrew Morton:
"Random fixes.
I have one batch remaining for -rc1, mainly zram changes which await a
merge of Jens's trees"
* emailed patches fron Andrew Morton akpm@linux-foundation.org>:
MAINTAINERS: ADI Linux development mailing lists: change to the new server
Documentation: fix multiple typo occurences s/KenelVersion/KernelVersion/
dma-debug: fix overlap detection
memblock: add limit checking to memblock_virt_alloc
mm/readahead.c: fix do_readahead() for no readpage(s)
mm/slub.c: do not VM_BUG_ON_PAGE() for temporary on-stack pages
slab: fix wrong retval on kmem_cache_create_memcg error path
s390/compat: change parameter types from unsigned long to compat_ulong_t
fs/compat: fix lookup_dcookie() parameter handling
fs/compat: fix parameter handling for compat readv/writev syscalls
mm/mempolicy.c: convert to pr_foo()
mm: numa: initialise numa balancing after jump label initialisation
mm/page-writeback.c: do not count anon pages as dirtyable memory
mm/page-writeback.c: fix dirty_balance_reserve subtraction from dirtyable memory
mm: document improved handling of swappiness==0
lib/genalloc.c: add check gen_pool_dma_alloc() if dma pointer is not NULL
Commit d5dc77bfee ("consolidate compat lookup_dcookie()") coverted all
architectures to the new compat_sys_lookup_dcookie() syscall.
The "len" paramater of the new compat syscall must have the type
compat_size_t in order to enforce zero extension for architectures where
the ABI requires that the caller of a function performed zero and/or
sign extension to 64 bit of all parameters.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: <stable@vger.kernel.org> [v3.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We got a report that the pwritev syscall does not work correctly in
compat mode on s390.
It turned out that with commit 72ec35163f ("switch compat readv/writev
variants to COMPAT_SYSCALL_DEFINE") we lost the zero extension of a
couple of syscall parameters because the some parameter types haven't
been converted from unsigned long to compat_ulong_t.
This is needed for architectures where the ABI requires that the caller
of a function performed zero and/or sign extension to 64 bit of all
parameters.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: <stable@vger.kernel.org> [v3.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull fanotify use-after-free fixes from Jan Kara:
"Three fixes for the fanotify use after free problems guys were
reporting.
I have ended up with different lifetime rules for struct
fanotify_event_info depending on whether it is for permission event or
normal event which isn't ideal. My plan is to split these into two
different structures (as permission events need larger struct anyway)
which will make the rules trivial again. But that can wait for later
I guess (but I can add the patch to the pile if you want), now I
wanted to make -rc1 boot for these guys"
[ "These guys" being Jiri Kosina and Dave Jones that reported the slab
corruption issues due to incorrect object lifetimes ]
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fanotify: Fix use after free for permission events
fsnotify: Do not return merged event from fsnotify_add_notify_event()
fanotify: Fix use after free in mask checking
The merge of commit 7221fe4c2e ("ceph: add acl for cephfs") raced with
upstream changes in the generic POSIX ACL code (eg commit 2aeccbe957
"fs: add generic xattr_acl handlers" and others).
Some of the fallout was fixed in commit 4db658ea0c ("ceph: Fix up after
semantic merge conflict"), but it was incomplete: the set_acl
inode_operation wasn't getting set, and the prototype needed to be
adjusted a bit (it doesn't take a dentry anymore).
Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>