The problem. When implementing a network namespace I need to be able
to have multiple network devices with the same name. Currently this
is a problem for /sys/class/net/*, /sys/devices/virtual/net/*, and
potentially a few other directories of the form /sys/ ... /net/*.
What this patch does is to add an additional tag field to the
sysfs dirent structure. For directories that should show different
contents depending on the context such as /sys/class/net/, and
/sys/devices/virtual/net/ this tag field is used to specify the
context in which those directories should be visible. Effectively
this is the same as creating multiple distinct directories with
the same name but internally to sysfs the result is nicer.
I am calling the concept of a single directory that looks like multiple
directories all at the same path in the filesystem tagged directories.
For the networking namespace the set of directories whose contents I need
to filter with tags can depend on the presence or absence of hotplug
hardware or which modules are currently loaded. Which means I need
a simple race free way to setup those directories as tagged.
To achieve a reace free design all tagged directories are created
and managed by sysfs itself.
Users of this interface:
- define a type in the sysfs_tag_type enumeration.
- call sysfs_register_ns_types with the type and it's operations
- sysfs_exit_ns when an individual tag is no longer valid
- Implement mount_ns() which returns the ns of the calling process
so we can attach it to a sysfs superblock.
- Implement ktype.namespace() which returns the ns of a syfs kobject.
Everything else is left up to sysfs and the driver layer.
For the network namespace mount_ns and namespace() are essentially
one line functions, and look to remain that.
Tags are currently represented a const void * pointers as that is
both generic, prevides enough information for equality comparisons,
and is trivial to create for current users, as it is just the
existing namespace pointer.
The work needed in sysfs is more extensive. At each directory
or symlink creating I need to check if the directory it is being
created in is a tagged directory and if so generate the appropriate
tag to place on the sysfs_dirent. Likewise at each symlink or
directory removal I need to check if the sysfs directory it is
being removed from is a tagged directory and if so figure out
which tag goes along with the name I am deleting.
Currently only directories which hold kobjects, and
symlinks are supported. There is not enough information
in the current file attribute interfaces to give us anything
to discriminate on which makes it useless, and there are
no potential users which makes it an uninteresting problem
to solve.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Move complete knowledge of namespaces into the kobject layer
so we can use that information when reporting kobjects to
userspace.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Of the three uses of kref_set in the kernel:
One really should be kref_put as the code is letting go of a
reference,
Two really should be kref_init because the kref is being
initialised.
This suggests that making kref_set available encourages bad code.
So fix the three uses and remove kref_set completely.
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Mimi Zohar <zohar@us.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* git://git.infradead.org/mtd-2.6: (154 commits)
mtd: cfi_cmdset_0002: use AMD standard command-set with Winbond flash chips
mtd: cfi_cmdset_0002: Fix MODULE_ALIAS and linkage for new 0701 commandset ID
mtd: mxc_nand: Remove duplicate NAND_CMD_RESET case value
mtd: update gfp/slab.h includes
jffs2: Stop triggering block erases from jffs2_write_super()
jffs2: Rename jffs2_erase_pending_trigger() to jffs2_dirty_trigger()
jffs2: Use jffs2_garbage_collect_trigger() to trigger pending erases
jffs2: Require jffs2_garbage_collect_trigger() to be called with lock held
jffs2: Wake GC thread when there are blocks to be erased
jffs2: Erase pending blocks in GC pass, avoid invalid -EIO return
jffs2: Add 'work_done' return value from jffs2_erase_pending_blocks()
mtd: mtdchar: Do not corrupt backing device of device node inode
mtd/maps/pcmciamtd: Fix printk format for ssize_t in debug messages
drivers/mtd: Use kmemdup
mtd: cfi_cmdset_0002: Fix argument order in bootloc warning
mtd: nand: add Toshiba TC58NVG0 device ID
pcmciamtd: add another ID
pcmciamtd: coding style cleanups
pcmciamtd: fixing obvious errors
mtd: chips: add SST39WF160x NOR-flashes
...
Trivial conflicts due to dev_node removal in drivers/mtd/maps/pcmciamtd.c
The only way the debugger can handle a trap in inside rcu_lock,
notify_die, or atomic_notifier_call_chain without a recursive fault is
to have a low level "first opportunity handler" do_trap_or_bp() handler.
Generally this will be something the vast majority of folks will not
need, but for those who need it, it is added as a kernel .config
option called KGDB_LOW_LEVEL_TRAP.
Also added was a die notification for oops such that kdb can catch an
oops for analysis.
There appeared to be no obvious way to pass the struct pt_regs from
the original exception back to the stack back tracer, so a special
case was added to show_stack() for when kdb is active because you
generally desire to generally look at the back trace of the original
exception.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
The only way the debugger can handle a trap in inside rcu_lock,
notify_die, or atomic_notifier_call_chain without a triple fault is
to have a low level "first opportunity handler" in the int3 exception
handler.
Generally this will be something the vast majority of folks will not
need, but for those who need it, it is added as a kernel .config
option called KGDB_LOW_LEVEL_TRAP.
CC: Ingo Molnar <mingo@elte.hu>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: H. Peter Anvin <hpa@zytor.com>
CC: x86@kernel.org
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
This patch adds in the kdb PS/2 keyboard driver. This was mostly a
direct port from the original kdb where I cleaned up the code against
checkpatch.pl and added the glue to stitch it into kgdb.
This patch also enables early kdb debug via kgdbwait and the keyboard.
All the access to configure kdb using either a serial console or the
keyboard is done via kgdboc.
If you want to use only the keyboard and want to break in early you
would add to your kernel command arguments:
kgdboc=kbd kgdbwait
If you wanted serial and or the keyboard access you could use:
kgdboc=kbd,ttyS0
You can also configure kgdboc as a kernel module or at run time with
the sysfs where you can activate and deactivate kgdb.
Turn it on:
echo kbd,ttyS0 > /sys/module/kgdboc/parameters/kgdboc
Turn it off:
echo "" > /sys/module/kgdboc/parameters/kgdboc
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Reviewed-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
These are the minimum changes to the kgdb core in order to enable an
API to connect a new front end (kdb) to the debug core.
This patch introduces the dbg_kdb_mode variable controls where the
user level I/O is routed. It will be routed to the gdbstub (kgdb) or
to the kdb front end which is a simple shell available over the kgdboc
connection.
You can switch back and forth between kdb or the gdb stub mode of
operation dynamically. From gdb stub mode you can blindly type
"$3#33", or from the kdb mode you can enter "kgdb" to switch to the
gdb stub.
The logic in the debug core depends on kdb to look for the typical gdb
connection sequences and return immediately with KGDB_PASS_EVENT if a
gdb serial command sequence is detected. That should allow a
reasonably seamless transition between kdb -> gdb without leaving the
kernel exception state. The two gdb serial queries that kdb is
responsible for detecting are the "?" and "qSupported" packets.
CC: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Acked-by: Martin Hicks <mort@sgi.com>
There are many different UUID/GUID definitions in kernel, such as that
in EFI, many file systems, some drivers, etc. Every kernel components
need UUID/GUID has its own definition. This patch provides a unified
definition for UUID/GUID.
UUID is defined via typedef. This makes that UUID appears more like a
preliminary type, and makes the data type explicit (comparing with
implicit "u8 uuid[16]").
The binary representation of UUID/GUID can be little-endian (used by
EFI, etc) or big-endian (defined by RFC4122), so both is defined.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
WARN() is used in some places to report firmware or hardware bugs that
are then worked-around. These bugs do not affect the stability of the
kernel and should not set the flag for TAINT_WARN. To allow for this,
add WARN_TAINT() and WARN_TAINT_ONCE() macros that take a taint number
as argument.
Architectures that implement warnings using trap instructions instead
of calls to warn_slowpath_*() now implement __WARN_TAINT(taint)
instead of __WARN().
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Acked-by: Helge Deller <deller@gmx.de>
Tested-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
* 'x86-pat-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, pat: Update the page flags for memtype atomically instead of using memtype_lock
x86, pat: In rbt_memtype_check_insert(), update new->type only if valid
x86, pat: Migrate to rbtree only backend for pat memtype management
x86, pat: Preparatory changes in pat.c for bigger rbtree change
rbtree: Add support for augmented rbtrees
* 'core-hweight-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, hweight: Use a 32-bit popcnt for __arch_hweight32()
arch, hweight: Fix compilation errors
x86: Add optimized popcnt variants
bitops: Optimize hweight() by making use of compile-time evaluation
* 'x86-atomic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Fix LOCK_PREFIX_HERE for uniprocessor build
x86, atomic64: In selftest, distinguish x86-64 from 586+
x86-32: Fix atomic64_inc_not_zero return value convention
lib: Fix atomic64_inc_not_zero test
lib: Fix atomic64_add_unless return value convention
x86-32: Fix atomic64_add_unless return value convention
lib: Fix atomic64_add_unless test
x86: Implement atomic[64]_dec_if_positive()
lib: Only test atomic64_dec_if_positive on archs having it
x86-32: Rewrite 32-bit atomic64 functions in assembly
lib: Add self-test for atomic64_t
x86-32: Allow UP/SMP lock replacement in cmpxchg64
x86: Add support for lock prefix in alternatives
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (24 commits)
rcu: remove all rcu head initializations, except on_stack initializations
rcu head introduce rcu head init on stack
Debugobjects transition check
rcu: fix build bug in RCU_FAST_NO_HZ builds
rcu: RCU_FAST_NO_HZ must check RCU dyntick state
rcu: make SRCU usable in modules
rcu: improve the RCU CPU-stall warning documentation
rcu: reduce the number of spurious RCU_SOFTIRQ invocations
rcu: permit discontiguous cpu_possible_mask CPU numbering
rcu: improve RCU CPU stall-warning messages
rcu: print boot-time console messages if RCU configs out of ordinary
rcu: disable CPU stall warnings upon panic
rcu: enable CPU_STALL_VERBOSE by default
rcu: slim down rcutiny by removing rcu_scheduler_active and friends
rcu: refactor RCU's context-switch handling
rcu: rename rcutiny rcu_ctrlblk to rcu_sched_ctrlblk
rcu: shrink rcutiny by making synchronize_rcu_bh() be inline
rcu: fix now-bogus rcu_scheduler_active comments.
rcu: Fix bogus CONFIG_PROVE_LOCKING in comments to reflect reality.
rcu: ignore offline CPUs in last non-dyntick-idle CPU check
...
If there are no active threasd using a semaphore, it is always correct
to unqueue blocked threads. This seems to be what was intended in the
undo code.
What was done instead, was to look for a sem count of zero - this is an
impossible situation, given that at least one thread is known to be
queued on the semaphore. The code might be correct as written, but it's
hard to reason about and it's not what was intended (otherwise the goto
out would have been unconditional).
Go for checking the active count - the alternative is not worth the
headache.
Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement a basic state machine checker in the debugobjects.
This state machine checker detects races and inconsistencies within the "active"
life of a debugobject. The checker only keeps track of the current state; all
the state machine logic is kept at the object instance level.
The checker works by adding a supplementary "unsigned int astate" field to the
debug_obj structure. It keeps track of the current "active state" of the object.
The only constraints that are imposed on the states by the debugobjects system
is that:
- activation of an object sets the current active state to 0,
- deactivation of an object expects the current active state to be 0.
For the rest of the states, the state mapping is determined by the specific
object instance. Therefore, the logic keeping track of the state machine is
within the specialized instance, without any need to know about it at the
debugobject level.
The current object active state is changed by calling:
debug_object_active_state(addr, descr, expect, next)
where "expect" is the expected state and "next" is the next state to move to if
the expected state is found. A warning is generated if the expected is not
found.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: David S. Miller <davem@davemloft.net>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: akpm@linux-foundation.org
CC: mingo@elte.hu
CC: laijs@cn.fujitsu.com
CC: dipankar@in.ibm.com
CC: josh@joshtriplett.org
CC: dvhltc@us.ibm.com
CC: niv@us.ibm.com
CC: peterz@infradead.org
CC: rostedt@goodmis.org
CC: Valdis.Kletnieks@vt.edu
CC: dhowells@redhat.com
CC: eric.dumazet@gmail.com
CC: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The CPU_STALL_VERBOSE kernel configuration parameter was added to
2.6.34 to identify any preempted/blocked tasks that were preventing
the current grace period from completing when running preemptible
RCU. As is conventional for new configurations parameters, this
defaulted disabled. It is now time to enable it by default.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There is no need to disable lockdep after an RCU lockdep splat,
so remove the debug_lockdeps_off() from lockdep_rcu_dereference().
To avoid repeated lockdep splats, use a static variable in the inlined
rcu_dereference_check() and rcu_dereference_protected() macros so that
a given instance splats only once, but so that multiple instances can
be detected per boot.
This is controlled by a new config variable CONFIG_PROVE_RCU_REPEATEDLY,
which is disabled by default. This provides the normal lockdep behavior
by default, but permits people who want to find multiple RCU-lockdep
splats per boot to easily do so.
Requested-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Merge reason:
Conflict between LOCK_PREFIX_HERE and relative alternatives
pointers
Resolved Conflicts:
arch/x86/include/asm/alternative.h
arch/x86/kernel/alternative.c
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Add a missing EXPORT_SYMBOL.
I must be the first person that wants to use this function :-)
Signed-off-by: Hans Verkuil <hverkuil@xs4all.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch fixes 2 issues with the LZO decompressor:
- It doesn't handle the case where a block isn't compressed at all. In
this case, calling lzo1x_decompress_safe will fail, so we need to just
use memcpy() instead (the upstream LZO code does something similar)
- Since commit 54291362d2 ("initramfs: add
missing decompressor error check") , the decompressor return code is
checked in the init/initramfs.c The LZO decompressor didn't return the
expected value, causing the initramfs code to falsely believe a
decompression error occured
Signed-off-by: Albin Tonnerre <albin.tonnerre@free-electrons.com>
Tested-by: bert schulze <spambemyguest@googlemail.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memset() is called with the wrong address and the kernel panics.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit ef0658f3de changed precision
from int to s8.
There is existing kernel code that uses a larger precision.
An example from the audit code:
vsnprintf(...,..., " msg='%.1024s'", (char *)data);
which overflows precision and truncates to nothing.
Extending precision size fixes the audit system issue.
Other changes:
Change the size of the struct printf_spec.type from u16 to u8 so
sizeof(struct printf_spec) stays as small as possible.
Reorder the struct members so sizeof(struct printf_spec) remains 64 bits
without alignment holes.
Document the struct members a bit more.
Original-patch-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Joe Perches <joe@perches.com>
Tested-by: Justin P. Mattock <justinmattock@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: (34 commits)
cfq-iosched: Fix the incorrect timeslice accounting with forced_dispatch
loop: Update mtime when writing using aops
block: expose the statistics in blkio.time and blkio.sectors for the root cgroup
backing-dev: Handle class_create() failure
Block: Fix block/elevator.c elevator_get() off-by-one error
drbd: lc_element_by_index() never returns NULL
cciss: unlock on error path
cfq-iosched: Do not merge queues of BE and IDLE classes
cfq-iosched: Add additional blktrace log messages in CFQ for easier debugging
i2o: Remove the dangerous kobj_to_i2o_device macro
block: remove 16 bytes of padding from struct request on 64bits
cfq-iosched: fix a kbuild regression
block: make CONFIG_BLK_CGROUP visible
Remove GENHD_FL_DRIVERFS
block: Export max number of segments and max segment size in sysfs
block: Finalize conversion of block limits functions
block: Fix overrun in lcm() and move it to lib
vfs: improve writeback_inodes_wb()
paride: fix off-by-one test
drbd: fix al-to-on-disk-bitmap for 4k logical_block_size
...
radix_tree_tag_get() is not safe to use concurrently with radix_tree_tag_set()
or radix_tree_tag_clear(). The problem is that the double tag_get() in
radix_tree_tag_get():
if (!tag_get(node, tag, offset))
saw_unset_tag = 1;
if (height == 1) {
int ret = tag_get(node, tag, offset);
may see the value change due to the action of set/clear. RCU is no protection
against this as no pointers are being changed, no nodes are being replaced
according to a COW protocol - set/clear alter the node directly.
The documentation in linux/radix-tree.h, however, says that
radix_tree_tag_get() is an exception to the rule that "any function modifying
the tree or tags (...) must exclude other modifications, and exclude any
functions reading the tree".
The problem is that the next statement in radix_tree_tag_get() checks that the
tag doesn't vary over time:
BUG_ON(ret && saw_unset_tag);
This has been seen happening in FS-Cache:
https://www.redhat.com/archives/linux-cachefs/2010-April/msg00013.html
To this end, remove the BUG_ON() from radix_tree_tag_get() and note in various
comments that the value of the tag may change whilst the RCU read lock is held,
and thus that the return value of radix_tree_tag_get() may not be relied upon
unless radix_tree_tag_set/clear() and radix_tree_delete() are excluded from
running concurrently with it.
Reported-by: Romain DEGEZ <romain.degez@smartjog.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
rwsems can be used with IRQs disabled, particularily in early boot
before IRQs are enabled. Currently the spin_unlock_irq() usage in the
slow-patch will unconditionally enable interrupts and cause problems
since interrupts are not yet initialized or enabled.
This patch uses save/restore versions of IRQ spinlocks in the slowpath
to ensure interrupts are not unintentionally disabled.
Signed-off-by: Kevin Hilman <khilman@deeprootsystems.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The log of commit edaac8e316 ("ratelimit:
Fix/allow use in atomic contexts"), indicates that we want to suppress the
callback when the trylock fails.
Signed-off-by: Yong Zhang <yong.zhang@windriver.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To prevent from wrongly using the return value.
[akpm@linux-foundation.org: fix spello]
Signed-off-by: Yong Zhang <yong.zhang@windriver.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Earlier in this function we set the last byte of "buf" to NULL so we
always hit the break statement and "i" is never equal to NAME_MAX_LEN.
This patch doesn't change how the driver works but it silences a Smatch
warning and it makes it clearer that we don't write past the end of the
array.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Add support for the hardware version of the Hamming weight function,
popcnt, present in CPUs which advertize it under CPUID, Function
0x0000_0001_ECX[23]. On CPUs which don't support it, we fallback to the
default lib/hweight.c sw versions.
A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost
a 3x speedup on a F10h machine.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
LKML-Reference: <20100318112015.GC11152@aftab>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Rename the extisting runtime hweight() implementations to
__arch_hweight(), rename the compile-time versions to __const_hweight()
and then have hweight() pick between them.
Suggested-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20100318111929.GB11152@aftab>
Acked-by: H. Peter Anvin <hpa@zytor.com>
LKML-Reference: <1265028224.24455.154.camel@laptop>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
This patch marks two functions, which only get called at
initialization, as __init.
Here is also interesting, that modpost doesn't catch here the right
function name.
WARNING: lib/built-in.o(.text+0x585f): Section mismatch in reference
from the function T.506() to the variable .init.data:obj
The function T.506() references the variable __initdata obj.
This is often because T.506 lacks a __initdata annotation or the
annotation of obj is wrong.
Signed-off-by: Henrik Kretzschmar <henne@nachtwindheim.de>
LKML-Reference: <1269632315-19403-1-git-send-email-henne@nachtwindheim.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We see only one section mismatch now after thousands of randconfigs, and a
bug has been filed about that one.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
lcm() was defined to take integer-sized arguments. The supplied
arguments are multiplied, however, causing us to overflow given
sufficiently large input. That in turn led to incorrect optimal I/O
size reporting in some cases (RAID over RAID).
Switch lcm() over to unsigned long similar to gcd() and move the
function from blk-settings.c to lib.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Add support for resource windows. This is for bridge resources, i.e.,
regions where a bridge forwards transactions from the primary to the
secondary side.
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Add support for bus number resources. This is for bridges with a range of
bus numbers behind them.
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Len Brown <len.brown@intel.com>
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf: Provide generic perf_sample_data initialization
MAINTAINERS: Add Arnaldo as tools/perf/ co-maintainer
perf trace: Don't use pager if scripting
perf trace/scripting: Remove extraneous header read
perf, ARM: Modify kuser rmb() call to compile for Thumb-2
x86/stacktrace: Don't dereference bad frame pointers
perf archive: Don't try to collect files without a build-id
perf_events, x86: Fixup fixed counter constraints
perf, x86: Restrict the ANY flag
perf, x86: rename macro in ARCH_PERFMON_EVENTSEL_ENABLE
perf, x86: add some IBS macros to perf_event.h
perf, x86: make IBS macros available in perf_event.h
hw-breakpoints: Remove stub unthrottle callback
x86/hw-breakpoints: Remove the name field
perf: Remove pointless breakpoint union
perf lock: Drop the buffers multiplexing dependency
perf lock: Fix and add misc documentally things
percpu: Add __percpu sparse annotations to hw_breakpoint
inflate_fast() can do either POST INC or PRE INC on its pointers walking
the memory to decompress. Default is PRE INC.
The sout pointer offset was miscalculated in one case as the calculation
assumed sout was a char * This breaks inflate_fast() iff configured to do
POST INC.
Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 6846ee5ca6 ("zlib: Fix build of
powerpc boot wrapper") made the new optimized inflate only available on
arch's that define CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS.
This patch will again enable the optimization for all arch's by defining
our own endian independent version of unaligned access. As an added
bonus, arch's that define CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS do a
plain load instead.
Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
Cc: Anton Blanchard <anton@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Constify struct sysfs_ops.
This is part of the ops structure constification
effort started by Arjan van de Ven et al.
Benefits of this constification:
* prevents modification of data that is shared
(referenced) by many other structure instances
at runtime
* detects/prevents accidental (but not intentional)
modification attempts on archs that enforce
read-only kernel data at runtime
* potentially better optimized code as the compiler
can assume that the const data cannot be changed
* the compiler/linker move const data into .rodata
and therefore exclude them from false sharing
Signed-off-by: Emese Revfy <re.emese@gmail.com>
Acked-by: David Teigland <teigland@redhat.com>
Acked-by: Matt Domsch <Matt_Domsch@dell.com>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Acked-by: Hans J. Koch <hjk@linutronix.de>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Constify struct kset_uevent_ops.
This is part of the ops structure constification
effort started by Arjan van de Ven et al.
Benefits of this constification:
* prevents modification of data that is shared
(referenced) by many other structure instances
at runtime
* detects/prevents accidental (but not intentional)
modification attempts on archs that enforce
read-only kernel data at runtime
* potentially better optimized code as the compiler
can assume that the const data cannot be changed
* the compiler/linker move const data into .rodata
and therefore exclude them from false sharing
Signed-off-by: Emese Revfy <re.emese@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This reverts commit a069c266ae.
It turns ou that not only was it missing a case (XFS) that needed it,
but perhaps more importantly, people sometimes want to enable new
modules that they hadn't had enabled before, and if such a module uses
list_sort(), it can't easily be inserted any more.
So rather than add a "select LIST_SORT" to the XFS case, just leave it
compiled in. It's not all _that_ big, after all, and the inconvenience
isn't worth it.
Requested-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Don Mullis <don.mullis@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds separate I/O and memory specs, so we don't have to change the
field width in a shared spec, which then lets us make all the specs const
and static, since they never change.
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add clues about what the SMALL and SPECIAL flags do.
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reducing the size of struct printf_spec is a good thing because multiple
instances are commonly passed on stack.
It's possible for type to be u8 and field_width to be s8, but this is
likely small enough for now.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/joern/logfs:
[LogFS] Change magic number
[LogFS] Remove h_version field
[LogFS] Check feature flags
[LogFS] Only write journal if dirty
[LogFS] Fix bdev erases
[LogFS] Silence gcc
[LogFS] Prevent 64bit divisions in hash_index
[LogFS] Plug memory leak on error paths
[LogFS] Add MAINTAINERS entry
[LogFS] add new flash file system
Fixed up trivial conflict in lib/Kconfig, and a semantic conflict in
fs/logfs/inode.c introduced by write_inode() being changed to use
writeback_control' by commit a9185b41a4
("pass writeback_control to ->write_inode")
The function name must be followed by a space, hypen, space, and a short
description.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Build list_sort() only for configs that need it -- those that don't save
~581 bytes (i386).
Signed-off-by: Don Mullis <don.mullis@gmail.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Artem Bityutskiy <dedekind@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
XFS and UBIFS can pass long lists to list_sort(); this alternative
implementation scales better, reaching ~3x performance gain when list
length exceeds the L2 cache size.
Stand-alone program timings were run on a Core 2 duo L1=32KB L2=4MB,
gcc-4.4, with flags extracted from an Ubuntu kernel build. Object size is
581 bytes compared to 455 for Mark J. Roberts' code.
Worst case for either implementation is a list length just over a power of
two, and to roughly the same degree, so here are timing results for a
range of 2^N+1 lengths. List elements were 16 bytes each including malloc
overhead; initial order was random.
time (msec)
Tatham-Roberts
| generic-Mullis-v2
loop_count length | | ratio
4000000 2 206 294 1.427
2000000 3 176 227 1.289
1000000 5 199 172 0.864
500000 9 235 178 0.757
250000 17 243 182 0.748
125000 33 261 196 0.750
62500 65 277 209 0.754
31250 129 292 219 0.75
15625 257 317 235 0.741
7812 513 340 252 0.741
3906 1025 362 267 0.737
1953 2049 388 283 0.729 ~ L1 size
976 4097 556 323 0.580
488 8193 678 361 0.532
244 16385 773 395 0.510
122 32769 844 418 0.495
61 65537 917 454 0.495
30 131073 1128 543 0.481
15 262145 2355 869 0.369 ~ L2 size
7 524289 5597 1714 0.306
3 1048577 6218 2022 0.325
Mark's code does not actually implement the usual or generic mergesort,
but rather a variant from Simon Tatham described here:
http://www.chiark.greenend.org.uk/~sgtatham/algorithms/listsort.html
Simon's algorithm performs O(log N) passes over the entire input list,
doing merges of sublists that double in size on each pass. The generic
algorithm instead merges pairs of equal length lists as early as possible,
in recursive order. For either algorithm, the elements that extend the
list beyond power-of-two length are a special case, handled as nearly as
possible as a "rounding-up" to a full POT.
Some intuition for the locality of reference implications of merge order
may be gotten by watching this animation:
http://www.sorting-algorithms.com/merge-sort
Simon's algorithm requires only O(1) extra space rather than the generic
algorithm's O(log N), but in my non-recursive implementation the actual
O(log N) data is merely a vector of ~20 pointers, which I've put on the
stack.
Long-running list_sort() calls: If the list passed in may be long, or the
client's cmp() callback function is slow, the client's cmp() may
periodically invoke cond_resched() to voluntarily yield the CPU. All
inner loops of list_sort() call back to cmp().
Stability of the sort: distinct elements that compare equal emerge from
the sort in the same order as with Mark's code, for simple test cases. A
boot-time test is provided to verify this and other correctness
requirements.
A kernel that uses drm.ko appears to run normally with this change; I have
no suitable hardware to similarly test the use by UBIFS.
[akpm@linux-foundation.org: style tweaks, fix comment, make list_sort_test __init]
Signed-off-by: Don Mullis <don.mullis@gmail.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Artem Bityutskiy <dedekind@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Joe Perches <joe@perches.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add adds a debugfs interface and additional failure modes to LKDTM to
provide similar functionality to the provoke-crash driver submitted here:
http://lwn.net/Articles/371208/
Crashes can now be induced either through module parameters (as before)
or through the debugfs interface as in provoke-crash.
The patch also provides a new "direct" interface, where KPROBES are not
used, i.e., the crash is invoked directly upon write to the debugfs
file. When built without KPROBES configured, only this mode is available.
Signed-off-by: Simon Kagstrom <simon.kagstrom@netinsight.net>
Cc: M. Mohan Kumar <mohan@in.ibm.com>
Cc: Americo Wang <xiyou.wangcong@gmail.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the same log level for printk's in show_mem(), so that those messages
can be shown completely when using log level 6.
Signed-off-by: WANG Cong <amwang@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The x86-64 implementation of the atomics is totally different from the
i586+ implementation, which makes it quite confusing to call it
"586+". Also fix indentation, and add "i" for "i386" and "i586" as
used elsewhere in the kernel.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Luca Barbieri <luca@luca-barbieri.com>
LKML-Reference: <1267005265-27958-4-git-send-email-luca@luca-barbieri.com>
atomic64_inc_not_zero must return 1 if it perfomed the add and 0 otherwise.
The test assumed the opposite convention.
Signed-off-by: Luca Barbieri <luca@luca-barbieri.com>
LKML-Reference: <1267469749-11878-5-git-send-email-luca@luca-barbieri.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
atomic64_add_unless must return 1 if it perfomed the add and 0 otherwise.
The generic implementation did the opposite thing.
Reported-by: H. Peter Anvin <hpa@zytor.com>
Confirmed-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Luca Barbieri <luca@luca-barbieri.com>
LKML-Reference: <1267469749-11878-4-git-send-email-luca@luca-barbieri.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
atomic64_add_unless must return 1 if it perfomed the add and 0 otherwise.
The test assumed the opposite convention.
Reported-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Luca Barbieri <luca@luca-barbieri.com>
LKML-Reference: <1267469749-11878-2-git-send-email-luca@luca-barbieri.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Add support for atomic_dec_if_positive(), and
atomic64_dec_if_positive() for x86-64.
atomic64_dec_if_positive() for x86-32 was already implemented in a previous patch.
Signed-off-by: Luca Barbieri <luca@luca-barbieri.com>
LKML-Reference: <1267183361-20775-2-git-send-email-luca@luca-barbieri.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Currently atomic64_dec_if_positive() is only supported by PowerPC,
MIPS and x86-32.
Signed-off-by: Luca Barbieri <luca@luca-barbieri.com>
LKML-Reference: <1267183361-20775-1-git-send-email-luca@luca-barbieri.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (44 commits)
rcu: Fix accelerated GPs for last non-dynticked CPU
rcu: Make non-RCU_PROVE_LOCKING rcu_read_lock_sched_held() understand boot
rcu: Fix accelerated grace periods for last non-dynticked CPU
rcu: Export rcu_scheduler_active
rcu: Make rcu_read_lock_sched_held() take boot time into account
rcu: Make lockdep_rcu_dereference() message less alarmist
sched, cgroups: Fix module export
rcu: Add RCU_CPU_STALL_VERBOSE to dump detailed per-task information
rcu: Fix rcutorture mod_timer argument to delay one jiffy
rcu: Fix deadlock in TREE_PREEMPT_RCU CPU stall detection
rcu: Convert to raw_spinlocks
rcu: Stop overflowing signed integers
rcu: Use canonical URL for Mathieu's dissertation
rcu: Accelerate grace period if last non-dynticked CPU
rcu: Fix citation of Mathieu's dissertation
rcu: Documentation update for CONFIG_PROVE_RCU
security: Apply lockdep-based checking to rcu_dereference() uses
idr: Apply lockdep-based diagnostics to rcu_dereference() uses
radix-tree: Disable RCU lockdep checking in radix tree
vfs: Abstract rcu_dereference_check for files-fdtable use
...
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (88 commits)
powerpc: Fix lwsync feature fixup vs. modules on 64-bit
powerpc: Convert pmc_owner_lock to raw_spinlock
powerpc: Convert die.lock to raw_spinlock
powerpc: Convert tlbivax_lock to raw_spinlock
powerpc: Convert mpic locks to raw_spinlock
powerpc: Convert pmac_pic_lock to raw_spinlock
powerpc: Convert big_irq_lock to raw_spinlock
powerpc: Convert feature_lock to raw_spinlock
powerpc: Convert i8259_lock to raw_spinlock
powerpc: Convert beat_htab_lock to raw_spinlock
powerpc: Convert confirm_error_lock to raw_spinlock
powerpc: Convert ipic_lock to raw_spinlock
powerpc: Convert native_tlbie_lock to raw_spinlock
powerpc: Convert beatic_irq_mask_lock to raw_spinlock
powerpc: Convert nv_lock to raw_spinlock
powerpc: Convert context_lock to raw_spinlock
powerpc/85xx: Add NOR, LEDs and PIB support for MPC8568E-MDS boards
powerpc/86xx: Enable VME driver on the GE SBC610
powerpc/86xx: Enable VME driver on the GE PPC9A
powerpc/86xx: Add MSI section to GE PPC9A DTS
...
Remove pointless union in the breakpoint field of hw_perf_event.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
I've forgot to add 'perf lock' line to command-list.txt,
so users of perf could not find perf lock when they type 'perf'.
Fixing command-list.txt requires document
(tools/perf/Documentation/perf-lock.txt).
But perf lock is too much "under construction" to write a
stable document, so this is something like pseudo document for now.
And I wrote description of perf lock at help section of
CONFIG_LOCK_STAT, this will navigate users of lock trace events.
Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
LKML-Reference: <1265267295-8388-1-git-send-email-mitake@dcl.info.waseda.ac.jp>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6: (187 commits)
sh: remove dead LED code for migo-r and ms7724se
sh: ecovec build fix for CONFIG_I2C=n
sh: ecovec r-standby support
sh: ms7724se r-standby support
sh: SH-Mobile R-standby register save/restore
clocksource: Fix up a registration/IRQ race in the sh drivers.
sh: ms7724: modify scan_timing for KEYSC
sh: ms7724: Add sh_sir support
sh: mach-ecovec24: Add sh_sir support
sh: wire up SET/GET_UNALIGN_CTL.
sh: allow alignment fault mode to be configured at kernel boot.
sh: sh7724: Update FSI/SPU2 clock
sh: always enable sh7724 vpu_clk and set to 166MHz on Ecovec
sh: add sh7724 kick callback to clk_div4_table
sh: introduce struct clk_div4_table
sh: clock-cpg div4 set_rate() shift fix
sh: Turn on speculative return for SH7785 and SH7786
sh: Merge legacy and dynamic PMB modes.
sh: Use uncached I/O helpers in PMB setup.
sh: Provide uncached I/O helpers.
...
This patch adds self-test on boot code for atomic64_t.
This has been used to test the later changes in this patchset.
Signed-off-by: Luca Barbieri <luca@luca-barbieri.com>
LKML-Reference: <1267005265-27958-4-git-send-email-luca@luca-barbieri.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
idr_get_next() was accidentally not exported when added. It is about
to be used by mtdcore, which may be built as a module.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Inspection is proving insufficient to catch all RCU misuses,
which is understandable given that rcu_dereference() might be
protected by any of four different flavors of RCU (RCU, RCU-bh,
RCU-sched, and SRCU), and might also/instead be protected by any
of a number of locking primitives. It is therefore time to
enlist the aid of lockdep.
This set of patches is inspired by earlier work by Peter
Zijlstra and Thomas Gleixner, and takes the following approach:
o Set up separate lockdep classes for RCU, RCU-bh, and RCU-sched.
o Set up separate lockdep classes for each instance of SRCU.
o Create primitives that check for being in an RCU read-side
critical section. These return exact answers if lockdep is
fully enabled, but if unsure, report being in an RCU read-side
critical section. (We want to avoid false positives!)
The primitives are:
For RCU: rcu_read_lock_held(void)
For RCU-bh: rcu_read_lock_bh_held(void)
For RCU-sched: rcu_read_lock_sched_held(void)
For SRCU: srcu_read_lock_held(struct srcu_struct *sp)
o Add rcu_dereference_check(), which takes a second argument
in which one places a boolean expression based on the above
primitives and/or lockdep_is_held().
o A new kernel configuration parameter, CONFIG_PROVE_RCU, enables
rcu_dereference_check(). This depends on CONFIG_PROVE_LOCKING,
and should be quite helpful during the transition period while
CONFIG_PROVE_RCU-unaware patches are in flight.
The existing rcu_dereference() primitive does no checking, but
upcoming patches will change that.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1266887105-1528-1-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This is retry of reverted 859ddf0974
("idr: fix a critical misallocation bug") which contained two bugs.
* pa[idp->layers] should be cleared even if it's not used by
sub_alloc() because it's used by mark idr_mark_full().
* The original condition check also assigned pa[l] to p which the new
code didn't do thus leaving p pointing at the wrong layer.
Both problems have been fixed and the idr code has received good amount
testing using userland testing setup where simple bitmap allocator is
run parallel to verify the result of idr allocation.
The bug this patch fixes is caused by sub_alloc() optimization path
bypassing out-of-room condition check and restarting allocation loop
with starting value higher than maximum allowed value. For detailed
description, please read commit message of 859ddf09.
Signed-off-by: Tejun Heo <tj@kernel.org>
Based-on-patch-from: Eric Paris <eparis@redhat.com>
Reported-by: Eric Paris <eparis@redhat.com>
Tested-by: Stefan Lippers-Hollmann <s.l-h@gmx.de>
Tested-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add support for augmented rbtrees in core rbtree code.
This will be used in subsequent patches, in x86 PAT code, which needs
interval trees to efficiently keep track of PAT ranges.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
LKML-Reference: <20100210232343.GA11465@linux-os.sc.intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Commit 859ddf0974 tried to fix
misallocation bug but broke full bit marking by not clearing
pa[idp->layers] and also is causing X failures due to lookup failure
in drm code. The cause of the latter hasn't been found yet. Revert
the fix for now.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can free memory allocated with lmb_alloc() by removing it from the
list of reserved LMBs. Rework lmb_remove() to allow that possibility
and add lmb_free() which exploits it.
BenH: Removed some useless parenthesis
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Eric Paris located a bug in idr. With IDR_BITS of 6, it grows to three
layers when id 4096 is first allocated. When that happens, idr wraps
incorrectly and searches the idr array ignoring the high bits. The
following test code from Eric demonstrates the bug nicely.
#include <linux/idr.h>
#include <linux/kernel.h>
#include <linux/module.h>
static DEFINE_IDR(test_idr);
int init_module(void)
{
int ret, forty95, forty96;
void *addr;
/* add 2 entries both with 4095 as the start address */
again1:
if (!idr_pre_get(&test_idr, GFP_KERNEL))
return -ENOMEM;
ret = idr_get_new_above(&test_idr, (void *)4095, 4095, &forty95);
if (ret) {
if (ret == -EAGAIN)
goto again1;
return ret;
}
if (forty95 != 4095)
printk(KERN_ERR "hmmm, forty95=%d\n", forty95);
again2:
if (!idr_pre_get(&test_idr, GFP_KERNEL))
return -ENOMEM;
ret = idr_get_new_above(&test_idr, (void *)4096, 4095, &forty96);
if (ret) {
if (ret == -EAGAIN)
goto again2;
return ret;
}
if (forty96 != 4096)
printk(KERN_ERR "hmmm, forty96=%d\n", forty96);
/* try to find the 2 entries, noticing that 4096 broke */
addr = idr_find(&test_idr, forty95);
if ((int)addr != forty95)
printk(KERN_ERR "hmmm, after find forty95=%d addr=%d\n", forty95, (int)addr);
addr = idr_find(&test_idr, forty96);
if ((int)addr != forty96)
printk(KERN_ERR "hmmm, after find forty96=%d addr=%d\n", forty96, (int)addr);
/* really weird, the entry which should be at 4096 is actually at 0!! */
addr = idr_find(&test_idr, 0);
if ((int)addr)
printk(KERN_ERR "found an entry at id=0 for addr=%d\n", (int)addr);
idr_remove(&test_idr, forty95);
idr_remove(&test_idr, forty96);
return 0;
}
void cleanup_module(void)
{
}
MODULE_AUTHOR("Eric Paris <eparis@redhat.com>");
MODULE_DESCRIPTION("Simple idr test");
MODULE_LICENSE("GPL");
This happens because when sub_alloc() back tracks it doesn't always do it
step-by-step while the over-the-limit detection assumes step-by-step
backtracking. The logic in sub_alloc() looks like the following.
restart:
clear pa[top level + 1] for end cond detection
l = top level
while (true) {
search for empty slot at this level
if (not found) {
push id to the next possible value
l++
A: if (pa[l] is clear)
failed, return asking caller to grow the tree
if (going up 1 level gives more slots to search)
continue the while loop above with the incremented l
else
C: goto restart
}
adjust id accordingly to the found slot
if (l == 0)
return found id;
create lower level if not there yet
record pa[l] and l--
}
Test A is the fail exit condition but this assumes that failure is
propagated upwared one level at a time but the B optimization path breaks
the assumption and restarts the whole thing with a start value which is
above the possible limit with the current layers. sub_alloc() assumes the
start id value is inside the limit when called and test A is the only exit
condition check, so it ends up searching for empty slot while ignoring
high set bit.
So, for 4095->4096 test, level0 search fails but pa[1] contains a valid
pointer. However, going up 1 level wouldn't give any more empty slot so
it takes C and when the whole thing restarts nobody notices the high bit
set beyond the top level.
This patch fixes the bug by changing the fail exit condition check to full
id limit check.
Based-on-patch-from: Eric Paris <eparis@redhat.com>
Reported-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>