Merge /spare/repo/linux-2.6/

This commit is contained in:
Jeff Garzik 2005-09-14 08:01:25 -04:00
commit d7f6884ae0
4536 changed files with 277651 additions and 156288 deletions

View File

@ -18,7 +18,7 @@
Version 2, June 1991
Copyright (C) 1989, 1991 Free Software Foundation, Inc.
59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
@ -321,7 +321,7 @@ the "copyright" line and a pointer to where the full notice is found.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
Also add information on how to contact you by electronic and paper mail.

View File

@ -2423,8 +2423,7 @@ S: Toronto, Ontario
S: Canada
N: Zwane Mwaikambo
E: zwane@linuxpower.ca
W: http://function.linuxpower.ca
E: zwane@arm.linux.org.uk
D: Various driver hacking
D: Lowlevel x86 kernel hacking
D: General debugging

View File

@ -46,6 +46,8 @@ SubmittingPatches
- procedure to get a source patch included into the kernel tree.
VGA-softcursor.txt
- how to change your VGA cursor from a blinking underscore.
applying-patches.txt
- description of various trees and how to apply their patches.
arm/
- directory with info about Linux on the ARM architecture.
basic_profiling.txt
@ -275,7 +277,7 @@ tty.txt
unicode.txt
- info on the Unicode character/font mapping used in Linux.
uml/
- directory with infomation about User Mode Linux.
- directory with information about User Mode Linux.
usb/
- directory with info regarding the Universal Serial Bus.
video4linux/

View File

@ -236,6 +236,9 @@ ugly), but try to avoid excess. Instead, put the comments at the head
of the function, telling people what it does, and possibly WHY it does
it.
When commenting the kernel API functions, please use the kerneldoc format.
See the files Documentation/kernel-doc-nano-HOWTO.txt and scripts/kernel-doc
for details.
Chapter 8: You've made a mess of it

View File

@ -121,7 +121,7 @@ pool's device.
dma_addr_t addr);
This puts memory back into the pool. The pool is what was passed to
the the pool allocation routine; the cpu and dma addresses are what
the pool allocation routine; the cpu and dma addresses are what
were returned when that routine allocated the memory being freed.

View File

@ -0,0 +1,151 @@
DMA with ISA and LPC devices
============================
Pierre Ossman <drzeus@drzeus.cx>
This document describes how to do DMA transfers using the old ISA DMA
controller. Even though ISA is more or less dead today the LPC bus
uses the same DMA system so it will be around for quite some time.
Part I - Headers and dependencies
---------------------------------
To do ISA style DMA you need to include two headers:
#include <linux/dma-mapping.h>
#include <asm/dma.h>
The first is the generic DMA API used to convert virtual addresses to
physical addresses (see Documentation/DMA-API.txt for details).
The second contains the routines specific to ISA DMA transfers. Since
this is not present on all platforms make sure you construct your
Kconfig to be dependent on ISA_DMA_API (not ISA) so that nobody tries
to build your driver on unsupported platforms.
Part II - Buffer allocation
---------------------------
The ISA DMA controller has some very strict requirements on which
memory it can access so extra care must be taken when allocating
buffers.
(You usually need a special buffer for DMA transfers instead of
transferring directly to and from your normal data structures.)
The DMA-able address space is the lowest 16 MB of _physical_ memory.
Also the transfer block may not cross page boundaries (which are 64
or 128 KiB depending on which channel you use).
In order to allocate a piece of memory that satisfies all these
requirements you pass the flag GFP_DMA to kmalloc.
Unfortunately the memory available for ISA DMA is scarce so unless you
allocate the memory during boot-up it's a good idea to also pass
__GFP_REPEAT and __GFP_NOWARN to make the allocater try a bit harder.
(This scarcity also means that you should allocate the buffer as
early as possible and not release it until the driver is unloaded.)
Part III - Address translation
------------------------------
To translate the virtual address to a physical use the normal DMA
API. Do _not_ use isa_virt_to_phys() even though it does the same
thing. The reason for this is that the function isa_virt_to_phys()
will require a Kconfig dependency to ISA, not just ISA_DMA_API which
is really all you need. Remember that even though the DMA controller
has its origins in ISA it is used elsewhere.
Note: x86_64 had a broken DMA API when it came to ISA but has since
been fixed. If your arch has problems then fix the DMA API instead of
reverting to the ISA functions.
Part IV - Channels
------------------
A normal ISA DMA controller has 8 channels. The lower four are for
8-bit transfers and the upper four are for 16-bit transfers.
(Actually the DMA controller is really two separate controllers where
channel 4 is used to give DMA access for the second controller (0-3).
This means that of the four 16-bits channels only three are usable.)
You allocate these in a similar fashion as all basic resources:
extern int request_dma(unsigned int dmanr, const char * device_id);
extern void free_dma(unsigned int dmanr);
The ability to use 16-bit or 8-bit transfers is _not_ up to you as a
driver author but depends on what the hardware supports. Check your
specs or test different channels.
Part V - Transfer data
----------------------
Now for the good stuff, the actual DMA transfer. :)
Before you use any ISA DMA routines you need to claim the DMA lock
using claim_dma_lock(). The reason is that some DMA operations are
not atomic so only one driver may fiddle with the registers at a
time.
The first time you use the DMA controller you should call
clear_dma_ff(). This clears an internal register in the DMA
controller that is used for the non-atomic operations. As long as you
(and everyone else) uses the locking functions then you only need to
reset this once.
Next, you tell the controller in which direction you intend to do the
transfer using set_dma_mode(). Currently you have the options
DMA_MODE_READ and DMA_MODE_WRITE.
Set the address from where the transfer should start (this needs to
be 16-bit aligned for 16-bit transfers) and how many bytes to
transfer. Note that it's _bytes_. The DMA routines will do all the
required translation to values that the DMA controller understands.
The final step is enabling the DMA channel and releasing the DMA
lock.
Once the DMA transfer is finished (or timed out) you should disable
the channel again. You should also check get_dma_residue() to make
sure that all data has been transfered.
Example:
int flags, residue;
flags = claim_dma_lock();
clear_dma_ff();
set_dma_mode(channel, DMA_MODE_WRITE);
set_dma_addr(channel, phys_addr);
set_dma_count(channel, num_bytes);
dma_enable(channel);
release_dma_lock(flags);
while (!device_done());
flags = claim_dma_lock();
dma_disable(channel);
residue = dma_get_residue(channel);
if (residue != 0)
printk(KERN_ERR "driver: Incomplete DMA transfer!"
" %d bytes left!\n", residue);
release_dma_lock(flags);
Part VI - Suspend/resume
------------------------
It is the driver's responsibility to make sure that the machine isn't
suspended while a DMA transfer is in progress. Also, all DMA settings
are lost when the system suspends so if your driver relies on the DMA
controller being in a certain state then you have to restore these
registers upon resume.

View File

@ -116,7 +116,7 @@ filesystem. Almost.
You still need to actually journal your filesystem changes, this
is done by wrapping them into transactions. Additionally you
also need to wrap the modification of each of the the buffers
also need to wrap the modification of each of the buffers
with calls to the journal layer, so it knows what the modifications
you are actually making are. To do this use journal_start() which
returns a transaction handle.
@ -128,7 +128,7 @@ and its counterpart journal_stop(), which indicates the end of a transaction
are nestable calls, so you can reenter a transaction if necessary,
but remember you must call journal_stop() the same number of times as
journal_start() before the transaction is completed (or more accurately
leaves the the update phase). Ext3/VFS makes use of this feature to simplify
leaves the update phase). Ext3/VFS makes use of this feature to simplify
quota support.
</para>

View File

@ -8,8 +8,7 @@
<authorgroup>
<author>
<firstname>Paul</firstname>
<othername>Rusty</othername>
<firstname>Rusty</firstname>
<surname>Russell</surname>
<affiliation>
<address>
@ -20,7 +19,7 @@
</authorgroup>
<copyright>
<year>2001</year>
<year>2005</year>
<holder>Rusty Russell</holder>
</copyright>
@ -64,7 +63,7 @@
<chapter id="introduction">
<title>Introduction</title>
<para>
Welcome, gentle reader, to Rusty's Unreliable Guide to Linux
Welcome, gentle reader, to Rusty's Remarkably Unreliable Guide to Linux
Kernel Hacking. This document describes the common routines and
general requirements for kernel code: its goal is to serve as a
primer for Linux kernel development for experienced C
@ -96,13 +95,13 @@
<listitem>
<para>
not associated with any process, serving a softirq, tasklet or bh;
not associated with any process, serving a softirq or tasklet;
</para>
</listitem>
<listitem>
<para>
running in kernel space, associated with a process;
running in kernel space, associated with a process (user context);
</para>
</listitem>
@ -114,11 +113,12 @@
</itemizedlist>
<para>
There is a strict ordering between these: other than the last
category (userspace) each can only be pre-empted by those above.
For example, while a softirq is running on a CPU, no other
softirq will pre-empt it, but a hardware interrupt can. However,
any other CPUs in the system execute independently.
There is an ordering between these. The bottom two can preempt
each other, but above that is a strict hierarchy: each can only be
preempted by the ones above it. For example, while a softirq is
running on a CPU, no other softirq will preempt it, but a hardware
interrupt can. However, any other CPUs in the system execute
independently.
</para>
<para>
@ -130,10 +130,10 @@
<title>User Context</title>
<para>
User context is when you are coming in from a system call or
other trap: you can sleep, and you own the CPU (except for
interrupts) until you call <function>schedule()</function>.
In other words, user context (unlike userspace) is not pre-emptable.
User context is when you are coming in from a system call or other
trap: like userspace, you can be preempted by more important tasks
and by interrupts. You can sleep, by calling
<function>schedule()</function>.
</para>
<note>
@ -153,7 +153,7 @@
<caution>
<para>
Beware that if you have interrupts or bottom halves disabled
Beware that if you have preemption or softirqs disabled
(see below), <function>in_interrupt()</function> will return a
false positive.
</para>
@ -168,10 +168,10 @@
<hardware>keyboard</hardware> are examples of real
hardware which produce interrupts at any time. The kernel runs
interrupt handlers, which services the hardware. The kernel
guarantees that this handler is never re-entered: if another
guarantees that this handler is never re-entered: if the same
interrupt arrives, it is queued (or dropped). Because it
disables interrupts, this handler has to be fast: frequently it
simply acknowledges the interrupt, marks a `software interrupt'
simply acknowledges the interrupt, marks a 'software interrupt'
for execution and exits.
</para>
@ -188,60 +188,52 @@
</sect1>
<sect1 id="basics-softirqs">
<title>Software Interrupt Context: Bottom Halves, Tasklets, softirqs</title>
<title>Software Interrupt Context: Softirqs and Tasklets</title>
<para>
Whenever a system call is about to return to userspace, or a
hardware interrupt handler exits, any `software interrupts'
hardware interrupt handler exits, any 'software interrupts'
which are marked pending (usually by hardware interrupts) are
run (<filename>kernel/softirq.c</filename>).
</para>
<para>
Much of the real interrupt handling work is done here. Early in
the transition to <acronym>SMP</acronym>, there were only `bottom
the transition to <acronym>SMP</acronym>, there were only 'bottom
halves' (BHs), which didn't take advantage of multiple CPUs. Shortly
after we switched from wind-up computers made of match-sticks and snot,
we abandoned this limitation.
we abandoned this limitation and switched to 'softirqs'.
</para>
<para>
<filename class="headerfile">include/linux/interrupt.h</filename> lists the
different BH's. No matter how many CPUs you have, no two BHs will run at
the same time. This made the transition to SMP simpler, but sucks hard for
scalable performance. A very important bottom half is the timer
BH (<filename class="headerfile">include/linux/timer.h</filename>): you
can register to have it call functions for you in a given length of time.
different softirqs. A very important softirq is the
timer softirq (<filename
class="headerfile">include/linux/timer.h</filename>): you can
register to have it call functions for you in a given length of
time.
</para>
<para>
2.3.43 introduced softirqs, and re-implemented the (now
deprecated) BHs underneath them. Softirqs are fully-SMP
versions of BHs: they can run on as many CPUs at once as
required. This means they need to deal with any races in shared
data using their own locks. A bitmask is used to keep track of
which are enabled, so the 32 available softirqs should not be
used up lightly. (<emphasis>Yes</emphasis>, people will
notice).
</para>
<para>
tasklets (<filename class="headerfile">include/linux/interrupt.h</filename>)
are like softirqs, except they are dynamically-registrable (meaning you
can have as many as you want), and they also guarantee that any tasklet
will only run on one CPU at any time, although different tasklets can
run simultaneously (unlike different BHs).
Softirqs are often a pain to deal with, since the same softirq
will run simultaneously on more than one CPU. For this reason,
tasklets (<filename
class="headerfile">include/linux/interrupt.h</filename>) are more
often used: they are dynamically-registrable (meaning you can have
as many as you want), and they also guarantee that any tasklet
will only run on one CPU at any time, although different tasklets
can run simultaneously.
</para>
<caution>
<para>
The name `tasklet' is misleading: they have nothing to do with `tasks',
The name 'tasklet' is misleading: they have nothing to do with 'tasks',
and probably more to do with some bad vodka Alexey Kuznetsov had at the
time.
</para>
</caution>
<para>
You can tell you are in a softirq (or bottom half, or tasklet)
You can tell you are in a softirq (or tasklet)
using the <function>in_softirq()</function> macro
(<filename class="headerfile">include/linux/interrupt.h</filename>).
</para>
@ -288,11 +280,10 @@
<term>A rigid stack limit</term>
<listitem>
<para>
The kernel stack is about 6K in 2.2 (for most
architectures: it's about 14K on the Alpha), and shared
with interrupts so you can't use it all. Avoid deep
recursion and huge local arrays on the stack (allocate
them dynamically instead).
Depending on configuration options the kernel stack is about 3K to 6K for most 32-bit architectures: it's
about 14K on most 64-bit archs, and often shared with interrupts
so you can't use it all. Avoid deep recursion and huge local
arrays on the stack (allocate them dynamically instead).
</para>
</listitem>
</varlistentry>
@ -339,7 +330,7 @@ asmlinkage long sys_mycall(int arg)
<para>
If all your routine does is read or write some parameter, consider
implementing a <function>sysctl</function> interface instead.
implementing a <function>sysfs</function> interface instead.
</para>
<para>
@ -417,7 +408,10 @@ cond_resched(); /* Will sleep */
</para>
<para>
You will eventually lock up your box if you break these rules.
You should always compile your kernel
<symbol>CONFIG_DEBUG_SPINLOCK_SLEEP</symbol> on, and it will warn
you if you break these rules. If you <emphasis>do</emphasis> break
the rules, you will eventually lock up your box.
</para>
<para>
@ -515,8 +509,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
success).
</para>
</caution>
[Yes, this moronic interface makes me cringe. Please submit a
patch and become my hero --RR.]
[Yes, this moronic interface makes me cringe. The flamewar comes up every year or so. --RR.]
</para>
<para>
The functions may sleep implicitly. This should never be called
@ -587,10 +580,11 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
</variablelist>
<para>
If you see a <errorname>kmem_grow: Called nonatomically from int
</errorname> warning message you called a memory allocation function
from interrupt context without <constant>GFP_ATOMIC</constant>.
You should really fix that. Run, don't walk.
If you see a <errorname>sleeping function called from invalid
context</errorname> warning message, then maybe you called a
sleeping allocation function from interrupt context without
<constant>GFP_ATOMIC</constant>. You should really fix that.
Run, don't walk.
</para>
<para>
@ -639,16 +633,16 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
</sect1>
<sect1 id="routines-udelay">
<title><function>udelay()</function>/<function>mdelay()</function>
<title><function>mdelay()</function>/<function>udelay()</function>
<filename class="headerfile">include/asm/delay.h</filename>
<filename class="headerfile">include/linux/delay.h</filename>
</title>
<para>
The <function>udelay()</function> function can be used for small pauses.
Do not use large values with <function>udelay()</function> as you risk
The <function>udelay()</function> and <function>ndelay()</function> functions can be used for small pauses.
Do not use large values with them as you risk
overflow - the helper function <function>mdelay()</function> is useful
here, or even consider <function>schedule_timeout()</function>.
here, or consider <function>msleep()</function>.
</para>
</sect1>
@ -698,8 +692,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
These routines disable soft interrupts on the local CPU, and
restore them. They are reentrant; if soft interrupts were
disabled before, they will still be disabled after this pair
of functions has been called. They prevent softirqs, tasklets
and bottom halves from running on the current CPU.
of functions has been called. They prevent softirqs and tasklets
from running on the current CPU.
</para>
</sect1>
@ -708,10 +702,16 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
<filename class="headerfile">include/asm/smp.h</filename></title>
<para>
<function>smp_processor_id()</function> returns the current
processor number, between 0 and <symbol>NR_CPUS</symbol> (the
maximum number of CPUs supported by Linux, currently 32). These
values are not necessarily continuous.
<function>get_cpu()</function> disables preemption (so you won't
suddenly get moved to another CPU) and returns the current
processor number, between 0 and <symbol>NR_CPUS</symbol>. Note
that the CPU numbers are not necessarily continuous. You return
it again with <function>put_cpu()</function> when you are done.
</para>
<para>
If you know you cannot be preempted by another task (ie. you are
in interrupt context, or have preemption disabled) you can use
smp_processor_id().
</para>
</sect1>
@ -722,19 +722,14 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
<para>
After boot, the kernel frees up a special section; functions
marked with <type>__init</type> and data structures marked with
<type>__initdata</type> are dropped after boot is complete (within
modules this directive is currently ignored). <type>__exit</type>
<type>__initdata</type> are dropped after boot is complete: similarly
modules discard this memory after initialization. <type>__exit</type>
is used to declare a function which is only required on exit: the
function will be dropped if this file is not compiled as a module.
See the header file for use. Note that it makes no sense for a function
marked with <type>__init</type> to be exported to modules with
<function>EXPORT_SYMBOL()</function> - this will break.
</para>
<para>
Static data structures marked as <type>__initdata</type> must be initialised
(as opposed to ordinary static data which is zeroed BSS) and cannot be
<type>const</type>.
</para>
</sect1>
@ -762,9 +757,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
<para>
The function can return a negative error number to cause
module loading to fail (unfortunately, this has no effect if
the module is compiled into the kernel). For modules, this is
called in user context, with interrupts enabled, and the
kernel lock held, so it can sleep.
the module is compiled into the kernel). This function is
called in user context with interrupts enabled, so it can sleep.
</para>
</sect1>
@ -779,6 +773,34 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
reached zero. This function can also sleep, but cannot fail:
everything must be cleaned up by the time it returns.
</para>
<para>
Note that this macro is optional: if it is not present, your
module will not be removable (except for 'rmmod -f').
</para>
</sect1>
<sect1 id="routines-module-use-counters">
<title> <function>try_module_get()</function>/<function>module_put()</function>
<filename class="headerfile">include/linux/module.h</filename></title>
<para>
These manipulate the module usage count, to protect against
removal (a module also can't be removed if another module uses one
of its exported symbols: see below). Before calling into module
code, you should call <function>try_module_get()</function> on
that module: if it fails, then the module is being removed and you
should act as if it wasn't there. Otherwise, you can safely enter
the module, and call <function>module_put()</function> when you're
finished.
</para>
<para>
Most registerable structures have an
<structfield>owner</structfield> field, such as in the
<structname>file_operations</structname> structure. Set this field
to the macro <symbol>THIS_MODULE</symbol>.
</para>
</sect1>
<!-- add info on new-style module refcounting here -->
@ -821,7 +843,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
There is a macro to do this:
<function>wait_event_interruptible()</function>
<filename class="headerfile">include/linux/sched.h</filename> The
<filename class="headerfile">include/linux/wait.h</filename> The
first argument is the wait queue head, and the second is an
expression which is evaluated; the macro returns
<returnvalue>0</returnvalue> when this expression is true, or
@ -847,10 +869,11 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
<para>
Call <function>wake_up()</function>
<filename class="headerfile">include/linux/sched.h</filename>;,
<filename class="headerfile">include/linux/wait.h</filename>;,
which will wake up every process in the queue. The exception is
if one has <constant>TASK_EXCLUSIVE</constant> set, in which case
the remainder of the queue will not be woken.
the remainder of the queue will not be woken. There are other variants
of this basic function available in the same header.
</para>
</sect1>
</chapter>
@ -863,7 +886,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
first class of operations work on <type>atomic_t</type>
<filename class="headerfile">include/asm/atomic.h</filename>; this
contains a signed integer (at least 24 bits long), and you must use
contains a signed integer (at least 32 bits long), and you must use
these functions to manipulate or read atomic_t variables.
<function>atomic_read()</function> and
<function>atomic_set()</function> get and set the counter,
@ -882,13 +905,12 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
<para>
Note that these functions are slower than normal arithmetic, and
so should not be used unnecessarily. On some platforms they
are much slower, like 32-bit Sparc where they use a spinlock.
so should not be used unnecessarily.
</para>
<para>
The second class of atomic operations is atomic bit operations on a
<type>long</type>, defined in
The second class of atomic operations is atomic bit operations on an
<type>unsigned long</type>, defined in
<filename class="headerfile">include/linux/bitops.h</filename>. These
operations generally take a pointer to the bit pattern, and a bit
@ -899,7 +921,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
<function>test_and_clear_bit()</function> and
<function>test_and_change_bit()</function> do the same thing,
except return true if the bit was previously set; these are
particularly useful for very simple locking.
particularly useful for atomically setting flags.
</para>
<para>
@ -907,12 +929,6 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
than BITS_PER_LONG. The resulting behavior is strange on big-endian
platforms though so it is a good idea not to do this.
</para>
<para>
Note that the order of bits depends on the architecture, and in
particular, the bitfield passed to these operations must be at
least as large as a <type>long</type>.
</para>
</chapter>
<chapter id="symbols">
@ -932,11 +948,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
<filename class="headerfile">include/linux/module.h</filename></title>
<para>
This is the classic method of exporting a symbol, and it works
for both modules and non-modules. In the kernel all these
declarations are often bundled into a single file to help
genksyms (which searches source files for these declarations).
See the comment on genksyms and Makefiles below.
This is the classic method of exporting a symbol: dynamically
loaded modules will be able to use the symbol as normal.
</para>
</sect1>
@ -949,7 +962,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
symbols exported by <function>EXPORT_SYMBOL_GPL()</function> can
only be seen by modules with a
<function>MODULE_LICENSE()</function> that specifies a GPL
compatible license.
compatible license. It implies that the function is considered
an internal implementation issue, and not really an interface.
</para>
</sect1>
</chapter>
@ -962,12 +976,13 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
<filename class="headerfile">include/linux/list.h</filename></title>
<para>
There are three sets of linked-list routines in the kernel
headers, but this one seems to be winning out (and Linus has
used it). If you don't have some particular pressing need for
a single list, it's a good choice. In fact, I don't care
whether it's a good choice or not, just use it so we can get
rid of the others.
There used to be three sets of linked-list routines in the kernel
headers, but this one is the winner. If you don't have some
particular pressing need for a single list, it's a good choice.
</para>
<para>
In particular, <function>list_for_each_entry</function> is useful.
</para>
</sect1>
@ -979,14 +994,13 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
convention, and return <returnvalue>0</returnvalue> for success,
and a negative error number
(eg. <returnvalue>-EFAULT</returnvalue>) for failure. This can be
unintuitive at first, but it's fairly widespread in the networking
code, for example.
unintuitive at first, but it's fairly widespread in the kernel.
</para>
<para>
The filesystem code uses <function>ERR_PTR()</function>
Using <function>ERR_PTR()</function>
<filename class="headerfile">include/linux/fs.h</filename>; to
<filename class="headerfile">include/linux/err.h</filename>; to
encode a negative error number into a pointer, and
<function>IS_ERR()</function> and <function>PTR_ERR()</function>
to get it back out again: avoids a separate pointer parameter for
@ -1040,7 +1054,7 @@ static struct block_device_operations opt_fops = {
supported, due to lack of general use, but the following are
considered standard (see the GCC info page section "C
Extensions" for more details - Yes, really the info page, the
man page is only a short summary of the stuff in info):
man page is only a short summary of the stuff in info).
</para>
<itemizedlist>
<listitem>
@ -1091,7 +1105,7 @@ static struct block_device_operations opt_fops = {
</listitem>
<listitem>
<para>
Function names as strings (__FUNCTION__)
Function names as strings (__func__).
</para>
</listitem>
<listitem>
@ -1164,63 +1178,35 @@ static struct block_device_operations opt_fops = {
<listitem>
<para>
Usually you want a configuration option for your kernel hack.
Edit <filename>Config.in</filename> in the appropriate directory
(but under <filename>arch/</filename> it's called
<filename>config.in</filename>). The Config Language used is not
bash, even though it looks like bash; the safe way is to use only
the constructs that you already see in
<filename>Config.in</filename> files (see
<filename>Documentation/kbuild/kconfig-language.txt</filename>).
It's good to run "make xconfig" at least once to test (because
it's the only one with a static parser).
</para>
<para>
Variables which can be Y or N use <type>bool</type> followed by a
tagline and the config define name (which must start with
CONFIG_). The <type>tristate</type> function is the same, but
allows the answer M (which defines
<symbol>CONFIG_foo_MODULE</symbol> in your source, instead of
<symbol>CONFIG_FOO</symbol>) if <symbol>CONFIG_MODULES</symbol>
is enabled.
Edit <filename>Kconfig</filename> in the appropriate directory.
The Config language is simple to use by cut and paste, and there's
complete documentation in
<filename>Documentation/kbuild/kconfig-language.txt</filename>.
</para>
<para>
You may well want to make your CONFIG option only visible if
<symbol>CONFIG_EXPERIMENTAL</symbol> is enabled: this serves as a
warning to users. There many other fancy things you can do: see
the various <filename>Config.in</filename> files for ideas.
the various <filename>Kconfig</filename> files for ideas.
</para>
<para>
In your description of the option, make sure you address both the
expert user and the user who knows nothing about your feature. Mention
incompatibilities and issues here. <emphasis> Definitely
</emphasis> end your description with <quote> if in doubt, say N
</quote> (or, occasionally, `Y'); this is for people who have no
idea what you are talking about.
</para>
</listitem>
<listitem>
<para>
Edit the <filename>Makefile</filename>: the CONFIG variables are
exported here so you can conditionalize compilation with `ifeq'.
If your file exports symbols then add the names to
<varname>export-objs</varname> so that genksyms will find them.
<caution>
<para>
There is a restriction on the kernel build system that objects
which export symbols must have globally unique names.
If your object does not have a globally unique name then the
standard fix is to move the
<function>EXPORT_SYMBOL()</function> statements to their own
object with a unique name.
This is why several systems have separate exporting objects,
usually suffixed with ksyms.
</para>
</caution>
</para>
</listitem>
<listitem>
<para>
Document your option in Documentation/Configure.help. Mention
incompatibilities and issues here. <emphasis> Definitely
</emphasis> end your description with <quote> if in doubt, say N
</quote> (or, occasionally, `Y'); this is for people who have no
idea what you are talking about.
exported here so you can usually just add a "obj-$(CONFIG_xxx) +=
xxx.o" line. The syntax is documented in
<filename>Documentation/kbuild/makefiles.txt</filename>.
</para>
</listitem>
@ -1253,20 +1239,12 @@ static struct block_device_operations opt_fops = {
</para>
<para>
<filename>include/linux/brlock.h:</filename>
<filename>include/asm-i386/delay.h:</filename>
</para>
<programlisting>
extern inline void br_read_lock (enum brlock_indices idx)
{
/*
* This causes a link-time bug message if an
* invalid index is used:
*/
if (idx >= __BR_END)
__br_lock_usage_bug();
read_lock(&amp;__brlock_array[smp_processor_id()][idx]);
}
#define ndelay(n) (__builtin_constant_p(n) ? \
((n) > 20000 ? __bad_ndelay() : __const_udelay((n) * 5ul)) : \
__ndelay(n))
</programlisting>
<para>

View File

@ -96,7 +96,7 @@
<chapter id="pubfunctions">
<title>Public Functions Provided</title>
!Earch/i386/kernel/mca.c
!Edrivers/mca/mca-legacy.c
</chapter>
<chapter id="dmafunctions">

View File

@ -841,7 +841,7 @@ usbdev_ioctl (int fd, int ifno, unsigned request, void *param)
File modification time is not updated by this request.
</para><para>
Those struct members are from some interface descriptor
applying to the the current configuration.
applying to the current configuration.
The interface number is the bInterfaceNumber value, and
the altsetting number is the bAlternateSetting value.
(This resets each endpoint in the interface.)

View File

@ -605,12 +605,13 @@ is in the ipmi_poweroff module. When the system requests a powerdown,
it will send the proper IPMI commands to do this. This is supported on
several platforms.
There is a module parameter named "poweroff_control" that may either be zero
(do a power down) or 2 (do a power cycle, power the system off, then power
it on in a few seconds). Setting ipmi_poweroff.poweroff_control=x will do
the same thing on the kernel command line. The parameter is also available
via the proc filesystem in /proc/ipmi/poweroff_control. Note that if the
system does not support power cycling, it will always to the power off.
There is a module parameter named "poweroff_powercycle" that may
either be zero (do a power down) or non-zero (do a power cycle, power
the system off, then power it on in a few seconds). Setting
ipmi_poweroff.poweroff_control=x will do the same thing on the kernel
command line. The parameter is also available via the proc filesystem
in /proc/sys/dev/ipmi/poweroff_powercycle. Note that if the system
does not support power cycling, it will always do the power off.
Note that if you have ACPI enabled, the system will prefer using ACPI to
power off.

View File

@ -430,7 +430,7 @@ which may result in system hang. The software driver of specific
MSI-capable hardware is responsible for whether calling
pci_enable_msi or not. A return of zero indicates the kernel
successfully initializes the MSI/MSI-X capability structure of the
device funtion. The device function is now running on MSI/MSI-X mode.
device function. The device function is now running on MSI/MSI-X mode.
5.6 How to tell whether MSI/MSI-X is enabled on device function

View File

@ -0,0 +1,112 @@
Using RCU to Protect Dynamic NMI Handlers
Although RCU is usually used to protect read-mostly data structures,
it is possible to use RCU to provide dynamic non-maskable interrupt
handlers, as well as dynamic irq handlers. This document describes
how to do this, drawing loosely from Zwane Mwaikambo's NMI-timer
work in "arch/i386/oprofile/nmi_timer_int.c" and in
"arch/i386/kernel/traps.c".
The relevant pieces of code are listed below, each followed by a
brief explanation.
static int dummy_nmi_callback(struct pt_regs *regs, int cpu)
{
return 0;
}
The dummy_nmi_callback() function is a "dummy" NMI handler that does
nothing, but returns zero, thus saying that it did nothing, allowing
the NMI handler to take the default machine-specific action.
static nmi_callback_t nmi_callback = dummy_nmi_callback;
This nmi_callback variable is a global function pointer to the current
NMI handler.
fastcall void do_nmi(struct pt_regs * regs, long error_code)
{
int cpu;
nmi_enter();
cpu = smp_processor_id();
++nmi_count(cpu);
if (!rcu_dereference(nmi_callback)(regs, cpu))
default_do_nmi(regs);
nmi_exit();
}
The do_nmi() function processes each NMI. It first disables preemption
in the same way that a hardware irq would, then increments the per-CPU
count of NMIs. It then invokes the NMI handler stored in the nmi_callback
function pointer. If this handler returns zero, do_nmi() invokes the
default_do_nmi() function to handle a machine-specific NMI. Finally,
preemption is restored.
Strictly speaking, rcu_dereference() is not needed, since this code runs
only on i386, which does not need rcu_dereference() anyway. However,
it is a good documentation aid, particularly for anyone attempting to
do something similar on Alpha.
Quick Quiz: Why might the rcu_dereference() be necessary on Alpha,
given that the code referenced by the pointer is read-only?
Back to the discussion of NMI and RCU...
void set_nmi_callback(nmi_callback_t callback)
{
rcu_assign_pointer(nmi_callback, callback);
}
The set_nmi_callback() function registers an NMI handler. Note that any
data that is to be used by the callback must be initialized up -before-
the call to set_nmi_callback(). On architectures that do not order
writes, the rcu_assign_pointer() ensures that the NMI handler sees the
initialized values.
void unset_nmi_callback(void)
{
rcu_assign_pointer(nmi_callback, dummy_nmi_callback);
}
This function unregisters an NMI handler, restoring the original
dummy_nmi_handler(). However, there may well be an NMI handler
currently executing on some other CPU. We therefore cannot free
up any data structures used by the old NMI handler until execution
of it completes on all other CPUs.
One way to accomplish this is via synchronize_sched(), perhaps as
follows:
unset_nmi_callback();
synchronize_sched();
kfree(my_nmi_data);
This works because synchronize_sched() blocks until all CPUs complete
any preemption-disabled segments of code that they were executing.
Since NMI handlers disable preemption, synchronize_sched() is guaranteed
not to return until all ongoing NMI handlers exit. It is therefore safe
to free up the handler's data as soon as synchronize_sched() returns.
Answer to Quick Quiz
Why might the rcu_dereference() be necessary on Alpha, given
that the code referenced by the pointer is read-only?
Answer: The caller to set_nmi_callback() might well have
initialized some data that is to be used by the
new NMI handler. In this case, the rcu_dereference()
would be needed, because otherwise a CPU that received
an NMI just after the new handler was set might see
the pointer to the new NMI handler, but the old
pre-initialized version of the handler's data.
More important, the rcu_dereference() makes it clear
to someone reading the code that the pointer is being
protected by RCU.

View File

@ -2,7 +2,8 @@ Read the F-ing Papers!
This document describes RCU-related publications, and is followed by
the corresponding bibtex entries.
the corresponding bibtex entries. A number of the publications may
be found at http://www.rdrop.com/users/paulmck/RCU/.
The first thing resembling RCU was published in 1980, when Kung and Lehman
[Kung80] recommended use of a garbage collector to defer destruction
@ -113,6 +114,10 @@ describing how to make RCU safe for soft-realtime applications [Sarma04c],
and a paper describing SELinux performance with RCU [JamesMorris04b].
2005 has seen further adaptation of RCU to realtime use, permitting
preemption of RCU realtime critical sections [PaulMcKenney05a,
PaulMcKenney05b].
Bibtex Entries
@article{Kung80
@ -410,3 +415,32 @@ Oregon Health and Sciences University"
\url{http://www.livejournal.com/users/james_morris/2153.html}
[Viewed December 10, 2004]"
}
@unpublished{PaulMcKenney05a
,Author="Paul E. McKenney"
,Title="{[RFC]} {RCU} and {CONFIG\_PREEMPT\_RT} progress"
,month="May"
,year="2005"
,note="Available:
\url{http://lkml.org/lkml/2005/5/9/185}
[Viewed May 13, 2005]"
,annotation="
First publication of working lock-based deferred free patches
for the CONFIG_PREEMPT_RT environment.
"
}
@conference{PaulMcKenney05b
,Author="Paul E. McKenney and Dipankar Sarma"
,Title="Towards Hard Realtime Response from the Linux Kernel on SMP Hardware"
,Booktitle="linux.conf.au 2005"
,month="April"
,year="2005"
,address="Canberra, Australia"
,note="Available:
\url{http://www.rdrop.com/users/paulmck/RCU/realtimeRCU.2005.04.23a.pdf}
[Viewed May 13, 2005]"
,annotation="
Realtime turns into making RCU yet more realtime friendly.
"
}

View File

@ -8,7 +8,7 @@ is that since there is only one CPU, it should not be necessary to
wait for anything else to get done, since there are no other CPUs for
anything else to be happening on. Although this approach will -sort- -of-
work a surprising amount of the time, it is a very bad idea in general.
This document presents two examples that demonstrate exactly how bad an
This document presents three examples that demonstrate exactly how bad an
idea this is.
@ -26,6 +26,9 @@ from softirq, the list scan would find itself referencing a newly freed
element B. This situation can greatly decrease the life expectancy of
your kernel.
This same problem can occur if call_rcu() is invoked from a hardware
interrupt handler.
Example 2: Function-Call Fatality
@ -44,21 +47,73 @@ its arguments would cause it to fail to make the fundamental guarantee
underlying RCU, namely that call_rcu() defers invoking its arguments until
all RCU read-side critical sections currently executing have completed.
Quick Quiz: why is it -not- legal to invoke synchronize_rcu() in
Quick Quiz #1: why is it -not- legal to invoke synchronize_rcu() in
this case?
Example 3: Death by Deadlock
Suppose that call_rcu() is invoked while holding a lock, and that the
callback function must acquire this same lock. In this case, if
call_rcu() were to directly invoke the callback, the result would
be self-deadlock.
In some cases, it would possible to restructure to code so that
the call_rcu() is delayed until after the lock is released. However,
there are cases where this can be quite ugly:
1. If a number of items need to be passed to call_rcu() within
the same critical section, then the code would need to create
a list of them, then traverse the list once the lock was
released.
2. In some cases, the lock will be held across some kernel API,
so that delaying the call_rcu() until the lock is released
requires that the data item be passed up via a common API.
It is far better to guarantee that callbacks are invoked
with no locks held than to have to modify such APIs to allow
arbitrary data items to be passed back up through them.
If call_rcu() directly invokes the callback, painful locking restrictions
or API changes would be required.
Quick Quiz #2: What locking restriction must RCU callbacks respect?
Summary
Permitting call_rcu() to immediately invoke its arguments or permitting
synchronize_rcu() to immediately return breaks RCU, even on a UP system.
So do not do it! Even on a UP system, the RCU infrastructure -must-
respect grace periods.
respect grace periods, and -must- invoke callbacks from a known environment
in which no locks are held.
Answer to Quick Quiz
Answer to Quick Quiz #1:
Why is it -not- legal to invoke synchronize_rcu() in this case?
The calling function is scanning an RCU-protected linked list, and
is therefore within an RCU read-side critical section. Therefore,
the called function has been invoked within an RCU read-side critical
section, and is not permitted to block.
Because the calling function is scanning an RCU-protected linked
list, and is therefore within an RCU read-side critical section.
Therefore, the called function has been invoked within an RCU
read-side critical section, and is not permitted to block.
Answer to Quick Quiz #2:
What locking restriction must RCU callbacks respect?
Any lock that is acquired within an RCU callback must be
acquired elsewhere using an _irq variant of the spinlock
primitive. For example, if "mylock" is acquired by an
RCU callback, then a process-context acquisition of this
lock must use something like spin_lock_irqsave() to
acquire the lock.
If the process-context code were to simply use spin_lock(),
then, since RCU callbacks can be invoked from softirq context,
the callback might be called from a softirq that interrupted
the process-context critical section. This would result in
self-deadlock.
This restriction might seem gratuitous, since very few RCU
callbacks acquire locks directly. However, a great many RCU
callbacks do acquire locks -indirectly-, for example, via
the kfree() primitive.

View File

@ -43,6 +43,10 @@ over a rather long period of time, but improvements are always welcome!
rcu_read_lock_bh()) in the read-side critical sections,
and are also an excellent aid to readability.
As a rough rule of thumb, any dereference of an RCU-protected
pointer must be covered by rcu_read_lock() or rcu_read_lock_bh()
or by the appropriate update-side lock.
3. Does the update code tolerate concurrent accesses?
The whole point of RCU is to permit readers to run without
@ -90,7 +94,11 @@ over a rather long period of time, but improvements are always welcome!
The rcu_dereference() primitive is used by the various
"_rcu()" list-traversal primitives, such as the
list_for_each_entry_rcu().
list_for_each_entry_rcu(). Note that it is perfectly
legal (if redundant) for update-side code to use
rcu_dereference() and the "_rcu()" list-traversal
primitives. This is particularly useful in code
that is common to readers and updaters.
b. If the list macros are being used, the list_add_tail_rcu()
and list_add_rcu() primitives must be used in order
@ -150,16 +158,9 @@ over a rather long period of time, but improvements are always welcome!
Use of the _rcu() list-traversal primitives outside of an
RCU read-side critical section causes no harm other than
a slight performance degradation on Alpha CPUs and some
confusion on the part of people trying to read the code.
Another way of thinking of this is "If you are holding the
lock that prevents the data structure from changing, why do
you also need RCU-based protection?" That said, there may
well be situations where use of the _rcu() list-traversal
primitives while the update-side lock is held results in
simpler and more maintainable code. The jury is still out
on this question.
a slight performance degradation on Alpha CPUs. It can
also be quite helpful in reducing code bloat when common
code is shared between readers and updaters.
10. Conversely, if you are in an RCU read-side critical section,
you -must- use the "_rcu()" variants of the list macros.

View File

@ -64,6 +64,54 @@ o I hear that RCU is patented? What is with that?
Of these, one was allowed to lapse by the assignee, and the
others have been contributed to the Linux kernel under GPL.
o I hear that RCU needs work in order to support realtime kernels?
Yes, work in progress.
o Where can I find more information on RCU?
See the RTFP.txt file in this directory.
Or point your browser at http://www.rdrop.com/users/paulmck/RCU/.
o What are all these files in this directory?
NMI-RCU.txt
Describes how to use RCU to implement dynamic
NMI handlers, which can be revectored on the fly,
without rebooting.
RTFP.txt
List of RCU-related publications and web sites.
UP.txt
Discussion of RCU usage in UP kernels.
arrayRCU.txt
Describes how to use RCU to protect arrays, with
resizeable arrays whose elements reference other
data structures being of the most interest.
checklist.txt
Lists things to check for when inspecting code that
uses RCU.
listRCU.txt
Describes how to use RCU to protect linked lists.
This is the simplest and most common use of RCU
in the Linux kernel.
rcu.txt
You are reading it!
whatisRCU.txt
Overview of how the RCU implementation works. Along
the way, presents a conceptual view of RCU.

View File

@ -0,0 +1,74 @@
Refcounter framework for elements of lists/arrays protected by
RCU.
Refcounting on elements of lists which are protected by traditional
reader/writer spinlocks or semaphores are straight forward as in:
1. 2.
add() search_and_reference()
{ {
alloc_object read_lock(&list_lock);
... search_for_element
atomic_set(&el->rc, 1); atomic_inc(&el->rc);
write_lock(&list_lock); ...
add_element read_unlock(&list_lock);
... ...
write_unlock(&list_lock); }
}
3. 4.
release_referenced() delete()
{ {
... write_lock(&list_lock);
atomic_dec(&el->rc, relfunc) ...
... delete_element
} write_unlock(&list_lock);
...
if (atomic_dec_and_test(&el->rc))
kfree(el);
...
}
If this list/array is made lock free using rcu as in changing the
write_lock in add() and delete() to spin_lock and changing read_lock
in search_and_reference to rcu_read_lock(), the rcuref_get in
search_and_reference could potentially hold reference to an element which
has already been deleted from the list/array. rcuref_lf_get_rcu takes
care of this scenario. search_and_reference should look as;
1. 2.
add() search_and_reference()
{ {
alloc_object rcu_read_lock();
... search_for_element
atomic_set(&el->rc, 1); if (rcuref_inc_lf(&el->rc)) {
write_lock(&list_lock); rcu_read_unlock();
return FAIL;
add_element }
... ...
write_unlock(&list_lock); rcu_read_unlock();
} }
3. 4.
release_referenced() delete()
{ {
... write_lock(&list_lock);
rcuref_dec(&el->rc, relfunc) ...
... delete_element
} write_unlock(&list_lock);
...
if (rcuref_dec_and_test(&el->rc))
call_rcu(&el->head, el_free);
...
}
Sometimes, reference to the element need to be obtained in the
update (write) stream. In such cases, rcuref_inc_lf might be an overkill
since the spinlock serialising list updates are held. rcuref_inc
is to be used in such cases.
For arches which do not have cmpxchg rcuref_inc_lf
api uses a hashed spinlock implementation and the same hashed spinlock
is acquired in all rcuref_xxx primitives to preserve atomicity.
Note: Use rcuref_inc api only if you need to use rcuref_inc_lf on the
refcounter atleast at one place. Mixing rcuref_inc and atomic_xxx api
might lead to races. rcuref_inc_lf() must be used in lockfree
RCU critical sections only.

View File

@ -0,0 +1,902 @@
What is RCU?
RCU is a synchronization mechanism that was added to the Linux kernel
during the 2.5 development effort that is optimized for read-mostly
situations. Although RCU is actually quite simple once you understand it,
getting there can sometimes be a challenge. Part of the problem is that
most of the past descriptions of RCU have been written with the mistaken
assumption that there is "one true way" to describe RCU. Instead,
the experience has been that different people must take different paths
to arrive at an understanding of RCU. This document provides several
different paths, as follows:
1. RCU OVERVIEW
2. WHAT IS RCU'S CORE API?
3. WHAT ARE SOME EXAMPLE USES OF CORE RCU API?
4. WHAT IF MY UPDATING THREAD CANNOT BLOCK?
5. WHAT ARE SOME SIMPLE IMPLEMENTATIONS OF RCU?
6. ANALOGY WITH READER-WRITER LOCKING
7. FULL LIST OF RCU APIs
8. ANSWERS TO QUICK QUIZZES
People who prefer starting with a conceptual overview should focus on
Section 1, though most readers will profit by reading this section at
some point. People who prefer to start with an API that they can then
experiment with should focus on Section 2. People who prefer to start
with example uses should focus on Sections 3 and 4. People who need to
understand the RCU implementation should focus on Section 5, then dive
into the kernel source code. People who reason best by analogy should
focus on Section 6. Section 7 serves as an index to the docbook API
documentation, and Section 8 is the traditional answer key.
So, start with the section that makes the most sense to you and your
preferred method of learning. If you need to know everything about
everything, feel free to read the whole thing -- but if you are really
that type of person, you have perused the source code and will therefore
never need this document anyway. ;-)
1. RCU OVERVIEW
The basic idea behind RCU is to split updates into "removal" and
"reclamation" phases. The removal phase removes references to data items
within a data structure (possibly by replacing them with references to
new versions of these data items), and can run concurrently with readers.
The reason that it is safe to run the removal phase concurrently with
readers is the semantics of modern CPUs guarantee that readers will see
either the old or the new version of the data structure rather than a
partially updated reference. The reclamation phase does the work of reclaiming
(e.g., freeing) the data items removed from the data structure during the
removal phase. Because reclaiming data items can disrupt any readers
concurrently referencing those data items, the reclamation phase must
not start until readers no longer hold references to those data items.
Splitting the update into removal and reclamation phases permits the
updater to perform the removal phase immediately, and to defer the
reclamation phase until all readers active during the removal phase have
completed, either by blocking until they finish or by registering a
callback that is invoked after they finish. Only readers that are active
during the removal phase need be considered, because any reader starting
after the removal phase will be unable to gain a reference to the removed
data items, and therefore cannot be disrupted by the reclamation phase.
So the typical RCU update sequence goes something like the following:
a. Remove pointers to a data structure, so that subsequent
readers cannot gain a reference to it.
b. Wait for all previous readers to complete their RCU read-side
critical sections.
c. At this point, there cannot be any readers who hold references
to the data structure, so it now may safely be reclaimed
(e.g., kfree()d).
Step (b) above is the key idea underlying RCU's deferred destruction.
The ability to wait until all readers are done allows RCU readers to
use much lighter-weight synchronization, in some cases, absolutely no
synchronization at all. In contrast, in more conventional lock-based
schemes, readers must use heavy-weight synchronization in order to
prevent an updater from deleting the data structure out from under them.
This is because lock-based updaters typically update data items in place,
and must therefore exclude readers. In contrast, RCU-based updaters
typically take advantage of the fact that writes to single aligned
pointers are atomic on modern CPUs, allowing atomic insertion, removal,
and replacement of data items in a linked structure without disrupting
readers. Concurrent RCU readers can then continue accessing the old
versions, and can dispense with the atomic operations, memory barriers,
and communications cache misses that are so expensive on present-day
SMP computer systems, even in absence of lock contention.
In the three-step procedure shown above, the updater is performing both
the removal and the reclamation step, but it is often helpful for an
entirely different thread to do the reclamation, as is in fact the case
in the Linux kernel's directory-entry cache (dcache). Even if the same
thread performs both the update step (step (a) above) and the reclamation
step (step (c) above), it is often helpful to think of them separately.
For example, RCU readers and updaters need not communicate at all,
but RCU provides implicit low-overhead communication between readers
and reclaimers, namely, in step (b) above.
So how the heck can a reclaimer tell when a reader is done, given
that readers are not doing any sort of synchronization operations???
Read on to learn about how RCU's API makes this easy.
2. WHAT IS RCU'S CORE API?
The core RCU API is quite small:
a. rcu_read_lock()
b. rcu_read_unlock()
c. synchronize_rcu() / call_rcu()
d. rcu_assign_pointer()
e. rcu_dereference()
There are many other members of the RCU API, but the rest can be
expressed in terms of these five, though most implementations instead
express synchronize_rcu() in terms of the call_rcu() callback API.
The five core RCU APIs are described below, the other 18 will be enumerated
later. See the kernel docbook documentation for more info, or look directly
at the function header comments.
rcu_read_lock()
void rcu_read_lock(void);
Used by a reader to inform the reclaimer that the reader is
entering an RCU read-side critical section. It is illegal
to block while in an RCU read-side critical section, though
kernels built with CONFIG_PREEMPT_RCU can preempt RCU read-side
critical sections. Any RCU-protected data structure accessed
during an RCU read-side critical section is guaranteed to remain
unreclaimed for the full duration of that critical section.
Reference counts may be used in conjunction with RCU to maintain
longer-term references to data structures.
rcu_read_unlock()
void rcu_read_unlock(void);
Used by a reader to inform the reclaimer that the reader is
exiting an RCU read-side critical section. Note that RCU
read-side critical sections may be nested and/or overlapping.
synchronize_rcu()
void synchronize_rcu(void);
Marks the end of updater code and the beginning of reclaimer
code. It does this by blocking until all pre-existing RCU
read-side critical sections on all CPUs have completed.
Note that synchronize_rcu() will -not- necessarily wait for
any subsequent RCU read-side critical sections to complete.
For example, consider the following sequence of events:
CPU 0 CPU 1 CPU 2
----------------- ------------------------- ---------------
1. rcu_read_lock()
2. enters synchronize_rcu()
3. rcu_read_lock()
4. rcu_read_unlock()
5. exits synchronize_rcu()
6. rcu_read_unlock()
To reiterate, synchronize_rcu() waits only for ongoing RCU
read-side critical sections to complete, not necessarily for
any that begin after synchronize_rcu() is invoked.
Of course, synchronize_rcu() does not necessarily return
-immediately- after the last pre-existing RCU read-side critical
section completes. For one thing, there might well be scheduling
delays. For another thing, many RCU implementations process
requests in batches in order to improve efficiencies, which can
further delay synchronize_rcu().
Since synchronize_rcu() is the API that must figure out when
readers are done, its implementation is key to RCU. For RCU
to be useful in all but the most read-intensive situations,
synchronize_rcu()'s overhead must also be quite small.
The call_rcu() API is a callback form of synchronize_rcu(),
and is described in more detail in a later section. Instead of
blocking, it registers a function and argument which are invoked
after all ongoing RCU read-side critical sections have completed.
This callback variant is particularly useful in situations where
it is illegal to block.
rcu_assign_pointer()
typeof(p) rcu_assign_pointer(p, typeof(p) v);
Yes, rcu_assign_pointer() -is- implemented as a macro, though it
would be cool to be able to declare a function in this manner.
(Compiler experts will no doubt disagree.)
The updater uses this function to assign a new value to an
RCU-protected pointer, in order to safely communicate the change
in value from the updater to the reader. This function returns
the new value, and also executes any memory-barrier instructions
required for a given CPU architecture.
Perhaps more important, it serves to document which pointers
are protected by RCU. That said, rcu_assign_pointer() is most
frequently used indirectly, via the _rcu list-manipulation
primitives such as list_add_rcu().
rcu_dereference()
typeof(p) rcu_dereference(p);
Like rcu_assign_pointer(), rcu_dereference() must be implemented
as a macro.
The reader uses rcu_dereference() to fetch an RCU-protected
pointer, which returns a value that may then be safely
dereferenced. Note that rcu_deference() does not actually
dereference the pointer, instead, it protects the pointer for
later dereferencing. It also executes any needed memory-barrier
instructions for a given CPU architecture. Currently, only Alpha
needs memory barriers within rcu_dereference() -- on other CPUs,
it compiles to nothing, not even a compiler directive.
Common coding practice uses rcu_dereference() to copy an
RCU-protected pointer to a local variable, then dereferences
this local variable, for example as follows:
p = rcu_dereference(head.next);
return p->data;
However, in this case, one could just as easily combine these
into one statement:
return rcu_dereference(head.next)->data;
If you are going to be fetching multiple fields from the
RCU-protected structure, using the local variable is of
course preferred. Repeated rcu_dereference() calls look
ugly and incur unnecessary overhead on Alpha CPUs.
Note that the value returned by rcu_dereference() is valid
only within the enclosing RCU read-side critical section.
For example, the following is -not- legal:
rcu_read_lock();
p = rcu_dereference(head.next);
rcu_read_unlock();
x = p->address;
rcu_read_lock();
y = p->data;
rcu_read_unlock();
Holding a reference from one RCU read-side critical section
to another is just as illegal as holding a reference from
one lock-based critical section to another! Similarly,
using a reference outside of the critical section in which
it was acquired is just as illegal as doing so with normal
locking.
As with rcu_assign_pointer(), an important function of
rcu_dereference() is to document which pointers are protected
by RCU. And, again like rcu_assign_pointer(), rcu_dereference()
is typically used indirectly, via the _rcu list-manipulation
primitives, such as list_for_each_entry_rcu().
The following diagram shows how each API communicates among the
reader, updater, and reclaimer.
rcu_assign_pointer()
+--------+
+---------------------->| reader |---------+
| +--------+ |
| | |
| | | Protect:
| | | rcu_read_lock()
| | | rcu_read_unlock()
| rcu_dereference() | |
+---------+ | |
| updater |<---------------------+ |
+---------+ V
| +-----------+
+----------------------------------->| reclaimer |
+-----------+
Defer:
synchronize_rcu() & call_rcu()
The RCU infrastructure observes the time sequence of rcu_read_lock(),
rcu_read_unlock(), synchronize_rcu(), and call_rcu() invocations in
order to determine when (1) synchronize_rcu() invocations may return
to their callers and (2) call_rcu() callbacks may be invoked. Efficient
implementations of the RCU infrastructure make heavy use of batching in
order to amortize their overhead over many uses of the corresponding APIs.
There are no fewer than three RCU mechanisms in the Linux kernel; the
diagram above shows the first one, which is by far the most commonly used.
The rcu_dereference() and rcu_assign_pointer() primitives are used for
all three mechanisms, but different defer and protect primitives are
used as follows:
Defer Protect
a. synchronize_rcu() rcu_read_lock() / rcu_read_unlock()
call_rcu()
b. call_rcu_bh() rcu_read_lock_bh() / rcu_read_unlock_bh()
c. synchronize_sched() preempt_disable() / preempt_enable()
local_irq_save() / local_irq_restore()
hardirq enter / hardirq exit
NMI enter / NMI exit
These three mechanisms are used as follows:
a. RCU applied to normal data structures.
b. RCU applied to networking data structures that may be subjected
to remote denial-of-service attacks.
c. RCU applied to scheduler and interrupt/NMI-handler tasks.
Again, most uses will be of (a). The (b) and (c) cases are important
for specialized uses, but are relatively uncommon.
3. WHAT ARE SOME EXAMPLE USES OF CORE RCU API?
This section shows a simple use of the core RCU API to protect a
global pointer to a dynamically allocated structure. More typical
uses of RCU may be found in listRCU.txt, arrayRCU.txt, and NMI-RCU.txt.
struct foo {
int a;
char b;
long c;
};
DEFINE_SPINLOCK(foo_mutex);
struct foo *gbl_foo;
/*
* Create a new struct foo that is the same as the one currently
* pointed to by gbl_foo, except that field "a" is replaced
* with "new_a". Points gbl_foo to the new structure, and
* frees up the old structure after a grace period.
*
* Uses rcu_assign_pointer() to ensure that concurrent readers
* see the initialized version of the new structure.
*
* Uses synchronize_rcu() to ensure that any readers that might
* have references to the old structure complete before freeing
* the old structure.
*/
void foo_update_a(int new_a)
{
struct foo *new_fp;
struct foo *old_fp;
new_fp = kmalloc(sizeof(*fp), GFP_KERNEL);
spin_lock(&foo_mutex);
old_fp = gbl_foo;
*new_fp = *old_fp;
new_fp->a = new_a;
rcu_assign_pointer(gbl_foo, new_fp);
spin_unlock(&foo_mutex);
synchronize_rcu();
kfree(old_fp);
}
/*
* Return the value of field "a" of the current gbl_foo
* structure. Use rcu_read_lock() and rcu_read_unlock()
* to ensure that the structure does not get deleted out
* from under us, and use rcu_dereference() to ensure that
* we see the initialized version of the structure (important
* for DEC Alpha and for people reading the code).
*/
int foo_get_a(void)
{
int retval;
rcu_read_lock();
retval = rcu_dereference(gbl_foo)->a;
rcu_read_unlock();
return retval;
}
So, to sum up:
o Use rcu_read_lock() and rcu_read_unlock() to guard RCU
read-side critical sections.
o Within an RCU read-side critical section, use rcu_dereference()
to dereference RCU-protected pointers.
o Use some solid scheme (such as locks or semaphores) to
keep concurrent updates from interfering with each other.
o Use rcu_assign_pointer() to update an RCU-protected pointer.
This primitive protects concurrent readers from the updater,
-not- concurrent updates from each other! You therefore still
need to use locking (or something similar) to keep concurrent
rcu_assign_pointer() primitives from interfering with each other.
o Use synchronize_rcu() -after- removing a data element from an
RCU-protected data structure, but -before- reclaiming/freeing
the data element, in order to wait for the completion of all
RCU read-side critical sections that might be referencing that
data item.
See checklist.txt for additional rules to follow when using RCU.
4. WHAT IF MY UPDATING THREAD CANNOT BLOCK?
In the example above, foo_update_a() blocks until a grace period elapses.
This is quite simple, but in some cases one cannot afford to wait so
long -- there might be other high-priority work to be done.
In such cases, one uses call_rcu() rather than synchronize_rcu().
The call_rcu() API is as follows:
void call_rcu(struct rcu_head * head,
void (*func)(struct rcu_head *head));
This function invokes func(head) after a grace period has elapsed.
This invocation might happen from either softirq or process context,
so the function is not permitted to block. The foo struct needs to
have an rcu_head structure added, perhaps as follows:
struct foo {
int a;
char b;
long c;
struct rcu_head rcu;
};
The foo_update_a() function might then be written as follows:
/*
* Create a new struct foo that is the same as the one currently
* pointed to by gbl_foo, except that field "a" is replaced
* with "new_a". Points gbl_foo to the new structure, and
* frees up the old structure after a grace period.
*
* Uses rcu_assign_pointer() to ensure that concurrent readers
* see the initialized version of the new structure.
*
* Uses call_rcu() to ensure that any readers that might have
* references to the old structure complete before freeing the
* old structure.
*/
void foo_update_a(int new_a)
{
struct foo *new_fp;
struct foo *old_fp;
new_fp = kmalloc(sizeof(*fp), GFP_KERNEL);
spin_lock(&foo_mutex);
old_fp = gbl_foo;
*new_fp = *old_fp;
new_fp->a = new_a;
rcu_assign_pointer(gbl_foo, new_fp);
spin_unlock(&foo_mutex);
call_rcu(&old_fp->rcu, foo_reclaim);
}
The foo_reclaim() function might appear as follows:
void foo_reclaim(struct rcu_head *rp)
{
struct foo *fp = container_of(rp, struct foo, rcu);
kfree(fp);
}
The container_of() primitive is a macro that, given a pointer into a
struct, the type of the struct, and the pointed-to field within the
struct, returns a pointer to the beginning of the struct.
The use of call_rcu() permits the caller of foo_update_a() to
immediately regain control, without needing to worry further about the
old version of the newly updated element. It also clearly shows the
RCU distinction between updater, namely foo_update_a(), and reclaimer,
namely foo_reclaim().
The summary of advice is the same as for the previous section, except
that we are now using call_rcu() rather than synchronize_rcu():
o Use call_rcu() -after- removing a data element from an
RCU-protected data structure in order to register a callback
function that will be invoked after the completion of all RCU
read-side critical sections that might be referencing that
data item.
Again, see checklist.txt for additional rules governing the use of RCU.
5. WHAT ARE SOME SIMPLE IMPLEMENTATIONS OF RCU?
One of the nice things about RCU is that it has extremely simple "toy"
implementations that are a good first step towards understanding the
production-quality implementations in the Linux kernel. This section
presents two such "toy" implementations of RCU, one that is implemented
in terms of familiar locking primitives, and another that more closely
resembles "classic" RCU. Both are way too simple for real-world use,
lacking both functionality and performance. However, they are useful
in getting a feel for how RCU works. See kernel/rcupdate.c for a
production-quality implementation, and see:
http://www.rdrop.com/users/paulmck/RCU
for papers describing the Linux kernel RCU implementation. The OLS'01
and OLS'02 papers are a good introduction, and the dissertation provides
more details on the current implementation.
5A. "TOY" IMPLEMENTATION #1: LOCKING
This section presents a "toy" RCU implementation that is based on
familiar locking primitives. Its overhead makes it a non-starter for
real-life use, as does its lack of scalability. It is also unsuitable
for realtime use, since it allows scheduling latency to "bleed" from
one read-side critical section to another.
However, it is probably the easiest implementation to relate to, so is
a good starting point.
It is extremely simple:
static DEFINE_RWLOCK(rcu_gp_mutex);
void rcu_read_lock(void)
{
read_lock(&rcu_gp_mutex);
}
void rcu_read_unlock(void)
{
read_unlock(&rcu_gp_mutex);
}
void synchronize_rcu(void)
{
write_lock(&rcu_gp_mutex);
write_unlock(&rcu_gp_mutex);
}
[You can ignore rcu_assign_pointer() and rcu_dereference() without
missing much. But here they are anyway. And whatever you do, don't
forget about them when submitting patches making use of RCU!]
#define rcu_assign_pointer(p, v) ({ \
smp_wmb(); \
(p) = (v); \
})
#define rcu_dereference(p) ({ \
typeof(p) _________p1 = p; \
smp_read_barrier_depends(); \
(_________p1); \
})
The rcu_read_lock() and rcu_read_unlock() primitive read-acquire
and release a global reader-writer lock. The synchronize_rcu()
primitive write-acquires this same lock, then immediately releases
it. This means that once synchronize_rcu() exits, all RCU read-side
critical sections that were in progress before synchonize_rcu() was
called are guaranteed to have completed -- there is no way that
synchronize_rcu() would have been able to write-acquire the lock
otherwise.
It is possible to nest rcu_read_lock(), since reader-writer locks may
be recursively acquired. Note also that rcu_read_lock() is immune
from deadlock (an important property of RCU). The reason for this is
that the only thing that can block rcu_read_lock() is a synchronize_rcu().
But synchronize_rcu() does not acquire any locks while holding rcu_gp_mutex,
so there can be no deadlock cycle.
Quick Quiz #1: Why is this argument naive? How could a deadlock
occur when using this algorithm in a real-world Linux
kernel? How could this deadlock be avoided?
5B. "TOY" EXAMPLE #2: CLASSIC RCU
This section presents a "toy" RCU implementation that is based on
"classic RCU". It is also short on performance (but only for updates) and
on features such as hotplug CPU and the ability to run in CONFIG_PREEMPT
kernels. The definitions of rcu_dereference() and rcu_assign_pointer()
are the same as those shown in the preceding section, so they are omitted.
void rcu_read_lock(void) { }
void rcu_read_unlock(void) { }
void synchronize_rcu(void)
{
int cpu;
for_each_cpu(cpu)
run_on(cpu);
}
Note that rcu_read_lock() and rcu_read_unlock() do absolutely nothing.
This is the great strength of classic RCU in a non-preemptive kernel:
read-side overhead is precisely zero, at least on non-Alpha CPUs.
And there is absolutely no way that rcu_read_lock() can possibly
participate in a deadlock cycle!
The implementation of synchronize_rcu() simply schedules itself on each
CPU in turn. The run_on() primitive can be implemented straightforwardly
in terms of the sched_setaffinity() primitive. Of course, a somewhat less
"toy" implementation would restore the affinity upon completion rather
than just leaving all tasks running on the last CPU, but when I said
"toy", I meant -toy-!
So how the heck is this supposed to work???
Remember that it is illegal to block while in an RCU read-side critical
section. Therefore, if a given CPU executes a context switch, we know
that it must have completed all preceding RCU read-side critical sections.
Once -all- CPUs have executed a context switch, then -all- preceding
RCU read-side critical sections will have completed.
So, suppose that we remove a data item from its structure and then invoke
synchronize_rcu(). Once synchronize_rcu() returns, we are guaranteed
that there are no RCU read-side critical sections holding a reference
to that data item, so we can safely reclaim it.
Quick Quiz #2: Give an example where Classic RCU's read-side
overhead is -negative-.
Quick Quiz #3: If it is illegal to block in an RCU read-side
critical section, what the heck do you do in
PREEMPT_RT, where normal spinlocks can block???
6. ANALOGY WITH READER-WRITER LOCKING
Although RCU can be used in many different ways, a very common use of
RCU is analogous to reader-writer locking. The following unified
diff shows how closely related RCU and reader-writer locking can be.
@@ -13,15 +14,15 @@
struct list_head *lp;
struct el *p;
- read_lock();
- list_for_each_entry(p, head, lp) {
+ rcu_read_lock();
+ list_for_each_entry_rcu(p, head, lp) {
if (p->key == key) {
*result = p->data;
- read_unlock();
+ rcu_read_unlock();
return 1;
}
}
- read_unlock();
+ rcu_read_unlock();
return 0;
}
@@ -29,15 +30,16 @@
{
struct el *p;
- write_lock(&listmutex);
+ spin_lock(&listmutex);
list_for_each_entry(p, head, lp) {
if (p->key == key) {
list_del(&p->list);
- write_unlock(&listmutex);
+ spin_unlock(&listmutex);
+ synchronize_rcu();
kfree(p);
return 1;
}
}
- write_unlock(&listmutex);
+ spin_unlock(&listmutex);
return 0;
}
Or, for those who prefer a side-by-side listing:
1 struct el { 1 struct el {
2 struct list_head list; 2 struct list_head list;
3 long key; 3 long key;
4 spinlock_t mutex; 4 spinlock_t mutex;
5 int data; 5 int data;
6 /* Other data fields */ 6 /* Other data fields */
7 }; 7 };
8 spinlock_t listmutex; 8 spinlock_t listmutex;
9 struct el head; 9 struct el head;
1 int search(long key, int *result) 1 int search(long key, int *result)
2 { 2 {
3 struct list_head *lp; 3 struct list_head *lp;
4 struct el *p; 4 struct el *p;
5 5
6 read_lock(); 6 rcu_read_lock();
7 list_for_each_entry(p, head, lp) { 7 list_for_each_entry_rcu(p, head, lp) {
8 if (p->key == key) { 8 if (p->key == key) {
9 *result = p->data; 9 *result = p->data;
10 read_unlock(); 10 rcu_read_unlock();
11 return 1; 11 return 1;
12 } 12 }
13 } 13 }
14 read_unlock(); 14 rcu_read_unlock();
15 return 0; 15 return 0;
16 } 16 }
1 int delete(long key) 1 int delete(long key)
2 { 2 {
3 struct el *p; 3 struct el *p;
4 4
5 write_lock(&listmutex); 5 spin_lock(&listmutex);
6 list_for_each_entry(p, head, lp) { 6 list_for_each_entry(p, head, lp) {
7 if (p->key == key) { 7 if (p->key == key) {
8 list_del(&p->list); 8 list_del(&p->list);
9 write_unlock(&listmutex); 9 spin_unlock(&listmutex);
10 synchronize_rcu();
10 kfree(p); 11 kfree(p);
11 return 1; 12 return 1;
12 } 13 }
13 } 14 }
14 write_unlock(&listmutex); 15 spin_unlock(&listmutex);
15 return 0; 16 return 0;
16 } 17 }
Either way, the differences are quite small. Read-side locking moves
to rcu_read_lock() and rcu_read_unlock, update-side locking moves from
from a reader-writer lock to a simple spinlock, and a synchronize_rcu()
precedes the kfree().
However, there is one potential catch: the read-side and update-side
critical sections can now run concurrently. In many cases, this will
not be a problem, but it is necessary to check carefully regardless.
For example, if multiple independent list updates must be seen as
a single atomic update, converting to RCU will require special care.
Also, the presence of synchronize_rcu() means that the RCU version of
delete() can now block. If this is a problem, there is a callback-based
mechanism that never blocks, namely call_rcu(), that can be used in
place of synchronize_rcu().
7. FULL LIST OF RCU APIs
The RCU APIs are documented in docbook-format header comments in the
Linux-kernel source code, but it helps to have a full list of the
APIs, since there does not appear to be a way to categorize them
in docbook. Here is the list, by category.
Markers for RCU read-side critical sections:
rcu_read_lock
rcu_read_unlock
rcu_read_lock_bh
rcu_read_unlock_bh
RCU pointer/list traversal:
rcu_dereference
list_for_each_rcu (to be deprecated in favor of
list_for_each_entry_rcu)
list_for_each_safe_rcu (deprecated, not used)
list_for_each_entry_rcu
list_for_each_continue_rcu (to be deprecated in favor of new
list_for_each_entry_continue_rcu)
hlist_for_each_rcu (to be deprecated in favor of
hlist_for_each_entry_rcu)
hlist_for_each_entry_rcu
RCU pointer update:
rcu_assign_pointer
list_add_rcu
list_add_tail_rcu
list_del_rcu
list_replace_rcu
hlist_del_rcu
hlist_add_head_rcu
RCU grace period:
synchronize_kernel (deprecated)
synchronize_net
synchronize_sched
synchronize_rcu
call_rcu
call_rcu_bh
See the comment headers in the source code (or the docbook generated
from them) for more information.
8. ANSWERS TO QUICK QUIZZES
Quick Quiz #1: Why is this argument naive? How could a deadlock
occur when using this algorithm in a real-world Linux
kernel? [Referring to the lock-based "toy" RCU
algorithm.]
Answer: Consider the following sequence of events:
1. CPU 0 acquires some unrelated lock, call it
"problematic_lock".
2. CPU 1 enters synchronize_rcu(), write-acquiring
rcu_gp_mutex.
3. CPU 0 enters rcu_read_lock(), but must wait
because CPU 1 holds rcu_gp_mutex.
4. CPU 1 is interrupted, and the irq handler
attempts to acquire problematic_lock.
The system is now deadlocked.
One way to avoid this deadlock is to use an approach like
that of CONFIG_PREEMPT_RT, where all normal spinlocks
become blocking locks, and all irq handlers execute in
the context of special tasks. In this case, in step 4
above, the irq handler would block, allowing CPU 1 to
release rcu_gp_mutex, avoiding the deadlock.
Even in the absence of deadlock, this RCU implementation
allows latency to "bleed" from readers to other
readers through synchronize_rcu(). To see this,
consider task A in an RCU read-side critical section
(thus read-holding rcu_gp_mutex), task B blocked
attempting to write-acquire rcu_gp_mutex, and
task C blocked in rcu_read_lock() attempting to
read_acquire rcu_gp_mutex. Task A's RCU read-side
latency is holding up task C, albeit indirectly via
task B.
Realtime RCU implementations therefore use a counter-based
approach where tasks in RCU read-side critical sections
cannot be blocked by tasks executing synchronize_rcu().
Quick Quiz #2: Give an example where Classic RCU's read-side
overhead is -negative-.
Answer: Imagine a single-CPU system with a non-CONFIG_PREEMPT
kernel where a routing table is used by process-context
code, but can be updated by irq-context code (for example,
by an "ICMP REDIRECT" packet). The usual way of handling
this would be to have the process-context code disable
interrupts while searching the routing table. Use of
RCU allows such interrupt-disabling to be dispensed with.
Thus, without RCU, you pay the cost of disabling interrupts,
and with RCU you don't.
One can argue that the overhead of RCU in this
case is negative with respect to the single-CPU
interrupt-disabling approach. Others might argue that
the overhead of RCU is merely zero, and that replacing
the positive overhead of the interrupt-disabling scheme
with the zero-overhead RCU scheme does not constitute
negative overhead.
In real life, of course, things are more complex. But
even the theoretical possibility of negative overhead for
a synchronization primitive is a bit unexpected. ;-)
Quick Quiz #3: If it is illegal to block in an RCU read-side
critical section, what the heck do you do in
PREEMPT_RT, where normal spinlocks can block???
Answer: Just as PREEMPT_RT permits preemption of spinlock
critical sections, it permits preemption of RCU
read-side critical sections. It also permits
spinlocks blocking while in RCU read-side critical
sections.
Why the apparent inconsistency? Because it is it
possible to use priority boosting to keep the RCU
grace periods short if need be (for example, if running
short of memory). In contrast, if blocking waiting
for (say) network reception, there is no way to know
what should be boosted. Especially given that the
process we need to boost might well be a human being
who just went out for a pizza or something. And although
a computer-operated cattle prod might arouse serious
interest, it might also provoke serious objections.
Besides, how does the computer know what pizza parlor
the human being went to???
ACKNOWLEDGEMENTS
My thanks to the people who helped make this human-readable, including
Jon Walpole, Josh Triplett, Serge Hallyn, and Suzanne Wood.
For more information, see http://www.rdrop.com/users/paulmck/RCU.

View File

@ -33,3 +33,6 @@ The result of the execution of this aml method is
attached to /proc/acpi/hotkey/poll_method, which is dnyamically
created. Please use command "cat /proc/acpi/hotkey/polling_method"
to retrieve it.
Note: Use cmdline "acpi_generic_hotkey" to over-ride
platform-specific with generic driver.

View File

@ -8,13 +8,15 @@ fi
n_partitions=${n_partitions:-16}
dir=$1
shelf=$2
nslots=16
maxslot=`echo $nslots 1 - p | dc`
MAJOR=152
set -e
minor=`echo 10 \* $shelf \* $n_partitions | bc`
minor=`echo $nslots \* $shelf \* $n_partitions | bc`
endp=`echo $n_partitions - 1 | bc`
for slot in `seq 0 9`; do
for slot in `seq 0 $maxslot`; do
for part in `seq 0 $endp`; do
name=e$shelf.$slot
test "$part" != "0" && name=${name}p$part

View File

@ -0,0 +1,439 @@
Applying Patches To The Linux Kernel
------------------------------------
(Written by Jesper Juhl, August 2005)
A frequently asked question on the Linux Kernel Mailing List is how to apply
a patch to the kernel or, more specifically, what base kernel a patch for
one of the many trees/branches should be applied to. Hopefully this document
will explain this to you.
In addition to explaining how to apply and revert patches, a brief
description of the different kernel trees (and examples of how to apply
their specific patches) is also provided.
What is a patch?
---
A patch is a small text document containing a delta of changes between two
different versions of a source tree. Patches are created with the `diff'
program.
To correctly apply a patch you need to know what base it was generated from
and what new version the patch will change the source tree into. These
should both be present in the patch file metadata or be possible to deduce
from the filename.
How do I apply or revert a patch?
---
You apply a patch with the `patch' program. The patch program reads a diff
(or patch) file and makes the changes to the source tree described in it.
Patches for the Linux kernel are generated relative to the parent directory
holding the kernel source dir.
This means that paths to files inside the patch file contain the name of the
kernel source directories it was generated against (or some other directory
names like "a/" and "b/").
Since this is unlikely to match the name of the kernel source dir on your
local machine (but is often useful info to see what version an otherwise
unlabeled patch was generated against) you should change into your kernel
source directory and then strip the first element of the path from filenames
in the patch file when applying it (the -p1 argument to `patch' does this).
To revert a previously applied patch, use the -R argument to patch.
So, if you applied a patch like this:
patch -p1 < ../patch-x.y.z
You can revert (undo) it like this:
patch -R -p1 < ../patch-x.y.z
How do I feed a patch/diff file to `patch'?
---
This (as usual with Linux and other UNIX like operating systems) can be
done in several different ways.
In all the examples below I feed the file (in uncompressed form) to patch
via stdin using the following syntax:
patch -p1 < path/to/patch-x.y.z
If you just want to be able to follow the examples below and don't want to
know of more than one way to use patch, then you can stop reading this
section here.
Patch can also get the name of the file to use via the -i argument, like
this:
patch -p1 -i path/to/patch-x.y.z
If your patch file is compressed with gzip or bzip2 and you don't want to
uncompress it before applying it, then you can feed it to patch like this
instead:
zcat path/to/patch-x.y.z.gz | patch -p1
bzcat path/to/patch-x.y.z.bz2 | patch -p1
If you wish to uncompress the patch file by hand first before applying it
(what I assume you've done in the examples below), then you simply run
gunzip or bunzip2 on the file - like this:
gunzip patch-x.y.z.gz
bunzip2 patch-x.y.z.bz2
Which will leave you with a plain text patch-x.y.z file that you can feed to
patch via stdin or the -i argument, as you prefer.
A few other nice arguments for patch are -s which causes patch to be silent
except for errors which is nice to prevent errors from scrolling out of the
screen too fast, and --dry-run which causes patch to just print a listing of
what would happen, but doesn't actually make any changes. Finally --verbose
tells patch to print more information about the work being done.
Common errors when patching
---
When patch applies a patch file it attempts to verify the sanity of the
file in different ways.
Checking that the file looks like a valid patch file, checking the code
around the bits being modified matches the context provided in the patch are
just two of the basic sanity checks patch does.
If patch encounters something that doesn't look quite right it has two
options. It can either refuse to apply the changes and abort or it can try
to find a way to make the patch apply with a few minor changes.
One example of something that's not 'quite right' that patch will attempt to
fix up is if all the context matches, the lines being changed match, but the
line numbers are different. This can happen, for example, if the patch makes
a change in the middle of the file but for some reasons a few lines have
been added or removed near the beginning of the file. In that case
everything looks good it has just moved up or down a bit, and patch will
usually adjust the line numbers and apply the patch.
Whenever patch applies a patch that it had to modify a bit to make it fit
it'll tell you about it by saying the patch applied with 'fuzz'.
You should be wary of such changes since even though patch probably got it
right it doesn't /always/ get it right, and the result will sometimes be
wrong.
When patch encounters a change that it can't fix up with fuzz it rejects it
outright and leaves a file with a .rej extension (a reject file). You can
read this file to see exactely what change couldn't be applied, so you can
go fix it up by hand if you wish.
If you don't have any third party patches applied to your kernel source, but
only patches from kernel.org and you apply the patches in the correct order,
and have made no modifications yourself to the source files, then you should
never see a fuzz or reject message from patch. If you do see such messages
anyway, then there's a high risk that either your local source tree or the
patch file is corrupted in some way. In that case you should probably try
redownloading the patch and if things are still not OK then you'd be advised
to start with a fresh tree downloaded in full from kernel.org.
Let's look a bit more at some of the messages patch can produce.
If patch stops and presents a "File to patch:" prompt, then patch could not
find a file to be patched. Most likely you forgot to specify -p1 or you are
in the wrong directory. Less often, you'll find patches that need to be
applied with -p0 instead of -p1 (reading the patch file should reveal if
this is the case - if so, then this is an error by the person who created
the patch but is not fatal).
If you get "Hunk #2 succeeded at 1887 with fuzz 2 (offset 7 lines)." or a
message similar to that, then it means that patch had to adjust the location
of the change (in this example it needed to move 7 lines from where it
expected to make the change to make it fit).
The resulting file may or may not be OK, depending on the reason the file
was different than expected.
This often happens if you try to apply a patch that was generated against a
different kernel version than the one you are trying to patch.
If you get a message like "Hunk #3 FAILED at 2387.", then it means that the
patch could not be applied correctly and the patch program was unable to
fuzz its way through. This will generate a .rej file with the change that
caused the patch to fail and also a .orig file showing you the original
content that couldn't be changed.
If you get "Reversed (or previously applied) patch detected! Assume -R? [n]"
then patch detected that the change contained in the patch seems to have
already been made.
If you actually did apply this patch previously and you just re-applied it
in error, then just say [n]o and abort this patch. If you applied this patch
previously and actually intended to revert it, but forgot to specify -R,
then you can say [y]es here to make patch revert it for you.
This can also happen if the creator of the patch reversed the source and
destination directories when creating the patch, and in that case reverting
the patch will in fact apply it.
A message similar to "patch: **** unexpected end of file in patch" or "patch
unexpectedly ends in middle of line" means that patch could make no sense of
the file you fed to it. Either your download is broken or you tried to feed
patch a compressed patch file without uncompressing it first.
As I already mentioned above, these errors should never happen if you apply
a patch from kernel.org to the correct version of an unmodified source tree.
So if you get these errors with kernel.org patches then you should probably
assume that either your patch file or your tree is broken and I'd advice you
to start over with a fresh download of a full kernel tree and the patch you
wish to apply.
Are there any alternatives to `patch'?
---
Yes there are alternatives. You can use the `interdiff' program
(http://cyberelk.net/tim/patchutils/) to generate a patch representing the
differences between two patches and then apply the result.
This will let you move from something like 2.6.12.2 to 2.6.12.3 in a single
step. The -z flag to interdiff will even let you feed it patches in gzip or
bzip2 compressed form directly without the use of zcat or bzcat or manual
decompression.
Here's how you'd go from 2.6.12.2 to 2.6.12.3 in a single step:
interdiff -z ../patch-2.6.12.2.bz2 ../patch-2.6.12.3.gz | patch -p1
Although interdiff may save you a step or two you are generally advised to
do the additional steps since interdiff can get things wrong in some cases.
Another alternative is `ketchup', which is a python script for automatic
downloading and applying of patches (http://www.selenic.com/ketchup/).
Other nice tools are diffstat which shows a summary of changes made by a
patch, lsdiff which displays a short listing of affected files in a patch
file, along with (optionally) the line numbers of the start of each patch
and grepdiff which displays a list of the files modified by a patch where
the patch contains a given regular expression.
Where can I download the patches?
---
The patches are available at http://kernel.org/
Most recent patches are linked from the front page, but they also have
specific homes.
The 2.6.x.y (-stable) and 2.6.x patches live at
ftp://ftp.kernel.org/pub/linux/kernel/v2.6/
The -rc patches live at
ftp://ftp.kernel.org/pub/linux/kernel/v2.6/testing/
The -git patches live at
ftp://ftp.kernel.org/pub/linux/kernel/v2.6/snapshots/
The -mm kernels live at
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/
In place of ftp.kernel.org you can use ftp.cc.kernel.org, where cc is a
country code. This way you'll be downloading from a mirror site that's most
likely geographically closer to you, resulting in faster downloads for you,
less bandwidth used globally and less load on the main kernel.org servers -
these are good things, do use mirrors when possible.
The 2.6.x kernels
---
These are the base stable releases released by Linus. The highest numbered
release is the most recent.
If regressions or other serious flaws are found then a -stable fix patch
will be released (see below) on top of this base. Once a new 2.6.x base
kernel is released, a patch is made available that is a delta between the
previous 2.6.x kernel and the new one.
To apply a patch moving from 2.6.11 to 2.6.12 you'd do the following (note
that such patches do *NOT* apply on top of 2.6.x.y kernels but on top of the
base 2.6.x kernel - if you need to move from 2.6.x.y to 2.6.x+1 you need to
first revert the 2.6.x.y patch).
Here are some examples:
# moving from 2.6.11 to 2.6.12
$ cd ~/linux-2.6.11 # change to kernel source dir
$ patch -p1 < ../patch-2.6.12 # apply the 2.6.12 patch
$ cd ..
$ mv linux-2.6.11 linux-2.6.12 # rename source dir
# moving from 2.6.11.1 to 2.6.12
$ cd ~/linux-2.6.11.1 # change to kernel source dir
$ patch -p1 -R < ../patch-2.6.11.1 # revert the 2.6.11.1 patch
# source dir is now 2.6.11
$ patch -p1 < ../patch-2.6.12 # apply new 2.6.12 patch
$ cd ..
$ mv linux-2.6.11.1 inux-2.6.12 # rename source dir
The 2.6.x.y kernels
---
Kernels with 4 digit versions are -stable kernels. They contain small(ish)
critical fixes for security problems or significant regressions discovered
in a given 2.6.x kernel.
This is the recommended branch for users who want the most recent stable
kernel and are not interested in helping test development/experimental
versions.
If no 2.6.x.y kernel is available, then the highest numbered 2.6.x kernel is
the current stable kernel.
These patches are not incremental, meaning that for example the 2.6.12.3
patch does not apply on top of the 2.6.12.2 kernel source, but rather on top
of the base 2.6.12 kernel source.
So, in order to apply the 2.6.12.3 patch to your existing 2.6.12.2 kernel
source you have to first back out the 2.6.12.2 patch (so you are left with a
base 2.6.12 kernel source) and then apply the new 2.6.12.3 patch.
Here's a small example:
$ cd ~/linux-2.6.12.2 # change into the kernel source dir
$ patch -p1 -R < ../patch-2.6.12.2 # revert the 2.6.12.2 patch
$ patch -p1 < ../patch-2.6.12.3 # apply the new 2.6.12.3 patch
$ cd ..
$ mv linux-2.6.12.2 linux-2.6.12.3 # rename the kernel source dir
The -rc kernels
---
These are release-candidate kernels. These are development kernels released
by Linus whenever he deems the current git (the kernel's source management
tool) tree to be in a reasonably sane state adequate for testing.
These kernels are not stable and you should expect occasional breakage if
you intend to run them. This is however the most stable of the main
development branches and is also what will eventually turn into the next
stable kernel, so it is important that it be tested by as many people as
possible.
This is a good branch to run for people who want to help out testing
development kernels but do not want to run some of the really experimental
stuff (such people should see the sections about -git and -mm kernels below).
The -rc patches are not incremental, they apply to a base 2.6.x kernel, just
like the 2.6.x.y patches described above. The kernel version before the -rcN
suffix denotes the version of the kernel that this -rc kernel will eventually
turn into.
So, 2.6.13-rc5 means that this is the fifth release candidate for the 2.6.13
kernel and the patch should be applied on top of the 2.6.12 kernel source.
Here are 3 examples of how to apply these patches:
# first an example of moving from 2.6.12 to 2.6.13-rc3
$ cd ~/linux-2.6.12 # change into the 2.6.12 source dir
$ patch -p1 < ../patch-2.6.13-rc3 # apply the 2.6.13-rc3 patch
$ cd ..
$ mv linux-2.6.12 linux-2.6.13-rc3 # rename the source dir
# now let's move from 2.6.13-rc3 to 2.6.13-rc5
$ cd ~/linux-2.6.13-rc3 # change into the 2.6.13-rc3 dir
$ patch -p1 -R < ../patch-2.6.13-rc3 # revert the 2.6.13-rc3 patch
$ patch -p1 < ../patch-2.6.13-rc5 # apply the new 2.6.13-rc5 patch
$ cd ..
$ mv linux-2.6.13-rc3 linux-2.6.13-rc5 # rename the source dir
# finally let's try and move from 2.6.12.3 to 2.6.13-rc5
$ cd ~/linux-2.6.12.3 # change to the kernel source dir
$ patch -p1 -R < ../patch-2.6.12.3 # revert the 2.6.12.3 patch
$ patch -p1 < ../patch-2.6.13-rc5 # apply new 2.6.13-rc5 patch
$ cd ..
$ mv linux-2.6.12.3 linux-2.6.13-rc5 # rename the kernel source dir
The -git kernels
---
These are daily snapshots of Linus' kernel tree (managed in a git
repository, hence the name).
These patches are usually released daily and represent the current state of
Linus' tree. They are more experimental than -rc kernels since they are
generated automatically without even a cursory glance to see if they are
sane.
-git patches are not incremental and apply either to a base 2.6.x kernel or
a base 2.6.x-rc kernel - you can see which from their name.
A patch named 2.6.12-git1 applies to the 2.6.12 kernel source and a patch
named 2.6.13-rc3-git2 applies to the source of the 2.6.13-rc3 kernel.
Here are some examples of how to apply these patches:
# moving from 2.6.12 to 2.6.12-git1
$ cd ~/linux-2.6.12 # change to the kernel source dir
$ patch -p1 < ../patch-2.6.12-git1 # apply the 2.6.12-git1 patch
$ cd ..
$ mv linux-2.6.12 linux-2.6.12-git1 # rename the kernel source dir
# moving from 2.6.12-git1 to 2.6.13-rc2-git3
$ cd ~/linux-2.6.12-git1 # change to the kernel source dir
$ patch -p1 -R < ../patch-2.6.12-git1 # revert the 2.6.12-git1 patch
# we now have a 2.6.12 kernel
$ patch -p1 < ../patch-2.6.13-rc2 # apply the 2.6.13-rc2 patch
# the kernel is now 2.6.13-rc2
$ patch -p1 < ../patch-2.6.13-rc2-git3 # apply the 2.6.13-rc2-git3 patch
# the kernel is now 2.6.13-rc2-git3
$ cd ..
$ mv linux-2.6.12-git1 linux-2.6.13-rc2-git3 # rename source dir
The -mm kernels
---
These are experimental kernels released by Andrew Morton.
The -mm tree serves as a sort of proving ground for new features and other
experimental patches.
Once a patch has proved its worth in -mm for a while Andrew pushes it on to
Linus for inclusion in mainline.
Although it's encouraged that patches flow to Linus via the -mm tree, this
is not always enforced.
Subsystem maintainers (or individuals) sometimes push their patches directly
to Linus, even though (or after) they have been merged and tested in -mm (or
sometimes even without prior testing in -mm).
You should generally strive to get your patches into mainline via -mm to
ensure maximum testing.
This branch is in constant flux and contains many experimental features, a
lot of debugging patches not appropriate for mainline etc and is the most
experimental of the branches described in this document.
These kernels are not appropriate for use on systems that are supposed to be
stable and they are more risky to run than any of the other branches (make
sure you have up-to-date backups - that goes for any experimental kernel but
even more so for -mm kernels).
These kernels in addition to all the other experimental patches they contain
usually also contain any changes in the mainline -git kernels available at
the time of release.
Testing of -mm kernels is greatly appreciated since the whole point of the
tree is to weed out regressions, crashes, data corruption bugs, build
breakage (and any other bug in general) before changes are merged into the
more stable mainline Linus tree.
But testers of -mm should be aware that breakage in this tree is more common
than in any other tree.
The -mm kernels are not released on a fixed schedule, but usually a few -mm
kernels are released in between each -rc kernel (1 to 3 is common).
The -mm kernels apply to either a base 2.6.x kernel (when no -rc kernels
have been released yet) or to a Linus -rc kernel.
Here are some examples of applying the -mm patches:
# moving from 2.6.12 to 2.6.12-mm1
$ cd ~/linux-2.6.12 # change to the 2.6.12 source dir
$ patch -p1 < ../2.6.12-mm1 # apply the 2.6.12-mm1 patch
$ cd ..
$ mv linux-2.6.12 linux-2.6.12-mm1 # rename the source appropriately
# moving from 2.6.12-mm1 to 2.6.13-rc3-mm3
$ cd ~/linux-2.6.12-mm1
$ patch -p1 -R < ../2.6.12-mm1 # revert the 2.6.12-mm1 patch
# we now have a 2.6.12 source
$ patch -p1 < ../patch-2.6.13-rc3 # apply the 2.6.13-rc3 patch
# we now have a 2.6.13-rc3 source
$ patch -p1 < ../2.6.13-rc3-mm3 # apply the 2.6.13-rc3-mm3 patch
$ cd ..
$ mv linux-2.6.12-mm1 linux-2.6.13-rc3-mm3 # rename the source dir
This concludes this list of explanations of the various kernel trees and I
hope you are now crystal clear on how to apply the various patches and help
testing the kernel.

View File

@ -68,7 +68,8 @@ it a better device citizen. Further thanks to Joel Katz
Porfiri Claudio <C.Porfiri@nisms.tei.ericsson.se> for patches
to make the driver work with the older CDU-510/515 series, and
Heiko Eissfeldt <heiko@colossus.escape.de> for pointing out that
the verify_area() checks were ignoring the results of said checks.
the verify_area() checks were ignoring the results of said checks
(note: verify_area() has since been replaced by access_ok()).
(Acknowledgments from Ron Jeppesen in the 0.3 release:)
Thanks to Corey Minyard who wrote the original CDU-31A driver on which

View File

@ -0,0 +1,194 @@
/*
* cn_test.c
*
* 2004-2005 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
* All rights reserved.
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
*/
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/skbuff.h>
#include <linux/timer.h>
#include "connector.h"
static struct cb_id cn_test_id = { 0x123, 0x456 };
static char cn_test_name[] = "cn_test";
static struct sock *nls;
static struct timer_list cn_test_timer;
void cn_test_callback(void *data)
{
struct cn_msg *msg = (struct cn_msg *)data;
printk("%s: %lu: idx=%x, val=%x, seq=%u, ack=%u, len=%d: %s.\n",
__func__, jiffies, msg->id.idx, msg->id.val,
msg->seq, msg->ack, msg->len, (char *)msg->data);
}
static int cn_test_want_notify(void)
{
struct cn_ctl_msg *ctl;
struct cn_notify_req *req;
struct cn_msg *msg = NULL;
int size, size0;
struct sk_buff *skb;
struct nlmsghdr *nlh;
u32 group = 1;
size0 = sizeof(*msg) + sizeof(*ctl) + 3 * sizeof(*req);
size = NLMSG_SPACE(size0);
skb = alloc_skb(size, GFP_ATOMIC);
if (!skb) {
printk(KERN_ERR "Failed to allocate new skb with size=%u.\n",
size);
return -ENOMEM;
}
nlh = NLMSG_PUT(skb, 0, 0x123, NLMSG_DONE, size - sizeof(*nlh));
msg = (struct cn_msg *)NLMSG_DATA(nlh);
memset(msg, 0, size0);
msg->id.idx = -1;
msg->id.val = -1;
msg->seq = 0x123;
msg->ack = 0x345;
msg->len = size0 - sizeof(*msg);
ctl = (struct cn_ctl_msg *)(msg + 1);
ctl->idx_notify_num = 1;
ctl->val_notify_num = 2;
ctl->group = group;
ctl->len = msg->len - sizeof(*ctl);
req = (struct cn_notify_req *)(ctl + 1);
/*
* Idx.
*/
req->first = cn_test_id.idx;
req->range = 10;
/*
* Val 0.
*/
req++;
req->first = cn_test_id.val;
req->range = 10;
/*
* Val 1.
*/
req++;
req->first = cn_test_id.val + 20;
req->range = 10;
NETLINK_CB(skb).dst_groups = ctl->group;
//netlink_broadcast(nls, skb, 0, ctl->group, GFP_ATOMIC);
netlink_unicast(nls, skb, 0, 0);
printk(KERN_INFO "Request was sent. Group=0x%x.\n", ctl->group);
return 0;
nlmsg_failure:
printk(KERN_ERR "Failed to send %u.%u\n", msg->seq, msg->ack);
kfree_skb(skb);
return -EINVAL;
}
static u32 cn_test_timer_counter;
static void cn_test_timer_func(unsigned long __data)
{
struct cn_msg *m;
char data[32];
m = kmalloc(sizeof(*m) + sizeof(data), GFP_ATOMIC);
if (m) {
memset(m, 0, sizeof(*m) + sizeof(data));
memcpy(&m->id, &cn_test_id, sizeof(m->id));
m->seq = cn_test_timer_counter;
m->len = sizeof(data);
m->len =
scnprintf(data, sizeof(data), "counter = %u",
cn_test_timer_counter) + 1;
memcpy(m + 1, data, m->len);
cn_netlink_send(m, 0, gfp_any());
kfree(m);
}
cn_test_timer_counter++;
mod_timer(&cn_test_timer, jiffies + HZ);
}
static int cn_test_init(void)
{
int err;
err = cn_add_callback(&cn_test_id, cn_test_name, cn_test_callback);
if (err)
goto err_out;
cn_test_id.val++;
err = cn_add_callback(&cn_test_id, cn_test_name, cn_test_callback);
if (err) {
cn_del_callback(&cn_test_id);
goto err_out;
}
init_timer(&cn_test_timer);
cn_test_timer.function = cn_test_timer_func;
cn_test_timer.expires = jiffies + HZ;
cn_test_timer.data = 0;
add_timer(&cn_test_timer);
return 0;
err_out:
if (nls && nls->sk_socket)
sock_release(nls->sk_socket);
return err;
}
static void cn_test_fini(void)
{
del_timer_sync(&cn_test_timer);
cn_del_callback(&cn_test_id);
cn_test_id.val--;
cn_del_callback(&cn_test_id);
if (nls && nls->sk_socket)
sock_release(nls->sk_socket);
}
module_init(cn_test_init);
module_exit(cn_test_fini);
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Evgeniy Polyakov <johnpol@2ka.mipt.ru>");
MODULE_DESCRIPTION("Connector's test module");

View File

@ -0,0 +1,133 @@
/*****************************************/
Kernel Connector.
/*****************************************/
Kernel connector - new netlink based userspace <-> kernel space easy
to use communication module.
Connector driver adds possibility to connect various agents using
netlink based network. One must register callback and
identifier. When driver receives special netlink message with
appropriate identifier, appropriate callback will be called.
From the userspace point of view it's quite straightforward:
socket();
bind();
send();
recv();
But if kernelspace want to use full power of such connections, driver
writer must create special sockets, must know about struct sk_buff
handling... Connector allows any kernelspace agents to use netlink
based networking for inter-process communication in a significantly
easier way:
int cn_add_callback(struct cb_id *id, char *name, void (*callback) (void *));
void cn_netlink_send(struct cn_msg *msg, u32 __group, int gfp_mask);
struct cb_id
{
__u32 idx;
__u32 val;
};
idx and val are unique identifiers which must be registered in
connector.h for in-kernel usage. void (*callback) (void *) - is a
callback function which will be called when message with above idx.val
will be received by connector core. Argument for that function must
be dereferenced to struct cn_msg *.
struct cn_msg
{
struct cb_id id;
__u32 seq;
__u32 ack;
__u32 len; /* Length of the following data */
__u8 data[0];
};
/*****************************************/
Connector interfaces.
/*****************************************/
int cn_add_callback(struct cb_id *id, char *name, void (*callback) (void *));
Registers new callback with connector core.
struct cb_id *id - unique connector's user identifier.
It must be registered in connector.h for legal in-kernel users.
char *name - connector's callback symbolic name.
void (*callback) (void *) - connector's callback.
Argument must be dereferenced to struct cn_msg *.
void cn_del_callback(struct cb_id *id);
Unregisters new callback with connector core.
struct cb_id *id - unique connector's user identifier.
void cn_netlink_send(struct cn_msg *msg, u32 __groups, int gfp_mask);
Sends message to the specified groups. It can be safely called from
any context, but may silently fail under strong memory pressure.
struct cn_msg * - message header(with attached data).
u32 __group - destination group.
If __group is zero, then appropriate group will
be searched through all registered connector users,
and message will be delivered to the group which was
created for user with the same ID as in msg.
If __group is not zero, then message will be delivered
to the specified group.
int gfp_mask - GFP mask.
Note: When registering new callback user, connector core assigns
netlink group to the user which is equal to it's id.idx.
/*****************************************/
Protocol description.
/*****************************************/
Current offers transport layer with fixed header. Recommended
protocol which uses such header is following:
msg->seq and msg->ack are used to determine message genealogy. When
someone sends message it puts there locally unique sequence and random
acknowledge numbers. Sequence number may be copied into
nlmsghdr->nlmsg_seq too.
Sequence number is incremented with each message to be sent.
If we expect reply to our message, then sequence number in received
message MUST be the same as in original message, and acknowledge
number MUST be the same + 1.
If we receive message and it's sequence number is not equal to one we
are expecting, then it is new message. If we receive message and it's
sequence number is the same as one we are expecting, but it's
acknowledge is not equal acknowledge number in original message + 1,
then it is new message.
Obviously, protocol header contains above id.
connector allows event notification in the following form: kernel
driver or userspace process can ask connector to notify it when
selected id's will be turned on or off(registered or unregistered it's
callback). It is done by sending special command to connector
driver(it also registers itself with id={-1, -1}).
As example of usage Documentation/connector now contains cn_test.c -
testing module which uses connector to request notification and to
send messages.
/*****************************************/
Reliability.
/*****************************************/
Netlink itself is not reliable protocol, that means that messages can
be lost due to memory pressure or process' receiving queue overflowed,
so caller is warned must be prepared. That is why struct cn_msg [main
connector's message header] contains u32 seq and u32 ack fields.

View File

@ -36,7 +36,7 @@ cpufreq stats provides following statistics (explained in detail below).
All the statistics will be from the time the stats driver has been inserted
to the time when a read of a particular statistic is done. Obviously, stats
driver will not have any information about the the frequcny transitions before
driver will not have any information about the frequency transitions before
the stats driver insertion.
--------------------------------------------------------------------------------

View File

@ -60,6 +60,18 @@ all of the cpus in the system. This removes any overhead due to
load balancing code trying to pull tasks outside of the cpu exclusive
cpuset only to be prevented by the tasks' cpus_allowed mask.
A cpuset that is mem_exclusive restricts kernel allocations for
page, buffer and other data commonly shared by the kernel across
multiple users. All cpusets, whether mem_exclusive or not, restrict
allocations of memory for user space. This enables configuring a
system so that several independent jobs can share common kernel
data, such as file system pages, while isolating each jobs user
allocation in its own cpuset. To do this, construct a large
mem_exclusive cpuset to hold all the jobs, and construct child,
non-mem_exclusive cpusets for each individual job. Only a small
amount of typical kernel memory, such as requests from interrupt
handlers, is allowed to be taken outside even a mem_exclusive cpuset.
User level code may create and destroy cpusets by name in the cpuset
virtual file system, manage the attributes and permissions of these
cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
@ -265,7 +277,7 @@ rewritten to the 'tasks' file of its cpuset. This is done to avoid
impacting the scheduler code in the kernel with a check for changes
in a tasks processor placement.
There is an exception to the above. If hotplug funtionality is used
There is an exception to the above. If hotplug functionality is used
to remove all the CPUs that are currently assigned to a cpuset,
then the kernel will automatically update the cpus_allowed of all
tasks attached to CPUs in that cpuset to allow all CPUs. When memory

View File

@ -223,6 +223,7 @@ CAST5 algorithm contributors:
TEA/XTEA algorithm contributors:
Aaron Grothe
Michael Ringe
Khazad algorithm contributors:
Aaron Grothe

View File

@ -1,4 +1,4 @@
Below is the orginal README file from the descore.shar package.
Below is the original README file from the descore.shar package.
------------------------------------------------------------------------------
des - fast & portable DES encryption & decryption.

91
Documentation/dcdbas.txt Normal file
View File

@ -0,0 +1,91 @@
Overview
The Dell Systems Management Base Driver provides a sysfs interface for
systems management software such as Dell OpenManage to perform system
management interrupts and host control actions (system power cycle or
power off after OS shutdown) on certain Dell systems.
Dell OpenManage requires this driver on the following Dell PowerEdge systems:
300, 1300, 1400, 400SC, 500SC, 1500SC, 1550, 600SC, 1600SC, 650, 1655MC,
700, and 750. Other Dell software such as the open source libsmbios project
is expected to make use of this driver, and it may include the use of this
driver on other Dell systems.
The Dell libsmbios project aims towards providing access to as much BIOS
information as possible. See http://linux.dell.com/libsmbios/main/ for
more information about the libsmbios project.
System Management Interrupt
On some Dell systems, systems management software must access certain
management information via a system management interrupt (SMI). The SMI data
buffer must reside in 32-bit address space, and the physical address of the
buffer is required for the SMI. The driver maintains the memory required for
the SMI and provides a way for the application to generate the SMI.
The driver creates the following sysfs entries for systems management
software to perform these system management interrupts:
/sys/devices/platform/dcdbas/smi_data
/sys/devices/platform/dcdbas/smi_data_buf_phys_addr
/sys/devices/platform/dcdbas/smi_data_buf_size
/sys/devices/platform/dcdbas/smi_request
Systems management software must perform the following steps to execute
a SMI using this driver:
1) Lock smi_data.
2) Write system management command to smi_data.
3) Write "1" to smi_request to generate a calling interface SMI or
"2" to generate a raw SMI.
4) Read system management command response from smi_data.
5) Unlock smi_data.
Host Control Action
Dell OpenManage supports a host control feature that allows the administrator
to perform a power cycle or power off of the system after the OS has finished
shutting down. On some Dell systems, this host control feature requires that
a driver perform a SMI after the OS has finished shutting down.
The driver creates the following sysfs entries for systems management software
to schedule the driver to perform a power cycle or power off host control
action after the system has finished shutting down:
/sys/devices/platform/dcdbas/host_control_action
/sys/devices/platform/dcdbas/host_control_smi_type
/sys/devices/platform/dcdbas/host_control_on_shutdown
Dell OpenManage performs the following steps to execute a power cycle or
power off host control action using this driver:
1) Write host control action to be performed to host_control_action.
2) Write type of SMI that driver needs to perform to host_control_smi_type.
3) Write "1" to host_control_on_shutdown to enable host control action.
4) Initiate OS shutdown.
(Driver will perform host control SMI when it is notified that the OS
has finished shutting down.)
Host Control SMI Type
The following table shows the value to write to host_control_smi_type to
perform a power cycle or power off host control action:
PowerEdge System Host Control SMI Type
---------------- ---------------------
300 HC_SMITYPE_TYPE1
1300 HC_SMITYPE_TYPE1
1400 HC_SMITYPE_TYPE2
500SC HC_SMITYPE_TYPE2
1500SC HC_SMITYPE_TYPE2
1550 HC_SMITYPE_TYPE2
600SC HC_SMITYPE_TYPE2
1600SC HC_SMITYPE_TYPE2
650 HC_SMITYPE_TYPE2
1655MC HC_SMITYPE_TYPE2
700 HC_SMITYPE_TYPE3
750 HC_SMITYPE_TYPE3

View File

@ -0,0 +1,74 @@
Purpose:
Demonstrate the usage of the new open sourced rbu (Remote BIOS Update) driver
for updating BIOS images on Dell servers and desktops.
Scope:
This document discusses the functionality of the rbu driver only.
It does not cover the support needed from aplications to enable the BIOS to
update itself with the image downloaded in to the memory.
Overview:
This driver works with Dell OpenManage or Dell Update Packages for updating
the BIOS on Dell servers (starting from servers sold since 1999), desktops
and notebooks (starting from those sold in 2005).
Please go to http://support.dell.com register and you can find info on
OpenManage and Dell Update packages (DUP).
Dell_RBU driver supports BIOS update using the monilothic image and packetized
image methods. In case of moniolithic the driver allocates a contiguous chunk
of physical pages having the BIOS image. In case of packetized the app
using the driver breaks the image in to packets of fixed sizes and the driver
would place each packet in contiguous physical memory. The driver also
maintains a link list of packets for reading them back.
If the dell_rbu driver is unloaded all the allocated memory is freed.
The rbu driver needs to have an application which will inform the BIOS to
enable the update in the next system reboot.
The user should not unload the rbu driver after downloading the BIOS image
or updating.
The driver load creates the following directories under the /sys file system.
/sys/class/firmware/dell_rbu/loading
/sys/class/firmware/dell_rbu/data
/sys/devices/platform/dell_rbu/image_type
/sys/devices/platform/dell_rbu/data
The driver supports two types of update mechanism; monolithic and packetized.
These update mechanism depends upon the BIOS currently running on the system.
Most of the Dell systems support a monolithic update where the BIOS image is
copied to a single contiguous block of physical memory.
In case of packet mechanism the single memory can be broken in smaller chuks
of contiguous memory and the BIOS image is scattered in these packets.
By default the driver uses monolithic memory for the update type. This can be
changed to contiguous during the driver load time by specifying the load
parameter image_type=packet. This can also be changed later as below
echo packet > /sys/devices/platform/dell_rbu/image_type
Do the steps below to download the BIOS image.
1) echo 1 > /sys/class/firmware/dell_rbu/loading
2) cp bios_image.hdr /sys/class/firmware/dell_rbu/data
3) echo 0 > /sys/class/firmware/dell_rbu/loading
The /sys/class/firmware/dell_rbu/ entries will remain till the following is
done.
echo -1 > /sys/class/firmware/dell_rbu/loading
Until this step is completed the drivr cannot be unloaded.
Also the driver provides /sys/devices/platform/dell_rbu/data readonly file to
read back the image downloaded. This is useful in case of packet update
mechanism where the above steps 1,2,3 will repeated for every packet.
By reading the /sys/devices/platform/dell_rbu/data file all packet data
downloaded can be verified in a single file.
The packets are arranged in this file one after the other in a FIFO order.
NOTE:
This driver requires a patch for firmware_class.c which has the addition
of request_firmware_nowait_nohotplug function to wortk
Also after updating the BIOS image an user mdoe application neeeds to execute
code which message the BIOS update request to the BIOS. So on the next reboot
the BIOS knows about the new image downloaded and it updates it self.
Also don't unload the rbu drive if the image has to be updated.

View File

@ -1,55 +1,74 @@
How to get the Nebula Electronics DigiTV, Pinnacle PCTV Sat, Twinhan DST + clones working
=========================================================================================
How to get the Nebula, PCTV and Twinhan DST cards working
=========================================================
1) General information
======================
This class of cards has a bt878a as the PCI interface, and
require the bttv driver.
This class of cards has a bt878a chip as the PCI interface.
The different card drivers require the bttv driver to provide the means
to access the i2c bus and the gpio pins of the bt8xx chipset.
Please pay close attention to the warning about the bttv module
options below for the DST card.
2) Compilation rules for Kernel >= 2.6.12
=========================================
1) General informations
=======================
Enable the following options:
These drivers require the bttv driver to provide the means to access
the i2c bus and the gpio pins of the bt8xx chipset.
Because of this, you need to enable
"Device drivers" => "Multimedia devices"
=> "Video For Linux" => "BT848 Video For Linux"
"Device drivers" => "Multimedia devices" => "Digital Video Broadcasting Devices"
=> "DVB for Linux" "DVB Core Support" "Nebula/Pinnacle PCTV/TwinHan PCI Cards"
3) Loading Modules, described by two approaches
===============================================
Furthermore you need to enable
"Device drivers" => "Multimedia devices" => "Digital Video Broadcasting Devices"
=> "DVB for Linux" "DVB Core Support" "BT8xx based PCI cards"
2) Loading Modules
==================
In general you need to load the bttv driver, which will handle the gpio and
i2c communication for us, plus the common dvb-bt8xx device driver,
which is called the backend.
The frontends for Nebula DigiTV (nxt6000), Pinnacle PCTV Sat (cx24110),
TwinHan DST + clones (dst and dst-ca) are loaded automatically by the backend.
For further details about TwinHan DST + clones see /Documentation/dvb/ci.txt.
i2c communication for us, plus the common dvb-bt8xx device driver.
The frontends for Nebula (nxt6000), Pinnacle PCTV (cx24110) and
TwinHan (dst) are loaded automatically by the dvb-bt8xx device driver.
3a) The manual approach
-----------------------
Loading modules:
modprobe bttv
modprobe dvb-bt8xx
Unloading modules:
modprobe -r dvb-bt8xx
modprobe -r bttv
3b) The automatic approach
3a) Nebula / Pinnacle PCTV
--------------------------
If not already done by installation, place a line either in
/etc/modules.conf or in /etc/modprobe.conf containing this text:
alias char-major-81 bttv
$ modprobe bttv (normally bttv is being loaded automatically by kmod)
$ modprobe dvb-bt8xx (or just place dvb-bt8xx in /etc/modules for automatic loading)
Then place a line in /etc/modules containing this text:
dvb-bt8xx
Reboot your system and have fun!
3b) TwinHan and Clones
--------------------------
$ modprobe bttv i2c_hw=1 card=0x71
$ modprobe dvb-bt8xx
$ modprobe dst
The value 0x71 will override the PCI type detection for dvb-bt8xx,
which is necessary for TwinHan cards.
If you're having an older card (blue color circuit) and card=0x71 locks
your machine, try using 0x68, too. If that does not work, ask on the
mailing list.
The DST module takes a couple of useful parameters.
verbose takes values 0 to 4. These values control the verbosity level,
and can be used to debug also.
verbose=0 means complete disabling of messages
1 only error messages are displayed
2 notifications are also displayed
3 informational messages are also displayed
4 debug setting
dst_addons takes values 0 and 0x20. A value of 0 means it is a FTA card.
0x20 means it has a Conditional Access slot.
The autodected values are determined bythe cards 'response
string' which you can see in your logs e.g.
dst_get_device_id: Recognise [DSTMCI]
--
Authors: Richard Walker, Jamie Honan, Michael Hunold, Manu Abraham, Uwe Bugla
Authors: Richard Walker, Jamie Honan, Michael Hunold, Manu Abraham

View File

@ -23,7 +23,6 @@ This application requires the following to function properly as of now.
eg: $ szap -c channels.conf -r "TMC" -x
(b) a channels.conf containing a valid PMT PID
eg: TMC:11996:h:0:27500:278:512:650:321
here 278 is a valid PMT PID. the rest of the values are the
@ -31,13 +30,7 @@ This application requires the following to function properly as of now.
(c) after running a szap, you have to run ca_zap, for the
descrambler to function,
eg: $ ca_zap patched_channels.conf "TMC"
The patched means a patch to apply to scan, such that scan can
generate a channels.conf_with pmt, which has this PMT PID info
(NOTE: szap cannot use this channels.conf with the PMT_PID)
eg: $ ca_zap channels.conf "TMC"
(d) Hopeflly Enjoy your favourite subscribed channel as you do with
a FTA card.

View File

@ -7,7 +7,7 @@ To protect itself the kernel has to verify this address.
In older versions of Linux this was done with the
int verify_area(int type, const void * addr, unsigned long size)
function.
function (which has since been replaced by access_ok()).
This function verified that the memory area starting at address
addr and of size size was accessible for the operation specified

View File

@ -0,0 +1,14 @@
Bugs
====
I currently don't know of any bug. Please do send reports to:
- linux-fbdev-devel@lists.sourceforge.net
- Knut_Petersen@t-online.de.
Untested features
=================
All LCD stuff is untested. If it worked in tridentfb, it should work in
cyblafb. Please test and report the results to Knut_Petersen@t-online.de.

View File

@ -0,0 +1,7 @@
Thanks to
=========
* Alan Hourihane, for writing the X trident driver
* Jani Monoses, for writing the tridentfb driver
* Antonino A. Daplas, for review of the first published
version of cyblafb and some code
* Jochen Hein, for testing and a helpfull bug report

View File

@ -0,0 +1,17 @@
Available Documentation
=======================
Apollo PLE 133 Chipset VT8601A North Bridge Datasheet, Rev. 1.82, October 22,
2001, available from VIA:
http://www.viavpsd.com/product/6/15/DS8601A182.pdf
The datasheet is incomplete, some registers that need to be programmed are not
explained at all and important bits are listed as "reserved". But you really
need the datasheet to understand the code. "p. xxx" comments refer to page
numbers of this document.
XFree/XOrg drivers are available and of good quality, looking at the code
there is a good idea if the datasheet does not provide enough information
or if the datasheet seems to be wrong.

View File

@ -0,0 +1,155 @@
#
# Sample fb.modes file
#
# Provides an incomplete list of working modes for
# the cyberblade/i1 graphics core.
#
# The value 4294967256 is used instead of -40. Of course, -40 is not
# a really reasonable value, but chip design does not always follow
# logic. Believe me, it's ok, and it's the way the BIOS does it.
#
# fbset requires 4294967256 in fb.modes and -40 as an argument to
# the -t parameter. That's also not too reasonable, and it might change
# in the future or might even be differt for your current version.
#
mode "640x480-50"
geometry 640 480 640 3756 8
timings 47619 4294967256 24 17 0 216 3
endmode
mode "640x480-60"
geometry 640 480 640 3756 8
timings 39682 4294967256 24 17 0 216 3
endmode
mode "640x480-70"
geometry 640 480 640 3756 8
timings 34013 4294967256 24 17 0 216 3
endmode
mode "640x480-72"
geometry 640 480 640 3756 8
timings 33068 4294967256 24 17 0 216 3
endmode
mode "640x480-75"
geometry 640 480 640 3756 8
timings 31746 4294967256 24 17 0 216 3
endmode
mode "640x480-80"
geometry 640 480 640 3756 8
timings 29761 4294967256 24 17 0 216 3
endmode
mode "640x480-85"
geometry 640 480 640 3756 8
timings 28011 4294967256 24 17 0 216 3
endmode
mode "800x600-50"
geometry 800 600 800 3221 8
timings 30303 96 24 14 0 136 11
endmode
mode "800x600-60"
geometry 800 600 800 3221 8
timings 25252 96 24 14 0 136 11
endmode
mode "800x600-70"
geometry 800 600 800 3221 8
timings 21645 96 24 14 0 136 11
endmode
mode "800x600-72"
geometry 800 600 800 3221 8
timings 21043 96 24 14 0 136 11
endmode
mode "800x600-75"
geometry 800 600 800 3221 8
timings 20202 96 24 14 0 136 11
endmode
mode "800x600-80"
geometry 800 600 800 3221 8
timings 18939 96 24 14 0 136 11
endmode
mode "800x600-85"
geometry 800 600 800 3221 8
timings 17825 96 24 14 0 136 11
endmode
mode "1024x768-50"
geometry 1024 768 1024 2815 8
timings 19054 144 24 29 0 120 3
endmode
mode "1024x768-60"
geometry 1024 768 1024 2815 8
timings 15880 144 24 29 0 120 3
endmode
mode "1024x768-70"
geometry 1024 768 1024 2815 8
timings 13610 144 24 29 0 120 3
endmode
mode "1024x768-72"
geometry 1024 768 1024 2815 8
timings 13232 144 24 29 0 120 3
endmode
mode "1024x768-75"
geometry 1024 768 1024 2815 8
timings 12703 144 24 29 0 120 3
endmode
mode "1024x768-80"
geometry 1024 768 1024 2815 8
timings 11910 144 24 29 0 120 3
endmode
mode "1024x768-85"
geometry 1024 768 1024 2815 8
timings 11209 144 24 29 0 120 3
endmode
mode "1280x1024-50"
geometry 1280 1024 1280 2662 8
timings 11114 232 16 39 0 160 3
endmode
mode "1280x1024-60"
geometry 1280 1024 1280 2662 8
timings 9262 232 16 39 0 160 3
endmode
mode "1280x1024-70"
geometry 1280 1024 1280 2662 8
timings 7939 232 16 39 0 160 3
endmode
mode "1280x1024-72"
geometry 1280 1024 1280 2662 8
timings 7719 232 16 39 0 160 3
endmode
mode "1280x1024-75"
geometry 1280 1024 1280 2662 8
timings 7410 232 16 39 0 160 3
endmode
mode "1280x1024-80"
geometry 1280 1024 1280 2662 8
timings 6946 232 16 39 0 160 3
endmode
mode "1280x1024-85"
geometry 1280 1024 1280 2662 8
timings 6538 232 16 39 0 160 3
endmode

View File

@ -0,0 +1,80 @@
Speed
=====
CyBlaFB is much faster than tridentfb and vesafb. Compare the performance data
for mode 1280x1024-[8,16,32]@61 Hz.
Test 1: Cat a file with 2000 lines of 0 characters.
Test 2: Cat a file with 2000 lines of 80 characters.
Test 3: Cat a file with 2000 lines of 160 characters.
All values show system time use in seconds, kernel 2.6.12 was used for
the measurements. 2.6.13 is a bit slower, 2.6.14 hopefully will include a
patch that speeds up kernel bitblitting a lot ( > 20%).
+-----------+-----------------------------------------------------+
| | not accelerated |
| TRIDENTFB +-----------------+-----------------+-----------------+
| of 2.6.12 | 8 bpp | 16 bpp | 32 bpp |
| | noypan | ypan | noypan | ypan | noypan | ypan |
+-----------+--------+--------+--------+--------+--------+--------+
| Test 1 | 4.31 | 4.33 | 6.05 | 12.81 | ---- | ---- |
| Test 2 | 67.94 | 5.44 | 123.16 | 14.79 | ---- | ---- |
| Test 3 | 131.36 | 6.55 | 240.12 | 16.76 | ---- | ---- |
+-----------+--------+--------+--------+--------+--------+--------+
| Comments | | | completely bro- |
| | | | ken, monitor |
| | | | switches off |
+-----------+-----------------+-----------------+-----------------+
+-----------+-----------------------------------------------------+
| | accelerated |
| TRIDENTFB +-----------------+-----------------+-----------------+
| of 2.6.12 | 8 bpp | 16 bpp | 32 bpp |
| | noypan | ypan | noypan | ypan | noypan | ypan |
+-----------+--------+--------+--------+--------+--------+--------+
| Test 1 | ---- | ---- | 20.62 | 1.22 | ---- | ---- |
| Test 2 | ---- | ---- | 22.61 | 3.19 | ---- | ---- |
| Test 3 | ---- | ---- | 24.59 | 5.16 | ---- | ---- |
+-----------+--------+--------+--------+--------+--------+--------+
| Comments | broken, writing | broken, ok only | completely bro- |
| | to wrong places | if bgcolor is | ken, monitor |
| | on screen + bug | black, bug in | switches off |
| | in fillrect() | fillrect() | |
+-----------+-----------------+-----------------+-----------------+
+-----------+-----------------------------------------------------+
| | not accelerated |
| VESAFB +-----------------+-----------------+-----------------+
| of 2.6.12 | 8 bpp | 16 bpp | 32 bpp |
| | noypan | ypan | noypan | ypan | noypan | ypan |
+-----------+--------+--------+--------+--------+--------+--------+
| Test 1 | 4.26 | 3.76 | 5.99 | 7.23 | ---- | ---- |
| Test 2 | 65.65 | 4.89 | 120.88 | 9.08 | ---- | ---- |
| Test 3 | 126.91 | 5.94 | 235.77 | 11.03 | ---- | ---- |
+-----------+--------+--------+--------+--------+--------+--------+
| Comments | vga=0x307 | vga=0x31a | vga=0x31b not |
| | fh=80kHz | fh=80kHz | supported by |
| | fv=75kHz | fv=75kHz | video BIOS and |
| | | | hardware |
+-----------+-----------------+-----------------+-----------------+
+-----------+-----------------------------------------------------+
| | accelerated |
| CYBLAFB +-----------------+-----------------+-----------------+
| | 8 bpp | 16 bpp | 32 bpp |
| | noypan | ypan | noypan | ypan | noypan | ypan |
+-----------+--------+--------+--------+--------+--------+--------+
| Test 1 | 8.02 | 0.23 | 19.04 | 0.61 | 57.12 | 2.74 |
| Test 2 | 8.38 | 0.55 | 19.39 | 0.92 | 57.54 | 3.13 |
| Test 3 | 8.73 | 0.86 | 19.74 | 1.24 | 57.95 | 3.51 |
+-----------+--------+--------+--------+--------+--------+--------+
| Comments | | | |
| | | | |
| | | | |
| | | | |
+-----------+-----------------+-----------------+-----------------+

View File

@ -0,0 +1,32 @@
TODO / Missing features
=======================
Verify LCD stuff "stretch" and "center" options are
completely untested ... this code needs to be
verified. As I don't have access to such
hardware, please contact me if you are
willing run some tests.
Interlaced video modes The reason that interleaved
modes are disabled is that I do not know
the meaning of the vertical interlace
parameter. Also the datasheet mentions a
bit d8 of a horizontal interlace parameter,
but nowhere the lower 8 bits. Please help
if you can.
low-res double scan modes Who needs it?
accelerated color blitting Who needs it? The console driver does use color
blitting for nothing but drawing the penguine,
everything else is done using color expanding
blitting of 1bpp character bitmaps.
xpanning Who needs it?
ioctls Who needs it?
TV-out Will be done later
??? Feel free to contact me if you have any
feature requests

View File

@ -0,0 +1,206 @@
CyBlaFB is a framebuffer driver for the Cyberblade/i1 graphics core integrated
into the VIA Apollo PLE133 (aka vt8601) south bridge. It is developed and
tested using a VIA EPIA 5000 board.
Cyblafb - compiled into the kernel or as a module?
==================================================
You might compile cyblafb either as a module or compile it permanently into the
kernel.
Unless you have a real reason to do so you should not compile both vesafb and
cyblafb permanently into the kernel. It's possible and it helps during the
developement cycle, but it's useless and will at least block some otherwise
usefull memory for ordinary users.
Selecting Modes
===============
Startup Mode
============
First of all, you might use the "vga=???" boot parameter as it is
documented in vesafb.txt and svga.txt. Cyblafb will detect the video
mode selected and will use the geometry and timings found by
inspecting the hardware registers.
video=cyblafb vga=0x317
Alternatively you might use a combination of the mode, ref and bpp
parameters. If you compiled the driver into the kernel, add something
like this to the kernel command line:
video=cyblafb:1280x1024,bpp=16,ref=50 ...
If you compiled the driver as a module, the same mode would be
selected by the following command:
modprobe cyblafb mode=1280x1024 bpp=16 ref=50 ...
None of the modes possible to select as startup modes are affected by
the problems described at the end of the next subsection.
Mode changes using fbset
========================
You might use fbset to change the video mode, see "man fbset". Cyblafb
generally does assume that you know what you are doing. But it does
some checks, especially those that are needed to prevent you from
damaging your hardware.
- only 8, 16, 24 and 32 bpp video modes are accepted
- interlaced video modes are not accepted
- double scan video modes are not accepted
- if a flat panel is found, cyblafb does not allow you
to program a resolution higher than the physical
resolution of the flat panel monitor
- cyblafb does not allow xres to differ from xres_virtual
- cyblafb does not allow vclk to exceed 230 MHz. As 32 bpp
and (currently) 24 bit modes use a doubled vclk internally,
the dotclock limit as seen by fbset is 115 MHz for those
modes and 230 MHz for 8 and 16 bpp modes.
Any request that violates the rules given above will be ignored and
fbset will return an error.
If you program a virtual y resolution higher than the hardware limit,
cyblafb will silently decrease that value to the highest possible
value.
Attempts to disable acceleration are ignored.
Some video modes that should work do not work as expected. If you use
the standard fb.modes, fbset 640x480-60 will program that mode, but
you will see a vertical area, about two characters wide, with only
much darker characters than the other characters on the screen.
Cyblafb does allow that mode to be set, as it does not violate the
official specifications. It would need a lot of code to reliably sort
out all invalid modes, playing around with the margin values will
give a valid mode quickly. And if cyblafb would detect such an invalid
mode, should it silently alter the requested values or should it
report an error? Both options have some pros and cons. As stated
above, none of the startup modes are affected, and if you set
verbosity to 1 or higher, cyblafb will print the fbset command that
would be needed to program that mode using fbset.
Other Parameters
================
crt don't autodetect, assume monitor connected to
standard VGA connector
fp don't autodetect, assume flat panel display
connected to flat panel monitor interface
nativex inform driver about native x resolution of
flat panel monitor connected to special
interface (should be autodetected)
stretch stretch image to adapt low resolution modes to
higer resolutions of flat panel monitors
connected to special interface
center center image to adapt low resolution modes to
higer resolutions of flat panel monitors
connected to special interface
memsize use if autodetected memsize is wrong ...
should never be necessary
nopcirr disable PCI read retry
nopciwr disable PCI write retry
nopcirb disable PCI read bursts
nopciwb disable PCI write bursts
bpp bpp for specified modes
valid values: 8 || 16 || 24 || 32
ref refresh rate for specified mode
valid values: 50 <= ref <= 85
mode 640x480 or 800x600 or 1024x768 or 1280x1024
if not specified, the startup mode will be detected
and used, so you might also use the vga=??? parameter
described in vesafb.txt. If you do not specify a mode,
bpp and ref parameters are ignored.
verbosity 0 is the default, increase to at least 2 for every
bug report!
vesafb allows cyblafb to be loaded after vesafb has been
loaded. See sections "Module unloading ...".
Development hints
=================
It's much faster do compile a module and to load the new version after
unloading the old module than to compile a new kernel and to reboot. So if you
try to work on cyblafb, it might be a good idea to use cyblafb as a module.
In real life, fast often means dangerous, and that's also the case here. If
you introduce a serious bug when cyblafb is compiled into the kernel, the
kernel will lock or oops with a high probability before the file system is
mounted, and the danger for your data is low. If you load a broken own version
of cyblafb on a running system, the danger for the integrity of the file
system is much higher as you might need a hard reset afterwards. Decide
yourself.
Module unloading, the vfb method
================================
If you want to unload/reload cyblafb using the virtual framebuffer, you need
to enable vfb support in the kernel first. After that, load the modules as
shown below:
modprobe vfb vfb_enable=1
modprobe fbcon
modprobe cyblafb
fbset -fb /dev/fb1 1280x1024-60 -vyres 2662
con2fb /dev/fb1 /dev/tty1
...
If you now made some changes to cyblafb and want to reload it, you might do it
as show below:
con2fb /dev/fb0 /dev/tty1
...
rmmod cyblafb
modprobe cyblafb
con2fb /dev/fb1 /dev/tty1
...
Of course, you might choose another mode, and most certainly you also want to
map some other /dev/tty* to the real framebuffer device. You might also choose
to compile fbcon as a kernel module or place it permanently in the kernel.
I do not know of any way to unload fbcon, and fbcon will prevent the
framebuffer device loaded first from unloading. [If there is a way, then
please add a description here!]
Module unloading, the vesafb method
===================================
Configure the kernel:
<*> Support for frame buffer devices
[*] VESA VGA graphics support
<M> Cyberblade/i1 support
Add e.g. "video=vesafb:ypan vga=0x307" to the kernel parameters. The ypan
parameter is important, choose any vga parameter you like as long as it is
a graphics mode.
After booting, load cyblafb without any mode and bpp parameter and assign
cyblafb to individual ttys using con2fb, e.g.:
modprobe cyblafb vesafb=1
con2fb /dev/fb1 /dev/tty1
Unloading cyblafb works without problems after you assign vesafb to all
ttys again, e.g.:
con2fb /dev/fb0 /dev/tty1
rmmod cyblafb

View File

@ -0,0 +1,85 @@
I tried the following framebuffer drivers:
- TRIDENTFB is full of bugs. Acceleration is broken for Blade3D
graphics cores like the cyberblade/i1. It claims to support a great
number of devices, but documentation for most of these devices is
unfortunately not available. There is _no_ reason to use tridentfb
for cyberblade/i1 + CRT users. VESAFB is faster, and the one
advantage, mode switching, is broken in tridentfb.
- VESAFB is used by many distributions as a standard. Vesafb does
not support mode switching. VESAFB is a bit faster than the working
configurations of TRIDENTFB, but it is still too slow, even if you
use ypan.
- EPIAFB (you'll find it on sourceforge) supports the Cyberblade/i1
graphics core, but it still has serious bugs and developement seems
to have stopped. This is the one driver with TV-out support. If you
do need this feature, try epiafb.
None of these drivers was a real option for me.
I believe that is unreasonable to change code that announces to support 20
devices if I only have more or less sufficient documentation for exactly one
of these. The risk of breaking device foo while fixing device bar is too high.
So I decided to start CyBlaFB as a stripped down tridentfb.
All code specific to other Trident chips has been removed. After that there
were a lot of cosmetic changes to increase the readability of the code. All
register names were changed to those mnemonics used in the datasheet. Function
and macro names were changed if they hindered easy understanding of the code.
After that I debugged the code and implemented some new features. I'll try to
give a little summary of the main changes:
- calculation of vertical and horizontal timings was fixed
- video signal quality has been improved dramatically
- acceleration:
- fillrect and copyarea were fixed and reenabled
- color expanding imageblit was newly implemented, color
imageblit (only used to draw the penguine) still uses the
generic code.
- init of the acceleration engine was improved and moved to a
place where it really works ...
- sync function has a timeout now and tries to reset and
reinit the accel engine if necessary
- fewer slow copyarea calls when doing ypan scrolling by using
undocumented bit d21 of screen start address stored in
CR2B[5]. BIOS does use it also, so this should be safe.
- cyblafb rejects any attempt to set modes that would cause vclk
values above reasonable 230 MHz. 32bit modes use a clock
multiplicator of 2, so fbset does show the correct values for
pixclock but not for vclk in this case. The fbset limit is 115 MHz
for 32 bpp modes.
- cyblafb rejects modes known to be broken or unimplemented (all
interlaced modes, all doublescan modes for now)
- cyblafb now works independant of the video mode in effect at startup
time (tridentfb does not init all needed registers to reasonable
values)
- switching between video modes does work reliably now
- the first video mode now is the one selected on startup using the
vga=???? mechanism or any of
- 640x480, 800x600, 1024x768, 1280x1024
- 8, 16, 24 or 32 bpp
- refresh between 50 Hz and 85 Hz, 1 Hz steps (1280x1024-32
is limited to 63Hz)
- pci retry and pci burst mode are settable (try to disable if you
experience latency problems)
- built as a module cyblafb might be unloaded and reloaded using
the vfb module and con2vt or might be used together with vesafb

View File

@ -5,6 +5,7 @@ Intel 810/815 Framebuffer driver
March 17, 2002
First Released: July 2001
Last Update: September 12, 2005
================================================================
A. Introduction
@ -44,6 +45,8 @@ B. Features
- Hardware Cursor Support
- Supports EDID probing either by DDC/I2C or through the BIOS
C. List of available options
a. "video=i810fb"
@ -52,14 +55,17 @@ C. List of available options
Recommendation: required
b. "xres:<value>"
select horizontal resolution in pixels
select horizontal resolution in pixels. (This parameter will be
ignored if 'mode_option' is specified. See 'o' below).
Recommendation: user preference
(default = 640)
c. "yres:<value>"
select vertical resolution in scanlines. If Discrete Video Timings
is enabled, this will be ignored and computed as 3*xres/4.
is enabled, this will be ignored and computed as 3*xres/4. (This
parameter will be ignored if 'mode_option' is specified. See 'o'
below)
Recommendation: user preference
(default = 480)
@ -86,7 +92,8 @@ C. List of available options
g. "hsync1/hsync2:<value>"
select the minimum and maximum Horizontal Sync Frequency of the
monitor in KHz. If a using a fixed frequency monitor, hsync1 must
be equal to hsync2.
be equal to hsync2. If EDID probing is successful, these will be
ignored and values will be taken from the EDID block.
Recommendation: check monitor manual for correct values
default (29/30)
@ -94,7 +101,8 @@ C. List of available options
h. "vsync1/vsync2:<value>"
select the minimum and maximum Vertical Sync Frequency of the monitor
in Hz. You can also use this option to lock your monitor's refresh
rate.
rate. If EDID probing is successful, these will be ignored and values
will be taken from the EDID block.
Recommendation: check monitor manual for correct values
(default = 60/60)
@ -154,6 +162,10 @@ C. List of available options
Recommendation: do not set
(default = not set)
o. <xres>x<yres>[-<bpp>][@<refresh>]
The driver will now accept specification of boot mode option. If this
is specified, the options 'xres' and 'yres' will be ignored. See
Documentation/fb/modedb.txt for usage.
D. Kernel booting
@ -176,7 +188,10 @@ will be computed based on the hsync1/hsync2 and vsync1/vsync2 values.
IMPORTANT:
You must include hsync1, hsync2, vsync1 and vsync2 to enable video modes
better than 640x480 at 60Hz.
better than 640x480 at 60Hz. HOWEVER, if your chipset/display combination
supports I2C and has an EDID block, you can safely exclude hsync1, hsync2,
vsync1 and vsync2 parameters. These parameters will be taken from the EDID
block.
E. Module options
@ -217,31 +232,20 @@ F. Setup
This is required. The option is under "Character Devices"
d. Under "Graphics Support", select "Intel 810/815" either statically
or as a module. Choose "use VESA GTF for video timings" if you
need to maximize the capability of your display. To be on the
or as a module. Choose "use VESA Generalized Timing Formula" if
you need to maximize the capability of your display. To be on the
safe side, you can leave this unselected.
e. If you want a framebuffer console, enable it under "Console
e. If you want support for DDC/I2C probing (Plug and Play Displays),
set 'Enable DDC Support' to 'y'. To make this option appear, set
'use VESA Generalized Timing Formula' to 'y'.
f. If you want a framebuffer console, enable it under "Console
Drivers"
f. Compile your kernel.
g. Compile your kernel.
g. Load the driver as described in section D and E.
Optional:
h. If you are going to run XFree86 with its native drivers, the
standard XFree86 4.1.0 and 4.2.0 drivers should work as is.
However, there's a bug in the XFree86 i810 drivers. It attempts
to use XAA even when switched to the console. This will crash
your server. I have a fix at this site:
http://i810fb.sourceforge.net.
You can either use the patch, or just replace
/usr/X11R6/lib/modules/drivers/i810_drv.o
with the one provided at the website.
h. Load the driver as described in section D and E.
i. Try the DirectFB (http://www.directfb.org) + the i810 gfxdriver
patch to see the chipset in action (or inaction :-).

View File

@ -20,12 +20,83 @@ in a video= option, fbmem considers that to be a global video mode option.
Valid mode specifiers (mode_option argument):
<xres>x<yres>[-<bpp>][@<refresh>]
<xres>x<yres>[M][R][-<bpp>][@<refresh>][i][m]
<name>[-<bpp>][@<refresh>]
with <xres>, <yres>, <bpp> and <refresh> decimal numbers and <name> a string.
Things between square brackets are optional.
If 'M' is specified in the mode_option argument (after <yres> and before
<bpp> and <refresh>, if specified) the timings will be calculated using
VESA(TM) Coordinated Video Timings instead of looking up the mode from a table.
If 'R' is specified, do a 'reduced blanking' calculation for digital displays.
If 'i' is specified, calculate for an interlaced mode. And if 'm' is
specified, add margins to the calculation (1.8% of xres rounded down to 8
pixels and 1.8% of yres).
Sample usage: 1024x768M@60m - CVT timing with margins
***** oOo ***** oOo ***** oOo ***** oOo ***** oOo ***** oOo ***** oOo *****
What is the VESA(TM) Coordinated Video Timings (CVT)?
From the VESA(TM) Website:
"The purpose of CVT is to provide a method for generating a consistent
and coordinated set of standard formats, display refresh rates, and
timing specifications for computer display products, both those
employing CRTs, and those using other display technologies. The
intention of CVT is to give both source and display manufacturers a
common set of tools to enable new timings to be developed in a
consistent manner that ensures greater compatibility."
This is the third standard approved by VESA(TM) concerning video timings. The
first was the Discrete Video Timings (DVT) which is a collection of
pre-defined modes approved by VESA(TM). The second is the Generalized Timing
Formula (GTF) which is an algorithm to calculate the timings, given the
pixelclock, the horizontal sync frequency, or the vertical refresh rate.
The GTF is limited by the fact that it is designed mainly for CRT displays.
It artificially increases the pixelclock because of its high blanking
requirement. This is inappropriate for digital display interface with its high
data rate which requires that it conserves the pixelclock as much as possible.
Also, GTF does not take into account the aspect ratio of the display.
The CVT addresses these limitations. If used with CRT's, the formula used
is a derivation of GTF with a few modifications. If used with digital
displays, the "reduced blanking" calculation can be used.
From the framebuffer subsystem perspective, new formats need not be added
to the global mode database whenever a new mode is released by display
manufacturers. Specifying for CVT will work for most, if not all, relatively
new CRT displays and probably with most flatpanels, if 'reduced blanking'
calculation is specified. (The CVT compatibility of the display can be
determined from its EDID. The version 1.3 of the EDID has extra 128-byte
blocks where additional timing information is placed. As of this time, there
is no support yet in the layer to parse this additional blocks.)
CVT also introduced a new naming convention (should be seen from dmesg output):
<pix>M<a>[-R]
where: pix = total amount of pixels in MB (xres x yres)
M = always present
a = aspect ratio (3 - 4:3; 4 - 5:4; 9 - 15:9, 16:9; A - 16:10)
-R = reduced blanking
example: .48M3-R - 800x600 with reduced blanking
Note: VESA(TM) has restrictions on what is a standard CVT timing:
- aspect ratio can only be one of the above values
- acceptable refresh rates are 50, 60, 70 or 85 Hz only
- if reduced blanking, the refresh rate must be at 60Hz
If one of the above are not satisfied, the kernel will print a warning but the
timings will still be calculated.
***** oOo ***** oOo ***** oOo ***** oOo ***** oOo ***** oOo ***** oOo *****
To find a suitable video mode, you just call
int __init fb_find_mode(struct fb_var_screeninfo *var,

View File

@ -17,23 +17,6 @@ Who: Greg Kroah-Hartman <greg@kroah.com>
---------------------------
What: ACPI S4bios support
When: May 2005
Why: Noone uses it, and it probably does not work, anyway. swsusp is
faster, more reliable, and people are actually using it.
Who: Pavel Machek <pavel@suse.cz>
---------------------------
What: PCI Name Database (CONFIG_PCI_NAMES)
When: July 2005
Why: It bloats the kernel unnecessarily, and is handled by userspace better
(pciutils supports it.) Will eliminate the need to try to keep the
pci.ids file in sync with the sf.net database all of the time.
Who: Greg Kroah-Hartman <gregkh@suse.de>
---------------------------
What: io_remap_page_range() (macro or function)
When: September 2005
Why: Replaced by io_remap_pfn_range() which allows more memory space
@ -51,14 +34,6 @@ Who: Adrian Bunk <bunk@stusta.de>
---------------------------
What: register_ioctl32_conversion() / unregister_ioctl32_conversion()
When: April 2005
Why: Replaced by ->compat_ioctl in file_operations and other method
vecors.
Who: Andi Kleen <ak@muc.de>, Christoph Hellwig <hch@lst.de>
---------------------------
What: RCU API moves to EXPORT_SYMBOL_GPL
When: April 2006
Files: include/linux/rcupdate.h, kernel/rcupdate.c
@ -74,14 +49,6 @@ Who: Paul E. McKenney <paulmck@us.ibm.com>
---------------------------
What: remove verify_area()
When: July 2006
Files: Various uaccess.h headers.
Why: Deprecated and redundant. access_ok() should be used instead.
Who: Jesper Juhl <juhl-lkml@dif.dk>
---------------------------
What: IEEE1394 Audio and Music Data Transmission Protocol driver,
Connection Management Procedures driver
When: November 2005
@ -102,16 +69,6 @@ Who: Jody McIntyre <scjody@steamballoon.com>
---------------------------
What: register_serial/unregister_serial
When: September 2005
Why: This interface does not allow serial ports to be registered against
a struct device, and as such does not allow correct power management
of such ports. 8250-based ports should use serial8250_register_port
and serial8250_unregister_port, or platform devices instead.
Who: Russell King <rmk@arm.linux.org.uk>
---------------------------
What: i2c sysfs name change: in1_ref, vid deprecated in favour of cpu0_vid
When: November 2005
Files: drivers/i2c/chips/adm1025.c, drivers/i2c/chips/adm1026.c
@ -135,3 +92,15 @@ Why: With the 16-bit PCMCIA subsystem now behaving (almost) like a
pcmciautils package available at
http://kernel.org/pub/linux/utils/kernel/pcmcia/
Who: Dominik Brodowski <linux@brodo.de>
---------------------------
What: ip_queue and ip6_queue (old ipv4-only and ipv6-only netfilter queue)
When: December 2005
Why: This interface has been obsoleted by the new layer3-independent
"nfnetlink_queue". The Kernel interface is compatible, so the old
ip[6]tables "QUEUE" targets still work and will transparently handle
all packets into nfnetlink queue number 0. Userspace users will have
to link against API-compatible library on top of libnfnetlink_queue
instead of the current 'libipq'.
Who: Harald Welte <laforge@netfilter.org>

View File

@ -0,0 +1,123 @@
File management in the Linux kernel
-----------------------------------
This document describes how locking for files (struct file)
and file descriptor table (struct files) works.
Up until 2.6.12, the file descriptor table has been protected
with a lock (files->file_lock) and reference count (files->count).
->file_lock protected accesses to all the file related fields
of the table. ->count was used for sharing the file descriptor
table between tasks cloned with CLONE_FILES flag. Typically
this would be the case for posix threads. As with the common
refcounting model in the kernel, the last task doing
a put_files_struct() frees the file descriptor (fd) table.
The files (struct file) themselves are protected using
reference count (->f_count).
In the new lock-free model of file descriptor management,
the reference counting is similar, but the locking is
based on RCU. The file descriptor table contains multiple
elements - the fd sets (open_fds and close_on_exec, the
array of file pointers, the sizes of the sets and the array
etc.). In order for the updates to appear atomic to
a lock-free reader, all the elements of the file descriptor
table are in a separate structure - struct fdtable.
files_struct contains a pointer to struct fdtable through
which the actual fd table is accessed. Initially the
fdtable is embedded in files_struct itself. On a subsequent
expansion of fdtable, a new fdtable structure is allocated
and files->fdtab points to the new structure. The fdtable
structure is freed with RCU and lock-free readers either
see the old fdtable or the new fdtable making the update
appear atomic. Here are the locking rules for
the fdtable structure -
1. All references to the fdtable must be done through
the files_fdtable() macro :
struct fdtable *fdt;
rcu_read_lock();
fdt = files_fdtable(files);
....
if (n <= fdt->max_fds)
....
...
rcu_read_unlock();
files_fdtable() uses rcu_dereference() macro which takes care of
the memory barrier requirements for lock-free dereference.
The fdtable pointer must be read within the read-side
critical section.
2. Reading of the fdtable as described above must be protected
by rcu_read_lock()/rcu_read_unlock().
3. For any update to the the fd table, files->file_lock must
be held.
4. To look up the file structure given an fd, a reader
must use either fcheck() or fcheck_files() APIs. These
take care of barrier requirements due to lock-free lookup.
An example :
struct file *file;
rcu_read_lock();
file = fcheck(fd);
if (file) {
...
}
....
rcu_read_unlock();
5. Handling of the file structures is special. Since the look-up
of the fd (fget()/fget_light()) are lock-free, it is possible
that look-up may race with the last put() operation on the
file structure. This is avoided using the rcuref APIs
on ->f_count :
rcu_read_lock();
file = fcheck_files(files, fd);
if (file) {
if (rcuref_inc_lf(&file->f_count))
*fput_needed = 1;
else
/* Didn't get the reference, someone's freed */
file = NULL;
}
rcu_read_unlock();
....
return file;
rcuref_inc_lf() detects if refcounts is already zero or
goes to zero during increment. If it does, we fail
fget()/fget_light().
6. Since both fdtable and file structures can be looked up
lock-free, they must be installed using rcu_assign_pointer()
API. If they are looked up lock-free, rcu_dereference()
must be used. However it is advisable to use files_fdtable()
and fcheck()/fcheck_files() which take care of these issues.
7. While updating, the fdtable pointer must be looked up while
holding files->file_lock. If ->file_lock is dropped, then
another thread expand the files thereby creating a new
fdtable and making the earlier fdtable pointer stale.
For example :
spin_lock(&files->file_lock);
fd = locate_fd(files, file, start);
if (fd >= 0) {
/* locate_fd() may have expanded fdtable, load the ptr */
fdt = files_fdtable(files);
FD_SET(fd, fdt->open_fds);
FD_CLR(fd, fdt->close_on_exec);
spin_unlock(&files->file_lock);
.....
Since locate_fd() can drop ->file_lock (and reacquire ->file_lock),
the fdtable pointer (fdt) must be loaded after locate_fd().

View File

@ -0,0 +1,315 @@
Definitions
~~~~~~~~~~~
Userspace filesystem:
A filesystem in which data and metadata are provided by an ordinary
userspace process. The filesystem can be accessed normally through
the kernel interface.
Filesystem daemon:
The process(es) providing the data and metadata of the filesystem.
Non-privileged mount (or user mount):
A userspace filesystem mounted by a non-privileged (non-root) user.
The filesystem daemon is running with the privileges of the mounting
user. NOTE: this is not the same as mounts allowed with the "user"
option in /etc/fstab, which is not discussed here.
Mount owner:
The user who does the mounting.
User:
The user who is performing filesystem operations.
What is FUSE?
~~~~~~~~~~~~~
FUSE is a userspace filesystem framework. It consists of a kernel
module (fuse.ko), a userspace library (libfuse.*) and a mount utility
(fusermount).
One of the most important features of FUSE is allowing secure,
non-privileged mounts. This opens up new possibilities for the use of
filesystems. A good example is sshfs: a secure network filesystem
using the sftp protocol.
The userspace library and utilities are available from the FUSE
homepage:
http://fuse.sourceforge.net/
Mount options
~~~~~~~~~~~~~
'fd=N'
The file descriptor to use for communication between the userspace
filesystem and the kernel. The file descriptor must have been
obtained by opening the FUSE device ('/dev/fuse').
'rootmode=M'
The file mode of the filesystem's root in octal representation.
'user_id=N'
The numeric user id of the mount owner.
'group_id=N'
The numeric group id of the mount owner.
'default_permissions'
By default FUSE doesn't check file access permissions, the
filesystem is free to implement it's access policy or leave it to
the underlying file access mechanism (e.g. in case of network
filesystems). This option enables permission checking, restricting
access based on file mode. This is option is usually useful
together with the 'allow_other' mount option.
'allow_other'
This option overrides the security measure restricting file access
to the user mounting the filesystem. This option is by default only
allowed to root, but this restriction can be removed with a
(userspace) configuration option.
'max_read=N'
With this option the maximum size of read operations can be set.
The default is infinite. Note that the size of read requests is
limited anyway to 32 pages (which is 128kbyte on i386).
How do non-privileged mounts work?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Since the mount() system call is a privileged operation, a helper
program (fusermount) is needed, which is installed setuid root.
The implication of providing non-privileged mounts is that the mount
owner must not be able to use this capability to compromise the
system. Obvious requirements arising from this are:
A) mount owner should not be able to get elevated privileges with the
help of the mounted filesystem
B) mount owner should not get illegitimate access to information from
other users' and the super user's processes
C) mount owner should not be able to induce undesired behavior in
other users' or the super user's processes
How are requirements fulfilled?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A) The mount owner could gain elevated privileges by either:
1) creating a filesystem containing a device file, then opening
this device
2) creating a filesystem containing a suid or sgid application,
then executing this application
The solution is not to allow opening device files and ignore
setuid and setgid bits when executing programs. To ensure this
fusermount always adds "nosuid" and "nodev" to the mount options
for non-privileged mounts.
B) If another user is accessing files or directories in the
filesystem, the filesystem daemon serving requests can record the
exact sequence and timing of operations performed. This
information is otherwise inaccessible to the mount owner, so this
counts as an information leak.
The solution to this problem will be presented in point 2) of C).
C) There are several ways in which the mount owner can induce
undesired behavior in other users' processes, such as:
1) mounting a filesystem over a file or directory which the mount
owner could otherwise not be able to modify (or could only
make limited modifications).
This is solved in fusermount, by checking the access
permissions on the mountpoint and only allowing the mount if
the mount owner can do unlimited modification (has write
access to the mountpoint, and mountpoint is not a "sticky"
directory)
2) Even if 1) is solved the mount owner can change the behavior
of other users' processes.
i) It can slow down or indefinitely delay the execution of a
filesystem operation creating a DoS against the user or the
whole system. For example a suid application locking a
system file, and then accessing a file on the mount owner's
filesystem could be stopped, and thus causing the system
file to be locked forever.
ii) It can present files or directories of unlimited length, or
directory structures of unlimited depth, possibly causing a
system process to eat up diskspace, memory or other
resources, again causing DoS.
The solution to this as well as B) is not to allow processes
to access the filesystem, which could otherwise not be
monitored or manipulated by the mount owner. Since if the
mount owner can ptrace a process, it can do all of the above
without using a FUSE mount, the same criteria as used in
ptrace can be used to check if a process is allowed to access
the filesystem or not.
Note that the ptrace check is not strictly necessary to
prevent B/2/i, it is enough to check if mount owner has enough
privilege to send signal to the process accessing the
filesystem, since SIGSTOP can be used to get a similar effect.
I think these limitations are unacceptable?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If a sysadmin trusts the users enough, or can ensure through other
measures, that system processes will never enter non-privileged
mounts, it can relax the last limitation with a "user_allow_other"
config option. If this config option is set, the mounting user can
add the "allow_other" mount option which disables the check for other
users' processes.
Kernel - userspace interface
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The following diagram shows how a filesystem operation (in this
example unlink) is performed in FUSE.
NOTE: everything in this description is greatly simplified
| "rm /mnt/fuse/file" | FUSE filesystem daemon
| |
| | >sys_read()
| | >fuse_dev_read()
| | >request_wait()
| | [sleep on fc->waitq]
| |
| >sys_unlink() |
| >fuse_unlink() |
| [get request from |
| fc->unused_list] |
| >request_send() |
| [queue req on fc->pending] |
| [wake up fc->waitq] | [woken up]
| >request_wait_answer() |
| [sleep on req->waitq] |
| | <request_wait()
| | [remove req from fc->pending]
| | [copy req to read buffer]
| | [add req to fc->processing]
| | <fuse_dev_read()
| | <sys_read()
| |
| | [perform unlink]
| |
| | >sys_write()
| | >fuse_dev_write()
| | [look up req in fc->processing]
| | [remove from fc->processing]
| | [copy write buffer to req]
| [woken up] | [wake up req->waitq]
| | <fuse_dev_write()
| | <sys_write()
| <request_wait_answer() |
| <request_send() |
| [add request to |
| fc->unused_list] |
| <fuse_unlink() |
| <sys_unlink() |
There are a couple of ways in which to deadlock a FUSE filesystem.
Since we are talking about unprivileged userspace programs,
something must be done about these.
Scenario 1 - Simple deadlock
-----------------------------
| "rm /mnt/fuse/file" | FUSE filesystem daemon
| |
| >sys_unlink("/mnt/fuse/file") |
| [acquire inode semaphore |
| for "file"] |
| >fuse_unlink() |
| [sleep on req->waitq] |
| | <sys_read()
| | >sys_unlink("/mnt/fuse/file")
| | [acquire inode semaphore
| | for "file"]
| | *DEADLOCK*
The solution for this is to allow requests to be interrupted while
they are in userspace:
| [interrupted by signal] |
| <fuse_unlink() |
| [release semaphore] | [semaphore acquired]
| <sys_unlink() |
| | >fuse_unlink()
| | [queue req on fc->pending]
| | [wake up fc->waitq]
| | [sleep on req->waitq]
If the filesystem daemon was single threaded, this will stop here,
since there's no other thread to dequeue and execute the request.
In this case the solution is to kill the FUSE daemon as well. If
there are multiple serving threads, you just have to kill them as
long as any remain.
Moral: a filesystem which deadlocks, can soon find itself dead.
Scenario 2 - Tricky deadlock
----------------------------
This one needs a carefully crafted filesystem. It's a variation on
the above, only the call back to the filesystem is not explicit,
but is caused by a pagefault.
| Kamikaze filesystem thread 1 | Kamikaze filesystem thread 2
| |
| [fd = open("/mnt/fuse/file")] | [request served normally]
| [mmap fd to 'addr'] |
| [close fd] | [FLUSH triggers 'magic' flag]
| [read a byte from addr] |
| >do_page_fault() |
| [find or create page] |
| [lock page] |
| >fuse_readpage() |
| [queue READ request] |
| [sleep on req->waitq] |
| | [read request to buffer]
| | [create reply header before addr]
| | >sys_write(addr - headerlength)
| | >fuse_dev_write()
| | [look up req in fc->processing]
| | [remove from fc->processing]
| | [copy write buffer to req]
| | >do_page_fault()
| | [find or create page]
| | [lock page]
| | * DEADLOCK *
Solution is again to let the the request be interrupted (not
elaborated further).
An additional problem is that while the write buffer is being
copied to the request, the request must not be interrupted. This
is because the destination address of the copy may not be valid
after the request is interrupted.
This is solved with doing the copy atomically, and allowing
interruption while the page(s) belonging to the write buffer are
faulted with get_user_pages(). The 'req->locked' flag indicates
when the copy is taking place, and interruption is delayed until
this flag is unset.

View File

@ -439,6 +439,18 @@ ChangeLog
Note, a technical ChangeLog aimed at kernel hackers is in fs/ntfs/ChangeLog.
2.1.24:
- Support journals ($LogFile) which have been modified by chkdsk. This
means users can boot into Windows after we marked the volume dirty.
The Windows boot will run chkdsk and then reboot. The user can then
immediately boot into Linux rather than having to do a full Windows
boot first before rebooting into Linux and we will recognize such a
journal and empty it as it is clean by definition.
- Support journals ($LogFile) with only one restart page as well as
journals with two different restart pages. We sanity check both and
either use the only sane one or the more recent one of the two in the
case that both are valid.
- Lots of bug fixes and enhancements across the board.
2.1.23:
- Stamp the user space journal, aka transaction log, aka $UsnJrnl, if
it is present and active thus telling Windows and applications using

View File

@ -133,6 +133,7 @@ Table 1-1: Process specific entries in /proc
statm Process memory status information
status Process status in human readable form
wchan If CONFIG_KALLSYMS is set, a pre-decoded wchan
smaps Extension based on maps, presenting the rss size for each mapped file
..............................................................................
For example, to get the status information of a process, all you have to do is
@ -1240,16 +1241,38 @@ swap-intensive.
overcommit_memory
-----------------
This file contains one value. The following algorithm is used to decide if
there's enough memory: if the value of overcommit_memory is positive, then
there's always enough memory. This is a useful feature, since programs often
malloc() huge amounts of memory 'just in case', while they only use a small
part of it. Leaving this value at 0 will lead to the failure of such a huge
malloc(), when in fact the system has enough memory for the program to run.
Controls overcommit of system memory, possibly allowing processes
to allocate (but not use) more memory than is actually available.
On the other hand, enabling this feature can cause you to run out of memory
and thrash the system to death, so large and/or important servers will want to
set this value to 0.
0 - Heuristic overcommit handling. Obvious overcommits of
address space are refused. Used for a typical system. It
ensures a seriously wild allocation fails while allowing
overcommit to reduce swap usage. root is allowed to
allocate slighly more memory in this mode. This is the
default.
1 - Always overcommit. Appropriate for some scientific
applications.
2 - Don't overcommit. The total address space commit
for the system is not permitted to exceed swap plus a
configurable percentage (default is 50) of physical RAM.
Depending on the percentage you use, in most situations
this means a process will not be killed while attempting
to use already-allocated memory but will receive errors
on memory allocation as appropriate.
overcommit_ratio
----------------
Percentage of physical memory size to include in overcommit calculations
(see above.)
Memory allocation limit = swapspace + physmem * (overcommit_ratio / 100)
swapspace = total size of all swap areas
physmem = size of physical memory in system
nr_hugepages and hugetlb_shm_group
----------------------------------

View File

@ -0,0 +1,362 @@
relayfs - a high-speed data relay filesystem
============================================
relayfs is a filesystem designed to provide an efficient mechanism for
tools and facilities to relay large and potentially sustained streams
of data from kernel space to user space.
The main abstraction of relayfs is the 'channel'. A channel consists
of a set of per-cpu kernel buffers each represented by a file in the
relayfs filesystem. Kernel clients write into a channel using
efficient write functions which automatically log to the current cpu's
channel buffer. User space applications mmap() the per-cpu files and
retrieve the data as it becomes available.
The format of the data logged into the channel buffers is completely
up to the relayfs client; relayfs does however provide hooks which
allow clients to impose some stucture on the buffer data. Nor does
relayfs implement any form of data filtering - this also is left to
the client. The purpose is to keep relayfs as simple as possible.
This document provides an overview of the relayfs API. The details of
the function parameters are documented along with the functions in the
filesystem code - please see that for details.
Semantics
=========
Each relayfs channel has one buffer per CPU, each buffer has one or
more sub-buffers. Messages are written to the first sub-buffer until
it is too full to contain a new message, in which case it it is
written to the next (if available). Messages are never split across
sub-buffers. At this point, userspace can be notified so it empties
the first sub-buffer, while the kernel continues writing to the next.
When notified that a sub-buffer is full, the kernel knows how many
bytes of it are padding i.e. unused. Userspace can use this knowledge
to copy only valid data.
After copying it, userspace can notify the kernel that a sub-buffer
has been consumed.
relayfs can operate in a mode where it will overwrite data not yet
collected by userspace, and not wait for it to consume it.
relayfs itself does not provide for communication of such data between
userspace and kernel, allowing the kernel side to remain simple and not
impose a single interface on userspace. It does provide a separate
helper though, described below.
klog, relay-app & librelay
==========================
relayfs itself is ready to use, but to make things easier, two
additional systems are provided. klog is a simple wrapper to make
writing formatted text or raw data to a channel simpler, regardless of
whether a channel to write into exists or not, or whether relayfs is
compiled into the kernel or is configured as a module. relay-app is
the kernel counterpart of userspace librelay.c, combined these two
files provide glue to easily stream data to disk, without having to
bother with housekeeping. klog and relay-app can be used together,
with klog providing high-level logging functions to the kernel and
relay-app taking care of kernel-user control and disk-logging chores.
It is possible to use relayfs without relay-app & librelay, but you'll
have to implement communication between userspace and kernel, allowing
both to convey the state of buffers (full, empty, amount of padding).
klog, relay-app and librelay can be found in the relay-apps tarball on
http://relayfs.sourceforge.net
The relayfs user space API
==========================
relayfs implements basic file operations for user space access to
relayfs channel buffer data. Here are the file operations that are
available and some comments regarding their behavior:
open() enables user to open an _existing_ buffer.
mmap() results in channel buffer being mapped into the caller's
memory space. Note that you can't do a partial mmap - you must
map the entire file, which is NRBUF * SUBBUFSIZE.
read() read the contents of a channel buffer. The bytes read are
'consumed' by the reader i.e. they won't be available again
to subsequent reads. If the channel is being used in
no-overwrite mode (the default), it can be read at any time
even if there's an active kernel writer. If the channel is
being used in overwrite mode and there are active channel
writers, results may be unpredictable - users should make
sure that all logging to the channel has ended before using
read() with overwrite mode.
poll() POLLIN/POLLRDNORM/POLLERR supported. User applications are
notified when sub-buffer boundaries are crossed.
close() decrements the channel buffer's refcount. When the refcount
reaches 0 i.e. when no process or kernel client has the buffer
open, the channel buffer is freed.
In order for a user application to make use of relayfs files, the
relayfs filesystem must be mounted. For example,
mount -t relayfs relayfs /mnt/relay
NOTE: relayfs doesn't need to be mounted for kernel clients to create
or use channels - it only needs to be mounted when user space
applications need access to the buffer data.
The relayfs kernel API
======================
Here's a summary of the API relayfs provides to in-kernel clients:
channel management functions:
relay_open(base_filename, parent, subbuf_size, n_subbufs,
callbacks)
relay_close(chan)
relay_flush(chan)
relay_reset(chan)
relayfs_create_dir(name, parent)
relayfs_remove_dir(dentry)
channel management typically called on instigation of userspace:
relay_subbufs_consumed(chan, cpu, subbufs_consumed)
write functions:
relay_write(chan, data, length)
__relay_write(chan, data, length)
relay_reserve(chan, length)
callbacks:
subbuf_start(buf, subbuf, prev_subbuf, prev_padding)
buf_mapped(buf, filp)
buf_unmapped(buf, filp)
helper functions:
relay_buf_full(buf)
subbuf_start_reserve(buf, length)
Creating a channel
------------------
relay_open() is used to create a channel, along with its per-cpu
channel buffers. Each channel buffer will have an associated file
created for it in the relayfs filesystem, which can be opened and
mmapped from user space if desired. The files are named
basename0...basenameN-1 where N is the number of online cpus, and by
default will be created in the root of the filesystem. If you want a
directory structure to contain your relayfs files, you can create it
with relayfs_create_dir() and pass the parent directory to
relay_open(). Clients are responsible for cleaning up any directory
structure they create when the channel is closed - use
relayfs_remove_dir() for that.
The total size of each per-cpu buffer is calculated by multiplying the
number of sub-buffers by the sub-buffer size passed into relay_open().
The idea behind sub-buffers is that they're basically an extension of
double-buffering to N buffers, and they also allow applications to
easily implement random-access-on-buffer-boundary schemes, which can
be important for some high-volume applications. The number and size
of sub-buffers is completely dependent on the application and even for
the same application, different conditions will warrant different
values for these parameters at different times. Typically, the right
values to use are best decided after some experimentation; in general,
though, it's safe to assume that having only 1 sub-buffer is a bad
idea - you're guaranteed to either overwrite data or lose events
depending on the channel mode being used.
Channel 'modes'
---------------
relayfs channels can be used in either of two modes - 'overwrite' or
'no-overwrite'. The mode is entirely determined by the implementation
of the subbuf_start() callback, as described below. In 'overwrite'
mode, also known as 'flight recorder' mode, writes continuously cycle
around the buffer and will never fail, but will unconditionally
overwrite old data regardless of whether it's actually been consumed.
In no-overwrite mode, writes will fail i.e. data will be lost, if the
number of unconsumed sub-buffers equals the total number of
sub-buffers in the channel. It should be clear that if there is no
consumer or if the consumer can't consume sub-buffers fast enought,
data will be lost in either case; the only difference is whether data
is lost from the beginning or the end of a buffer.
As explained above, a relayfs channel is made of up one or more
per-cpu channel buffers, each implemented as a circular buffer
subdivided into one or more sub-buffers. Messages are written into
the current sub-buffer of the channel's current per-cpu buffer via the
write functions described below. Whenever a message can't fit into
the current sub-buffer, because there's no room left for it, the
client is notified via the subbuf_start() callback that a switch to a
new sub-buffer is about to occur. The client uses this callback to 1)
initialize the next sub-buffer if appropriate 2) finalize the previous
sub-buffer if appropriate and 3) return a boolean value indicating
whether or not to actually go ahead with the sub-buffer switch.
To implement 'no-overwrite' mode, the userspace client would provide
an implementation of the subbuf_start() callback something like the
following:
static int subbuf_start(struct rchan_buf *buf,
void *subbuf,
void *prev_subbuf,
unsigned int prev_padding)
{
if (prev_subbuf)
*((unsigned *)prev_subbuf) = prev_padding;
if (relay_buf_full(buf))
return 0;
subbuf_start_reserve(buf, sizeof(unsigned int));
return 1;
}
If the current buffer is full i.e. all sub-buffers remain unconsumed,
the callback returns 0 to indicate that the buffer switch should not
occur yet i.e. until the consumer has had a chance to read the current
set of ready sub-buffers. For the relay_buf_full() function to make
sense, the consumer is reponsible for notifying relayfs when
sub-buffers have been consumed via relay_subbufs_consumed(). Any
subsequent attempts to write into the buffer will again invoke the
subbuf_start() callback with the same parameters; only when the
consumer has consumed one or more of the ready sub-buffers will
relay_buf_full() return 0, in which case the buffer switch can
continue.
The implementation of the subbuf_start() callback for 'overwrite' mode
would be very similar:
static int subbuf_start(struct rchan_buf *buf,
void *subbuf,
void *prev_subbuf,
unsigned int prev_padding)
{
if (prev_subbuf)
*((unsigned *)prev_subbuf) = prev_padding;
subbuf_start_reserve(buf, sizeof(unsigned int));
return 1;
}
In this case, the relay_buf_full() check is meaningless and the
callback always returns 1, causing the buffer switch to occur
unconditionally. It's also meaningless for the client to use the
relay_subbufs_consumed() function in this mode, as it's never
consulted.
The default subbuf_start() implementation, used if the client doesn't
define any callbacks, or doesn't define the subbuf_start() callback,
implements the simplest possible 'no-overwrite' mode i.e. it does
nothing but return 0.
Header information can be reserved at the beginning of each sub-buffer
by calling the subbuf_start_reserve() helper function from within the
subbuf_start() callback. This reserved area can be used to store
whatever information the client wants. In the example above, room is
reserved in each sub-buffer to store the padding count for that
sub-buffer. This is filled in for the previous sub-buffer in the
subbuf_start() implementation; the padding value for the previous
sub-buffer is passed into the subbuf_start() callback along with a
pointer to the previous sub-buffer, since the padding value isn't
known until a sub-buffer is filled. The subbuf_start() callback is
also called for the first sub-buffer when the channel is opened, to
give the client a chance to reserve space in it. In this case the
previous sub-buffer pointer passed into the callback will be NULL, so
the client should check the value of the prev_subbuf pointer before
writing into the previous sub-buffer.
Writing to a channel
--------------------
kernel clients write data into the current cpu's channel buffer using
relay_write() or __relay_write(). relay_write() is the main logging
function - it uses local_irqsave() to protect the buffer and should be
used if you might be logging from interrupt context. If you know
you'll never be logging from interrupt context, you can use
__relay_write(), which only disables preemption. These functions
don't return a value, so you can't determine whether or not they
failed - the assumption is that you wouldn't want to check a return
value in the fast logging path anyway, and that they'll always succeed
unless the buffer is full and no-overwrite mode is being used, in
which case you can detect a failed write in the subbuf_start()
callback by calling the relay_buf_full() helper function.
relay_reserve() is used to reserve a slot in a channel buffer which
can be written to later. This would typically be used in applications
that need to write directly into a channel buffer without having to
stage data in a temporary buffer beforehand. Because the actual write
may not happen immediately after the slot is reserved, applications
using relay_reserve() can keep a count of the number of bytes actually
written, either in space reserved in the sub-buffers themselves or as
a separate array. See the 'reserve' example in the relay-apps tarball
at http://relayfs.sourceforge.net for an example of how this can be
done. Because the write is under control of the client and is
separated from the reserve, relay_reserve() doesn't protect the buffer
at all - it's up to the client to provide the appropriate
synchronization when using relay_reserve().
Closing a channel
-----------------
The client calls relay_close() when it's finished using the channel.
The channel and its associated buffers are destroyed when there are no
longer any references to any of the channel buffers. relay_flush()
forces a sub-buffer switch on all the channel buffers, and can be used
to finalize and process the last sub-buffers before the channel is
closed.
Misc
----
Some applications may want to keep a channel around and re-use it
rather than open and close a new channel for each use. relay_reset()
can be used for this purpose - it resets a channel to its initial
state without reallocating channel buffer memory or destroying
existing mappings. It should however only be called when it's safe to
do so i.e. when the channel isn't currently being written to.
Finally, there are a couple of utility callbacks that can be used for
different purposes. buf_mapped() is called whenever a channel buffer
is mmapped from user space and buf_unmapped() is called when it's
unmapped. The client can use this notification to trigger actions
within the kernel application, such as enabling/disabling logging to
the channel.
Resources
=========
For news, example code, mailing list, etc. see the relayfs homepage:
http://relayfs.sourceforge.net
Credits
=======
The ideas and specs for relayfs came about as a result of discussions
on tracing involving the following:
Michel Dagenais <michel.dagenais@polymtl.ca>
Richard Moore <richardj_moore@uk.ibm.com>
Bob Wisniewski <bob@watson.ibm.com>
Karim Yaghmour <karim@opersys.com>
Tom Zanussi <zanussi@us.ibm.com>
Also thanks to Hubertus Franke for a lot of useful suggestions and bug
reports.

View File

@ -99,14 +99,14 @@ struct device_attribute dev_attr_##_name = { \
For example, declaring
static DEVICE_ATTR(foo,0644,show_foo,store_foo);
static DEVICE_ATTR(foo, S_IWUSR | S_IRUGO, show_foo, store_foo);
is equivalent to doing:
static struct device_attribute dev_attr_foo = {
.attr = {
.name = "foo",
.mode = 0644,
.mode = S_IWUSR | S_IRUGO,
},
.show = show_foo,
.store = store_foo,
@ -216,13 +216,13 @@ A very simple (and naive) implementation of a device attribute is:
static ssize_t show_name(struct device *dev, struct device_attribute *attr, char *buf)
{
return sprintf(buf,"%s\n",dev->name);
return snprintf(buf, PAGE_SIZE, "%s\n", dev->name);
}
static ssize_t store_name(struct device * dev, const char * buf)
{
sscanf(buf, "%20s", dev->name);
return strlen(buf);
return strnlen(buf, PAGE_SIZE);
}
static DEVICE_ATTR(name, S_IRUGO, show_name, store_name);

View File

@ -0,0 +1,95 @@
V9FS: 9P2000 for Linux
======================
ABOUT
=====
v9fs is a Unix implementation of the Plan 9 9p remote filesystem protocol.
This software was originally developed by Ron Minnich <rminnich@lanl.gov>
and Maya Gokhale <maya@lanl.gov>. Additional development by Greg Watson
<gwatson@lanl.gov> and most recently Eric Van Hensbergen
<ericvh@gmail.com> and Latchesar Ionkov <lucho@ionkov.net>.
USAGE
=====
For remote file server:
mount -t 9P 10.10.1.2 /mnt/9
For Plan 9 From User Space applications (http://swtch.com/plan9)
mount -t 9P `namespace`/acme /mnt/9 -o proto=unix,name=$USER
OPTIONS
=======
proto=name select an alternative transport. Valid options are
currently:
unix - specifying a named pipe mount point
tcp - specifying a normal TCP/IP connection
fd - used passed file descriptors for connection
(see rfdno and wfdno)
name=name user name to attempt mount as on the remote server. The
server may override or ignore this value. Certain user
names may require authentication.
aname=name aname specifies the file tree to access when the server is
offering several exported file systems.
debug=n specifies debug level. The debug level is a bitmask.
0x01 = display verbose error messages
0x02 = developer debug (DEBUG_CURRENT)
0x04 = display 9P trace
0x08 = display VFS trace
0x10 = display Marshalling debug
0x20 = display RPC debug
0x40 = display transport debug
0x80 = display allocation debug
rfdno=n the file descriptor for reading with proto=fd
wfdno=n the file descriptor for writing with proto=fd
maxdata=n the number of bytes to use for 9P packet payload (msize)
port=n port to connect to on the remote server
timeout=n request timeouts (in ms) (default 60000ms)
noextend force legacy mode (no 9P2000.u semantics)
uid attempt to mount as a particular uid
gid attempt to mount with a particular gid
afid security channel - used by Plan 9 authentication protocols
nodevmap do not map special files - represent them as normal files.
This can be used to share devices/named pipes/sockets between
hosts. This functionality will be expanded in later versions.
RESOURCES
=========
The Linux version of the 9P server, along with some client-side utilities
can be found at http://v9fs.sf.net (along with a CVS repository of the
development branch of this module). There are user and developer mailing
lists here, as well as a bug-tracker.
For more information on the Plan 9 Operating System check out
http://plan9.bell-labs.com/plan9
For information on Plan 9 from User Space (Plan 9 applications and libraries
ported to Linux/BSD/OSX/etc) check out http://swtch.com/plan9
STATUS
======
The 2.6 kernel support is working on PPC and x86.
PLEASE USE THE SOURCEFORGE BUG-TRACKER TO REPORT PROBLEMS.

View File

@ -1,35 +1,27 @@
/* -*- auto-fill -*- */
Overview of the Virtual File System
Overview of the Linux Virtual File System
Richard Gooch <rgooch@atnf.csiro.au>
Original author: Richard Gooch <rgooch@atnf.csiro.au>
5-JUL-1999
Last updated on August 25, 2005
Copyright (C) 1999 Richard Gooch
Copyright (C) 2005 Pekka Enberg
This file is released under the GPLv2.
Conventions used in this document <section>
=================================
Each section in this document will have the string "<section>" at the
right-hand side of the section title. Each subsection will have
"<subsection>" at the right-hand side. These strings are meant to make
it easier to search through the document.
NOTE that the master copy of this document is available online at:
http://www.atnf.csiro.au/~rgooch/linux/docs/vfs.txt
What is it? <section>
What is it?
===========
The Virtual File System (otherwise known as the Virtual Filesystem
Switch) is the software layer in the kernel that provides the
filesystem interface to userspace programs. It also provides an
abstraction within the kernel which allows different filesystem
implementations to co-exist.
implementations to coexist.
A Quick Look At How It Works <section>
A Quick Look At How It Works
============================
In this section I'll briefly describe how things work, before
@ -38,7 +30,8 @@ when user programs open and manipulate files, and then look from the
other view which is how a filesystem is supported and subsequently
mounted.
Opening a File <subsection>
Opening a File
--------------
The VFS implements the open(2), stat(2), chmod(2) and similar system
@ -77,7 +70,7 @@ back to userspace.
Opening a file requires another operation: allocation of a file
structure (this is the kernel-side implementation of file
descriptors). The freshly allocated file structure is initialised with
descriptors). The freshly allocated file structure is initialized with
a pointer to the dentry and a set of file operation member functions.
These are taken from the inode data. The open() file method is then
called so the specific filesystem implementation can do it's work. You
@ -102,7 +95,8 @@ filesystem or driver code at the same time, on different
processors. You should ensure that access to shared resources is
protected by appropriate locks.
Registering and Mounting a Filesystem <subsection>
Registering and Mounting a Filesystem
-------------------------------------
If you want to support a new kind of filesystem in the kernel, all you
@ -123,17 +117,21 @@ updated to point to the root inode for the new filesystem.
It's now time to look at things in more detail.
struct file_system_type <section>
struct file_system_type
=======================
This describes the filesystem. As of kernel 2.1.99, the following
This describes the filesystem. As of kernel 2.6.13, the following
members are defined:
struct file_system_type {
const char *name;
int fs_flags;
struct super_block *(*read_super) (struct super_block *, void *, int);
struct super_block *(*get_sb) (struct file_system_type *, int,
const char *, void *);
void (*kill_sb) (struct super_block *);
struct module *owner;
struct file_system_type * next;
struct list_head fs_supers;
};
name: the name of the filesystem type, such as "ext2", "iso9660",
@ -141,51 +139,97 @@ struct file_system_type {
fs_flags: various flags (i.e. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.)
read_super: the method to call when a new instance of this
get_sb: the method to call when a new instance of this
filesystem should be mounted
next: for internal VFS use: you should initialise this to NULL
kill_sb: the method to call when an instance of this filesystem
should be unmounted
The read_super() method has the following arguments:
owner: for internal VFS use: you should initialize this to THIS_MODULE in
most cases.
next: for internal VFS use: you should initialize this to NULL
The get_sb() method has the following arguments:
struct super_block *sb: the superblock structure. This is partially
initialised by the VFS and the rest must be initialised by the
read_super() method
initialized by the VFS and the rest must be initialized by the
get_sb() method
int flags: mount flags
const char *dev_name: the device name we are mounting.
void *data: arbitrary mount options, usually comes as an ASCII
string
int silent: whether or not to be silent on error
The read_super() method must determine if the block device specified
The get_sb() method must determine if the block device specified
in the superblock contains a filesystem of the type the method
supports. On success the method returns the superblock pointer, on
failure it returns NULL.
The most interesting member of the superblock structure that the
read_super() method fills in is the "s_op" field. This is a pointer to
get_sb() method fills in is the "s_op" field. This is a pointer to
a "struct super_operations" which describes the next level of the
filesystem implementation.
Usually, a filesystem uses generic one of the generic get_sb()
implementations and provides a fill_super() method instead. The
generic methods are:
struct super_operations <section>
get_sb_bdev: mount a filesystem residing on a block device
get_sb_nodev: mount a filesystem that is not backed by a device
get_sb_single: mount a filesystem which shares the instance between
all mounts
A fill_super() method implementation has the following arguments:
struct super_block *sb: the superblock structure. The method fill_super()
must initialize this properly.
void *data: arbitrary mount options, usually comes as an ASCII
string
int silent: whether or not to be silent on error
struct super_operations
=======================
This describes how the VFS can manipulate the superblock of your
filesystem. As of kernel 2.1.99, the following members are defined:
filesystem. As of kernel 2.6.13, the following members are defined:
struct super_operations {
struct inode *(*alloc_inode)(struct super_block *sb);
void (*destroy_inode)(struct inode *);
void (*read_inode) (struct inode *);
void (*dirty_inode) (struct inode *);
int (*write_inode) (struct inode *, int);
void (*put_inode) (struct inode *);
void (*drop_inode) (struct inode *);
void (*delete_inode) (struct inode *);
int (*notify_change) (struct dentry *, struct iattr *);
void (*put_super) (struct super_block *);
void (*write_super) (struct super_block *);
int (*statfs) (struct super_block *, struct statfs *, int);
int (*sync_fs)(struct super_block *sb, int wait);
void (*write_super_lockfs) (struct super_block *);
void (*unlockfs) (struct super_block *);
int (*statfs) (struct super_block *, struct kstatfs *);
int (*remount_fs) (struct super_block *, int *, char *);
void (*clear_inode) (struct inode *);
void (*umount_begin) (struct super_block *);
void (*sync_inodes) (struct super_block *sb,
struct writeback_control *wbc);
int (*show_options)(struct seq_file *, struct vfsmount *);
ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
};
All methods are called without any locks being held, unless otherwise
@ -193,43 +237,62 @@ noted. This means that most methods can block safely. All methods are
only called from a process context (i.e. not from an interrupt handler
or bottom half).
alloc_inode: this method is called by inode_alloc() to allocate memory
for struct inode and initialize it.
destroy_inode: this method is called by destroy_inode() to release
resources allocated for struct inode.
read_inode: this method is called to read a specific inode from the
mounted filesystem. The "i_ino" member in the "struct inode"
will be initialised by the VFS to indicate which inode to
read. Other members are filled in by this method
mounted filesystem. The i_ino member in the struct inode is
initialized by the VFS to indicate which inode to read. Other
members are filled in by this method.
You can set this to NULL and use iget5_locked() instead of iget()
to read inodes. This is necessary for filesystems for which the
inode number is not sufficient to identify an inode.
dirty_inode: this method is called by the VFS to mark an inode dirty.
write_inode: this method is called when the VFS needs to write an
inode to disc. The second parameter indicates whether the write
should be synchronous or not, not all filesystems check this flag.
put_inode: called when the VFS inode is removed from the inode
cache. This method is optional
cache.
drop_inode: called when the last access to the inode is dropped,
with the inode_lock spinlock held.
This method should be either NULL (normal unix filesystem
This method should be either NULL (normal UNIX filesystem
semantics) or "generic_delete_inode" (for filesystems that do not
want to cache inodes - causing "delete_inode" to always be
called regardless of the value of i_nlink)
The "generic_delete_inode()" behaviour is equivalent to the
The "generic_delete_inode()" behavior is equivalent to the
old practice of using "force_delete" in the put_inode() case,
but does not have the races that the "force_delete()" approach
had.
delete_inode: called when the VFS wants to delete an inode
notify_change: called when VFS inode attributes are changed. If this
is NULL the VFS falls back to the write_inode() method. This
is called with the kernel lock held
put_super: called when the VFS wishes to free the superblock
(i.e. unmount). This is called with the superblock lock held
write_super: called when the VFS superblock needs to be written to
disc. This method is optional
sync_fs: called when VFS is writing out all dirty data associated with
a superblock. The second parameter indicates whether the method
should wait until the write out has been completed. Optional.
write_super_lockfs: called when VFS is locking a filesystem and forcing
it into a consistent state. This function is currently used by the
Logical Volume Manager (LVM).
unlockfs: called when VFS is unlocking a filesystem and making it writable
again.
statfs: called when the VFS needs to get filesystem statistics. This
is called with the kernel lock held
@ -238,21 +301,31 @@ or bottom half).
clear_inode: called then the VFS clears the inode. Optional
umount_begin: called when the VFS is unmounting a filesystem.
sync_inodes: called when the VFS is writing out dirty data associated with
a superblock.
show_options: called by the VFS to show mount options for /proc/<pid>/mounts.
quota_read: called by the VFS to read from filesystem quota file.
quota_write: called by the VFS to write to filesystem quota file.
The read_inode() method is responsible for filling in the "i_op"
field. This is a pointer to a "struct inode_operations" which
describes the methods that can be performed on individual inodes.
struct inode_operations <section>
struct inode_operations
=======================
This describes how the VFS can manipulate an inode in your
filesystem. As of kernel 2.1.99, the following members are defined:
filesystem. As of kernel 2.6.13, the following members are defined:
struct inode_operations {
struct file_operations * default_file_ops;
int (*create) (struct inode *,struct dentry *,int);
int (*lookup) (struct inode *,struct dentry *);
int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
int (*link) (struct dentry *,struct inode *,struct dentry *);
int (*unlink) (struct inode *,struct dentry *);
int (*symlink) (struct inode *,struct dentry *,const char *);
@ -261,25 +334,22 @@ struct inode_operations {
int (*mknod) (struct inode *,struct dentry *,int,dev_t);
int (*rename) (struct inode *, struct dentry *,
struct inode *, struct dentry *);
int (*readlink) (struct dentry *, char *,int);
struct dentry * (*follow_link) (struct dentry *, struct dentry *);
int (*readpage) (struct file *, struct page *);
int (*writepage) (struct page *page, struct writeback_control *wbc);
int (*bmap) (struct inode *,int);
int (*readlink) (struct dentry *, char __user *,int);
void * (*follow_link) (struct dentry *, struct nameidata *);
void (*put_link) (struct dentry *, struct nameidata *, void *);
void (*truncate) (struct inode *);
int (*permission) (struct inode *, int);
int (*smap) (struct inode *,int);
int (*updatepage) (struct file *, struct page *, const char *,
unsigned long, unsigned int, int);
int (*revalidate) (struct dentry *);
int (*permission) (struct inode *, int, struct nameidata *);
int (*setattr) (struct dentry *, struct iattr *);
int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
ssize_t (*listxattr) (struct dentry *, char *, size_t);
int (*removexattr) (struct dentry *, const char *);
};
Again, all methods are called without any locks being held, unless
otherwise noted.
default_file_ops: this is a pointer to a "struct file_operations"
which describes how to open and then manipulate open files
create: called by the open(2) and creat(2) system calls. Only
required if you want to support regular files. The dentry you
get should not have an inode (i.e. it should be a negative
@ -329,30 +399,142 @@ otherwise noted.
follow_link: called by the VFS to follow a symbolic link to the
inode it points to. Only required if you want to support
symbolic links
symbolic links. This function returns a void pointer cookie
that is passed to put_link().
put_link: called by the VFS to release resources allocated by
follow_link(). The cookie returned by follow_link() is passed to
to this function as the last parameter. It is used by filesystems
such as NFS where page cache is not stable (i.e. page that was
installed when the symbolic link walk started might not be in the
page cache at the end of the walk).
truncate: called by the VFS to change the size of a file. The i_size
field of the inode is set to the desired size by the VFS before
this function is called. This function is called by the truncate(2)
system call and related functionality.
permission: called by the VFS to check for access rights on a POSIX-like
filesystem.
setattr: called by the VFS to set attributes for a file. This function is
called by chmod(2) and related system calls.
getattr: called by the VFS to get attributes of a file. This function is
called by stat(2) and related system calls.
setxattr: called by the VFS to set an extended attribute for a file.
Extended attribute is a name:value pair associated with an inode. This
function is called by setxattr(2) system call.
getxattr: called by the VFS to retrieve the value of an extended attribute
name. This function is called by getxattr(2) function call.
listxattr: called by the VFS to list all extended attributes for a given
file. This function is called by listxattr(2) system call.
removexattr: called by the VFS to remove an extended attribute from a file.
This function is called by removexattr(2) system call.
struct file_operations <section>
struct address_space_operations
===============================
This describes how the VFS can manipulate mapping of a file to page cache in
your filesystem. As of kernel 2.6.13, the following members are defined:
struct address_space_operations {
int (*writepage)(struct page *page, struct writeback_control *wbc);
int (*readpage)(struct file *, struct page *);
int (*sync_page)(struct page *);
int (*writepages)(struct address_space *, struct writeback_control *);
int (*set_page_dirty)(struct page *page);
int (*readpages)(struct file *filp, struct address_space *mapping,
struct list_head *pages, unsigned nr_pages);
int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
sector_t (*bmap)(struct address_space *, sector_t);
int (*invalidatepage) (struct page *, unsigned long);
int (*releasepage) (struct page *, int);
ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
loff_t offset, unsigned long nr_segs);
struct page* (*get_xip_page)(struct address_space *, sector_t,
int);
};
writepage: called by the VM write a dirty page to backing store.
readpage: called by the VM to read a page from backing store.
sync_page: called by the VM to notify the backing store to perform all
queued I/O operations for a page. I/O operations for other pages
associated with this address_space object may also be performed.
writepages: called by the VM to write out pages associated with the
address_space object.
set_page_dirty: called by the VM to set a page dirty.
readpages: called by the VM to read pages associated with the address_space
object.
prepare_write: called by the generic write path in VM to set up a write
request for a page.
commit_write: called by the generic write path in VM to write page to
its backing store.
bmap: called by the VFS to map a logical block offset within object to
physical block number. This method is use by for the legacy FIBMAP
ioctl. Other uses are discouraged.
invalidatepage: called by the VM on truncate to disassociate a page from its
address_space mapping.
releasepage: called by the VFS to release filesystem specific metadata from
a page.
direct_IO: called by the VM for direct I/O writes and reads.
get_xip_page: called by the VM to translate a block number to a page.
The page is valid until the corresponding filesystem is unmounted.
Filesystems that want to use execute-in-place (XIP) need to implement
it. An example implementation can be found in fs/ext2/xip.c.
struct file_operations
======================
This describes how the VFS can manipulate an open file. As of kernel
2.1.99, the following members are defined:
2.6.13, the following members are defined:
struct file_operations {
loff_t (*llseek) (struct file *, loff_t, int);
ssize_t (*read) (struct file *, char *, size_t, loff_t *);
ssize_t (*write) (struct file *, const char *, size_t, loff_t *);
ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t);
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t);
int (*readdir) (struct file *, void *, filldir_t);
unsigned int (*poll) (struct file *, struct poll_table_struct *);
int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *);
int (*release) (struct inode *, struct file *);
int (*fsync) (struct file *, struct dentry *);
int (*fasync) (struct file *, int);
int (*check_media_change) (kdev_t dev);
int (*revalidate) (kdev_t dev);
int (*fsync) (struct file *, struct dentry *, int datasync);
int (*aio_fsync) (struct kiocb *, int datasync);
int (*fasync) (int, struct file *, int);
int (*lock) (struct file *, int, struct file_lock *);
ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *);
ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
int (*check_flags)(int);
int (*dir_notify)(struct file *filp, unsigned long arg);
int (*flock) (struct file *, int, struct file_lock *);
};
Again, all methods are called without any locks being held, unless
@ -362,8 +544,12 @@ otherwise noted.
read: called by read(2) and related system calls
aio_read: called by io_submit(2) and other asynchronous I/O operations
write: called by write(2) and related system calls
aio_write: called by io_submit(2) and other asynchronous I/O operations
readdir: called when the VFS needs to read the directory contents
poll: called by the VFS when a process wants to check if there is
@ -372,18 +558,25 @@ otherwise noted.
ioctl: called by the ioctl(2) system call
unlocked_ioctl: called by the ioctl(2) system call. Filesystems that do not
require the BKL should use this method instead of the ioctl() above.
compat_ioctl: called by the ioctl(2) system call when 32 bit system calls
are used on 64 bit kernels.
mmap: called by the mmap(2) system call
open: called by the VFS when an inode should be opened. When the VFS
opens a file, it creates a new "struct file" and initialises
the "f_op" file operations member with the "default_file_ops"
field in the inode structure. It then calls the open method
for the newly allocated file structure. You might think that
the open method really belongs in "struct inode_operations",
and you may be right. I think it's done the way it is because
it makes filesystems simpler to implement. The open() method
is a good place to initialise the "private_data" member in the
file structure if you want to point to a device structure
opens a file, it creates a new "struct file". It then calls the
open method for the newly allocated file structure. You might
think that the open method really belongs in
"struct inode_operations", and you may be right. I think it's
done the way it is because it makes filesystems simpler to
implement. The open() method is a good place to initialize the
"private_data" member in the file structure if you want to point
to a device structure
flush: called by the close(2) system call to flush a file
release: called when the last reference to an open file is closed
@ -392,6 +585,23 @@ otherwise noted.
fasync: called by the fcntl(2) system call when asynchronous
(non-blocking) mode is enabled for a file
lock: called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW
commands
readv: called by the readv(2) system call
writev: called by the writev(2) system call
sendfile: called by the sendfile(2) system call
get_unmapped_area: called by the mmap(2) system call
check_flags: called by the fcntl(2) system call for F_SETFL command
dir_notify: called by the fcntl(2) system call for F_NOTIFY command
flock: called by the flock(2) system call
Note that the file operations are implemented by the specific
filesystem in which the inode resides. When opening a device node
(character or block special) most filesystems will call special
@ -400,29 +610,28 @@ driver information. These support routines replace the filesystem file
operations with those for the device driver, and then proceed to call
the new open() method for the file. This is how opening a device file
in the filesystem eventually ends up calling the device driver open()
method. Note the devfs (the Device FileSystem) has a more direct path
from device node to device driver (this is an unofficial kernel
patch).
method.
Directory Entry Cache (dcache) <section>
------------------------------
Directory Entry Cache (dcache)
==============================
struct dentry_operations
========================
------------------------
This describes how a filesystem can overload the standard dentry
operations. Dentries and the dcache are the domain of the VFS and the
individual filesystem implementations. Device drivers have no business
here. These methods may be set to NULL, as they are either optional or
the VFS uses a default. As of kernel 2.1.99, the following members are
the VFS uses a default. As of kernel 2.6.13, the following members are
defined:
struct dentry_operations {
int (*d_revalidate)(struct dentry *);
int (*d_revalidate)(struct dentry *, struct nameidata *);
int (*d_hash) (struct dentry *, struct qstr *);
int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
void (*d_delete)(struct dentry *);
int (*d_delete)(struct dentry *);
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
};
@ -451,6 +660,7 @@ Each dentry has a pointer to its parent dentry, as well as a hash list
of child dentries. Child dentries are basically like files in a
directory.
Directory Entry Cache APIs
--------------------------
@ -471,7 +681,7 @@ manipulate dentries:
"d_delete" method is called
d_drop: this unhashes a dentry from its parents hash list. A
subsequent call to dput() will dellocate the dentry if its
subsequent call to dput() will deallocate the dentry if its
usage count drops to 0
d_delete: delete a dentry. If there are no other open references to
@ -512,11 +722,11 @@ this path.
Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus
in every component during path look-up. Since 2.5.10 onwards,
fastwalk algorithm changed this by holding the dcache_lock
fast-walk algorithm changed this by holding the dcache_lock
at the beginning and walking as many cached path component
dentries as possible. This signficantly decreases the number
dentries as possible. This significantly decreases the number
of acquisition of dcache_lock. However it also increases the
lock hold time signficantly and affects performance in large
lock hold time significantly and affects performance in large
SMP machines. Since 2.5.62 kernel, dcache has been using
a new locking model that uses RCU to make dcache look-up
lock-free.
@ -527,7 +737,7 @@ protected the hash chain, d_child, d_alias, d_lru lists as well
as d_inode and several other things like mount look-up. RCU-based
changes affect only the way the hash chain is protected. For everything
else the dcache_lock must be taken for both traversing as well as
updating. The hash chain updations too take the dcache_lock.
updating. The hash chain updates too take the dcache_lock.
The significant change is the way d_lookup traverses the hash chain,
it doesn't acquire the dcache_lock for this and rely on RCU to
ensure that the dentry has not been *freed*.
@ -535,14 +745,15 @@ ensure that the dentry has not been *freed*.
Dcache locking details
----------------------
For many multi-user workloads, open() and stat() on files are
very frequently occurring operations. Both involve walking
of path names to find the dentry corresponding to the
concerned file. In 2.4 kernel, dcache_lock was held
during look-up of each path component. Contention and
cacheline bouncing of this global lock caused significant
cache-line bouncing of this global lock caused significant
scalability problems. With the introduction of RCU
in linux kernel, this was worked around by making
in Linux kernel, this was worked around by making
the look-up of path components during path walking lock-free.
@ -562,7 +773,7 @@ Some of the important changes are :
2. Insertion of a dentry into the hash table is done using
hlist_add_head_rcu() which take care of ordering the writes -
the writes to the dentry must be visible before the dentry
is inserted. This works in conjuction with hlist_for_each_rcu()
is inserted. This works in conjunction with hlist_for_each_rcu()
while walking the hash chain. The only requirement is that
all initialization to the dentry must be done before hlist_add_head_rcu()
since we don't have dcache_lock protection while traversing
@ -584,7 +795,7 @@ Some of the important changes are :
the same. In some sense, dcache_rcu path walking looks like
the pre-2.5.10 version.
5. All dentry hash chain updations must take the dcache_lock as well as
5. All dentry hash chain updates must take the dcache_lock as well as
the per-dentry lock in that order. dput() does this to ensure
that a dentry that has just been looked up in another CPU
doesn't get deleted before dget() can be done on it.
@ -640,10 +851,10 @@ handled as described below :
Since we redo the d_parent check and compare name while holding
d_lock, lock-free look-up will not race against d_move().
4. There can be a theoritical race when a dentry keeps coming back
4. There can be a theoretical race when a dentry keeps coming back
to original bucket due to double moves. Due to this look-up may
consider that it has never moved and can end up in a infinite loop.
But this is not any worse that theoritical livelocks we already
But this is not any worse that theoretical livelocks we already
have in the kernel.

View File

@ -32,14 +32,14 @@ static void sample_firmware_load(char *firmware, int size)
u8 buf[size+1];
memcpy(buf, firmware, size);
buf[size] = '\0';
printk("firmware_sample_driver: firmware: %s\n", buf);
printk(KERN_INFO "firmware_sample_driver: firmware: %s\n", buf);
}
static void sample_probe_default(void)
{
/* uses the default method to get the firmware */
const struct firmware *fw_entry;
printk("firmware_sample_driver: a ghost device got inserted :)\n");
printk(KERN_INFO "firmware_sample_driver: a ghost device got inserted :)\n");
if(request_firmware(&fw_entry, "sample_driver_fw", &ghost_device)!=0)
{
@ -61,7 +61,7 @@ static void sample_probe_specific(void)
/* NOTE: This currently doesn't work */
printk("firmware_sample_driver: a ghost device got inserted :)\n");
printk(KERN_INFO "firmware_sample_driver: a ghost device got inserted :)\n");
if(request_firmware(NULL, "sample_driver_fw", &ghost_device)!=0)
{
@ -83,7 +83,7 @@ static void sample_probe_async_cont(const struct firmware *fw, void *context)
return;
}
printk("firmware_sample_driver: device pointer \"%s\"\n",
printk(KERN_INFO "firmware_sample_driver: device pointer \"%s\"\n",
(char *)context);
sample_firmware_load(fw->data, fw->size);
}

View File

@ -2,16 +2,11 @@ Kernel driver lm78
==================
Supported chips:
* National Semiconductor LM78
* National Semiconductor LM78 / LM78-J
Prefix: 'lm78'
Addresses scanned: I2C 0x20 - 0x2f, ISA 0x290 (8 I/O ports)
Datasheet: Publicly available at the National Semiconductor website
http://www.national.com/
* National Semiconductor LM78-J
Prefix: 'lm78-j'
Addresses scanned: I2C 0x20 - 0x2f, ISA 0x290 (8 I/O ports)
Datasheet: Publicly available at the National Semiconductor website
http://www.national.com/
* National Semiconductor LM79
Prefix: 'lm79'
Addresses scanned: I2C 0x20 - 0x2f, ISA 0x290 (8 I/O ports)

174
Documentation/hwmon/w83792d Normal file
View File

@ -0,0 +1,174 @@
Kernel driver w83792d
=====================
Supported chips:
* Winbond W83792D
Prefix: 'w83792d'
Addresses scanned: I2C 0x2c - 0x2f
Datasheet: http://www.winbond.com.tw/E-WINBONDHTM/partner/PDFresult.asp?Pname=1035
Author: Chunhao Huang
Contact: DZShen <DZShen@Winbond.com.tw>
Module Parameters
-----------------
* init int
(default 1)
Use 'init=0' to bypass initializing the chip.
Try this if your computer crashes when you load the module.
* force_subclients=bus,caddr,saddr,saddr
This is used to force the i2c addresses for subclients of
a certain chip. Example usage is `force_subclients=0,0x2f,0x4a,0x4b'
to force the subclients of chip 0x2f on bus 0 to i2c addresses
0x4a and 0x4b.
Description
-----------
This driver implements support for the Winbond W83792AD/D.
Detection of the chip can sometimes be foiled because it can be in an
internal state that allows no clean access (Bank with ID register is not
currently selected). If you know the address of the chip, use a 'force'
parameter; this will put it into a more well-behaved state first.
The driver implements three temperature sensors, seven fan rotation speed
sensors, nine voltage sensors, and two automatic fan regulation
strategies called: Smart Fan I (Thermal Cruise mode) and Smart Fan II.
Automatic fan control mode is possible only for fan1-fan3. Fan4-fan7 can run
synchronized with selected fan (fan1-fan3). This functionality and manual PWM
control for fan4-fan7 is not yet implemented.
Temperatures are measured in degrees Celsius and measurement resolution is 1
degC for temp1 and 0.5 degC for temp2 and temp3. An alarm is triggered when
the temperature gets higher than the Overtemperature Shutdown value; it stays
on until the temperature falls below the Hysteresis value.
Fan rotation speeds are reported in RPM (rotations per minute). An alarm is
triggered if the rotation speed has dropped below a programmable limit. Fan
readings can be divided by a programmable divider (1, 2, 4, 8, 16, 32, 64 or
128) to give the readings more range or accuracy.
Voltage sensors (also known as IN sensors) report their values in millivolts.
An alarm is triggered if the voltage has crossed a programmable minimum
or maximum limit.
Alarms are provided as output from "realtime status register". Following bits
are defined:
bit - alarm on:
0 - in0
1 - in1
2 - temp1
3 - temp2
4 - temp3
5 - fan1
6 - fan2
7 - fan3
8 - in2
9 - in3
10 - in4
11 - in5
12 - in6
13 - VID change
14 - chassis
15 - fan7
16 - tart1
17 - tart2
18 - tart3
19 - in7
20 - in8
21 - fan4
22 - fan5
23 - fan6
Tart will be asserted while target temperature cannot be achieved after 3 minutes
of full speed rotation of corresponding fan.
In addition to the alarms described above, there is a CHAS alarm on the chips
which triggers if your computer case is open (This one is latched, contrary
to realtime alarms).
The chips only update values each 3 seconds; reading them more often will
do no harm, but will return 'old' values.
W83792D PROBLEMS
----------------
Known problems:
- This driver is only for Winbond W83792D C version device, there
are also some motherboards with B version W83792D device. The
calculation method to in6-in7(measured value, limits) is a little
different between C and B version. C or B version can be identified
by CR[0x49h].
- The function of vid and vrm has not been finished, because I'm NOT
very familiar with them. Adding support is welcome.
  - The function of chassis open detection needs more tests.
- If you have ASUS server board and chip was not found: Then you will
need to upgrade to latest (or beta) BIOS. If it does not help please
contact us.
Fan control
-----------
Manual mode
-----------
Works as expected. You just need to specify desired PWM/DC value (fan speed)
in appropriate pwm# file.
Thermal cruise
--------------
In this mode, W83792D provides the Smart Fan system to automatically control
fan speed to keep the temperatures of CPU and the system within specific
range. At first a wanted temperature and interval must be set. This is done
via thermal_cruise# file. The tolerance# file serves to create T +- tolerance
interval. The fan speed will be lowered as long as the current temperature
remains below the thermal_cruise# +- tolerance# value. Once the temperature
exceeds the high limit (T+tolerance), the fan will be turned on with a
specific speed set by pwm# and automatically controlled its PWM duty cycle
with the temperature varying. Three conditions may occur:
(1) If the temperature still exceeds the high limit, PWM duty
cycle will increase slowly.
(2) If the temperature goes below the high limit, but still above the low
limit (T-tolerance), the fan speed will be fixed at the current speed because
the temperature is in the target range.
(3) If the temperature goes below the low limit, PWM duty cycle will decrease
slowly to 0 or a preset stop value until the temperature exceeds the low
limit. (The preset stop value handling is not yet implemented in driver)
Smart Fan II
------------
W83792D also provides a special mode for fan. Four temperature points are
available. When related temperature sensors detects the temperature in preset
temperature region (sf2_point@_fan# +- tolerance#) it will cause fans to run
on programmed value from sf2_level@_fan#. You need to set four temperatures
for each fan.
/sys files
----------
pwm[1-3] - this file stores PWM duty cycle or DC value (fan speed) in range:
0 (stop) to 255 (full)
pwm[1-3]_enable - this file controls mode of fan/temperature control:
* 0 Disabled
* 1 Manual mode
* 2 Smart Fan II
* 3 Thermal Cruise
pwm[1-3]_mode - Select PWM of DC mode
* 0 DC
* 1 PWM
thermal_cruise[1-3] - Selects the desired temperature for cruise (degC)
tolerance[1-3] - Value in degrees of Celsius (degC) for +- T
sf2_point[1-4]_fan[1-3] - four temperature points for each fan for Smart Fan II
sf2_level[1-3]_fan[1-3] - three PWM/DC levels for each fan for Smart Fan II

View File

@ -4,22 +4,13 @@ Kernel driver max6875
Supported chips:
* Maxim MAX6874, MAX6875
Prefix: 'max6875'
Addresses scanned: 0x50, 0x52
Addresses scanned: None (see below)
Datasheet:
http://pdfserv.maxim-ic.com/en/ds/MAX6874-MAX6875.pdf
Author: Ben Gardner <bgardner@wabtec.com>
Module Parameters
-----------------
* allow_write int
Set to non-zero to enable write permission:
*0: Read only
1: Read and write
Description
-----------
@ -33,34 +24,85 @@ registers.
The Maxim MAX6874 is a similar, mostly compatible device, with more intputs
and outputs:
vin gpi vout
MAX6874 6 4 8
MAX6875 4 3 5
MAX6874 chips can have four different addresses (as opposed to only two for
the MAX6875). The additional addresses (0x54 and 0x56) are not probed by
this driver by default, but the probe module parameter can be used if
needed.
See the datasheet for details on how to program the EEPROM.
See the datasheet for more information.
Sysfs entries
-------------
eeprom_user - 512 bytes of user-defined EEPROM space. Only writable if
allow_write was set and register 0x43 is 0.
eeprom_config - 70 bytes of config EEPROM. Note that changes will not get
loaded into register space until a power cycle or device reset.
reg_config - 70 bytes of register space. Any changes take affect immediately.
eeprom - 512 bytes of user-defined EEPROM space.
General Remarks
---------------
A typical application will require that the EEPROMs be programmed once and
never altered afterwards.
Valid addresses for the MAX6875 are 0x50 and 0x52.
Valid addresses for the MAX6874 are 0x50, 0x52, 0x54 and 0x56.
The driver does not probe any address, so you must force the address.
Example:
$ modprobe max6875 force=0,0x50
The MAX6874/MAX6875 ignores address bit 0, so this driver attaches to multiple
addresses. For example, for address 0x50, it also reserves 0x51.
The even-address instance is called 'max6875', the odd one is 'max6875 subclient'.
Programming the chip using i2c-dev
----------------------------------
Use the i2c-dev interface to access and program the chips.
Reads and writes are performed differently depending on the address range.
The configuration registers are at addresses 0x00 - 0x45.
Use i2c_smbus_write_byte_data() to write a register and
i2c_smbus_read_byte_data() to read a register.
The command is the register number.
Examples:
To write a 1 to register 0x45:
i2c_smbus_write_byte_data(fd, 0x45, 1);
To read register 0x45:
value = i2c_smbus_read_byte_data(fd, 0x45);
The configuration EEPROM is at addresses 0x8000 - 0x8045.
The user EEPROM is at addresses 0x8100 - 0x82ff.
Use i2c_smbus_write_word_data() to write a byte to EEPROM.
The command is the upper byte of the address: 0x80, 0x81, or 0x82.
The data word is the lower part of the address or'd with data << 8.
cmd = address >> 8;
val = (address & 0xff) | (data << 8);
Example:
To write 0x5a to address 0x8003:
i2c_smbus_write_word_data(fd, 0x80, 0x5a03);
Reading data from the EEPROM is a little more complicated.
Use i2c_smbus_write_byte_data() to set the read address and then
i2c_smbus_read_byte() or i2c_smbus_read_i2c_block_data() to read the data.
Example:
To read data starting at offset 0x8100, first set the address:
i2c_smbus_write_byte_data(fd, 0x81, 0x00);
And then read the data
value = i2c_smbus_read_byte(fd);
or
count = i2c_smbus_read_i2c_block_data(fd, 0x84, buffer);
The block read should read 16 bytes.
0x84 is the block read command.
See the datasheet for more details.

View File

@ -115,7 +115,7 @@ CHECKING THROUGH /DEV
If you try to access an adapter from a userspace program, you will have
to use the /dev interface. You will still have to check whether the
functionality you need is supported, of course. This is done using
the I2C_FUNCS ioctl. An example, adapted from the lm_sensors i2c_detect
the I2C_FUNCS ioctl. An example, adapted from the lm_sensors i2cdetect
program, is below:
int file;

View File

@ -1,4 +1,4 @@
Revision 4, 2004-03-30
Revision 5, 2005-07-29
Jean Delvare <khali@linux-fr.org>
Greg KH <greg@kroah.com>
@ -17,20 +17,22 @@ yours for best results.
Technical changes:
* [Includes] Get rid of "version.h". Replace <linux/i2c-proc.h> with
<linux/i2c-sensor.h>. Includes typically look like that:
* [Includes] Get rid of "version.h" and <linux/i2c-proc.h>.
Includes typically look like that:
#include <linux/module.h>
#include <linux/init.h>
#include <linux/slab.h>
#include <linux/i2c.h>
#include <linux/i2c-sensor.h>
#include <linux/i2c-vid.h> /* if you need VRM support */
#include <linux/hwmon.h> /* for hardware monitoring drivers */
#include <linux/hwmon-sysfs.h>
#include <linux/hwmon-vid.h> /* if you need VRM support */
#include <asm/io.h> /* if you have I/O operations */
Please respect this inclusion order. Some extra headers may be
required for a given driver (e.g. "lm75.h").
* [Addresses] SENSORS_I2C_END becomes I2C_CLIENT_END, SENSORS_ISA_END
becomes I2C_CLIENT_ISA_END.
* [Addresses] SENSORS_I2C_END becomes I2C_CLIENT_END, ISA addresses
are no more handled by the i2c core.
SENSORS_INSMOD_<n> becomes I2C_CLIENT_INSMOD_<n>.
* [Client data] Get rid of sysctl_id. Try using standard names for
register values (for example, temp_os becomes temp_max). You're
@ -66,13 +68,15 @@ Technical changes:
if (!(adapter->class & I2C_CLASS_HWMON))
return 0;
ISA-only drivers of course don't need this.
Call i2c_probe() instead of i2c_detect().
* [Detect] As mentioned earlier, the flags parameter is gone.
The type_name and client_name strings are replaced by a single
name string, which will be filled with a lowercase, short string
(typically the driver name, e.g. "lm75").
In i2c-only drivers, drop the i2c_is_isa_adapter check, it's
useless.
useless. Same for isa-only drivers, as the test would always be
true. Only hybrid drivers (which are quite rare) still need it.
The errorN labels are reduced to the number needed. If that number
is 2 (i2c-only drivers), it is advised that the labels are named
exit and exit_free. For i2c+isa drivers, labels should be named
@ -86,6 +90,8 @@ Technical changes:
device_create_file. Move the driver initialization before any
sysfs file creation.
Drop client->id.
Drop any 24RF08 corruption prevention you find, as this is now done
at the i2c-core level, and doing it twice voids it.
* [Init] Limits must not be set by the driver (can be done later in
user-space). Chip should not be reset default (although a module
@ -93,7 +99,8 @@ Technical changes:
limited to the strictly necessary steps.
* [Detach] Get rid of data, remove the call to
i2c_deregister_entry.
i2c_deregister_entry. Do not log an error message if
i2c_detach_client fails, as i2c-core will now do it for you.
* [Update] Don't access client->data directly, use
i2c_get_clientdata(client) instead.

View File

@ -148,15 +148,15 @@ are defined in i2c.h to help you support them, as well as a generic
detection algorithm.
You do not have to use this parameter interface; but don't try to use
function i2c_probe() (or i2c_detect()) if you don't.
function i2c_probe() if you don't.
NOTE: If you want to write a `sensors' driver, the interface is slightly
different! See below.
Probing classes (i2c)
---------------------
Probing classes
---------------
All parameters are given as lists of unsigned 16-bit integers. Lists are
terminated by I2C_CLIENT_END.
@ -171,12 +171,18 @@ The following lists are used internally:
ignore: insmod parameter.
A list of pairs. The first value is a bus number (-1 for any I2C bus),
the second is the I2C address. These addresses are never probed.
This parameter overrules 'normal' and 'probe', but not the 'force' lists.
This parameter overrules the 'normal_i2c' list only.
force: insmod parameter.
A list of pairs. The first value is a bus number (-1 for any I2C bus),
the second is the I2C address. A device is blindly assumed to be on
the given address, no probing is done.
Additionally, kind-specific force lists may optionally be defined if
the driver supports several chip kinds. They are grouped in a
NULL-terminated list of pointers named forces, those first element if the
generic force list mentioned above. Each additional list correspond to an
insmod parameter of the form force_<kind>.
Fortunately, as a module writer, you just have to define the `normal_i2c'
parameter. The complete declaration could look like this:
@ -186,66 +192,17 @@ parameter. The complete declaration could look like this:
/* Magic definition of all other variables and things */
I2C_CLIENT_INSMOD;
/* Or, if your driver supports, say, 2 kind of devices: */
I2C_CLIENT_INSMOD_2(foo, bar);
If you use the multi-kind form, an enum will be defined for you:
enum chips { any_chip, foo, bar, ... }
You can then (and certainly should) use it in the driver code.
Note that you *have* to call the defined variable `normal_i2c',
without any prefix!
Probing classes (sensors)
-------------------------
If you write a `sensors' driver, you use a slightly different interface.
As well as I2C addresses, we have to cope with ISA addresses. Also, we
use a enum of chip types. Don't forget to include `sensors.h'.
The following lists are used internally. They are all lists of integers.
normal_i2c: filled in by the module writer. Terminated by SENSORS_I2C_END.
A list of I2C addresses which should normally be examined.
normal_isa: filled in by the module writer. Terminated by SENSORS_ISA_END.
A list of ISA addresses which should normally be examined.
probe: insmod parameter. Initialize this list with SENSORS_I2C_END values.
A list of pairs. The first value is a bus number (SENSORS_ISA_BUS for
the ISA bus, -1 for any I2C bus), the second is the address. These
addresses are also probed, as if they were in the 'normal' list.
ignore: insmod parameter. Initialize this list with SENSORS_I2C_END values.
A list of pairs. The first value is a bus number (SENSORS_ISA_BUS for
the ISA bus, -1 for any I2C bus), the second is the I2C address. These
addresses are never probed. This parameter overrules 'normal' and
'probe', but not the 'force' lists.
Also used is a list of pointers to sensors_force_data structures:
force_data: insmod parameters. A list, ending with an element of which
the force field is NULL.
Each element contains the type of chip and a list of pairs.
The first value is a bus number (SENSORS_ISA_BUS for the ISA bus,
-1 for any I2C bus), the second is the address.
These are automatically translated to insmod variables of the form
force_foo.
So we have a generic insmod variabled `force', and chip-specific variables
`force_CHIPNAME'.
Fortunately, as a module writer, you just have to define the `normal_i2c'
and `normal_isa' parameters, and define what chip names are used.
The complete declaration could look like this:
/* Scan i2c addresses 0x37, and 0x48 to 0x4f */
static unsigned short normal_i2c[] = { 0x37, 0x48, 0x49, 0x4a, 0x4b, 0x4c,
0x4d, 0x4e, 0x4f, I2C_CLIENT_END };
/* Scan ISA address 0x290 */
static unsigned int normal_isa[] = {0x0290,SENSORS_ISA_END};
/* Define chips foo and bar, as well as all module parameters and things */
SENSORS_INSMOD_2(foo,bar);
If you have one chip, you use macro SENSORS_INSMOD_1(chip), if you have 2
you use macro SENSORS_INSMOD_2(chip1,chip2), etc. If you do not want to
bother with chip types, you can use SENSORS_INSMOD_0.
A enum is automatically defined as follows:
enum chips { any_chip, chip1, chip2, ... }
Attaching to an adapter
-----------------------
@ -264,17 +221,10 @@ detected at a specific address, another callback is called.
return i2c_probe(adapter,&addr_data,&foo_detect_client);
}
For `sensors' drivers, use the i2c_detect function instead:
int foo_attach_adapter(struct i2c_adapter *adapter)
{
return i2c_detect(adapter,&addr_data,&foo_detect_client);
}
Remember, structure `addr_data' is defined by the macros explained above,
so you do not have to define it yourself.
The i2c_probe or i2c_detect function will call the foo_detect_client
The i2c_probe function will call the foo_detect_client
function only for those i2c addresses that actually have a device on
them (unless a `force' parameter was used). In addition, addresses that
are already in use (by some other registered client) are skipped.
@ -283,19 +233,18 @@ are already in use (by some other registered client) are skipped.
The detect client function
--------------------------
The detect client function is called by i2c_probe or i2c_detect.
The `kind' parameter contains 0 if this call is due to a `force'
parameter, and -1 otherwise (for i2c_detect, it contains 0 if
this call is due to the generic `force' parameter, and the chip type
number if it is due to a specific `force' parameter).
The detect client function is called by i2c_probe. The `kind' parameter
contains -1 for a probed detection, 0 for a forced detection, or a positive
number for a forced detection with a chip type forced.
Below, some things are only needed if this is a `sensors' driver. Those
parts are between /* SENSORS ONLY START */ and /* SENSORS ONLY END */
markers.
This function should only return an error (any value != 0) if there is
some reason why no more detection should be done anymore. If the
detection just fails for this address, return 0.
Returning an error different from -ENODEV in a detect function will cause
the detection to stop: other addresses and adapters won't be scanned.
This should only be done on fatal or internal errors, such as a memory
shortage or i2c_attach_client failing.
For now, you can ignore the `flags' parameter. It is there for future use.
@ -320,11 +269,10 @@ For now, you can ignore the `flags' parameter. It is there for future use.
const char *type_name = "";
int is_isa = i2c_is_isa_adapter(adapter);
if (is_isa) {
/* Do this only if the chip can additionally be found on the ISA bus
(hybrid chip). */
/* If this client can't be on the ISA bus at all, we can stop now
(call `goto ERROR0'). But for kicks, we will assume it is all
right. */
if (is_isa) {
/* Discard immediately if this ISA range is already used */
if (check_region(address,FOO_EXTENT))
@ -495,15 +443,13 @@ much simpler than the attachment code, fortunately!
/* SENSORS ONLY END */
/* Try to detach the client from i2c space */
if ((err = i2c_detach_client(client))) {
printk("foo.o: Client deregistration failed, client not detached.\n");
if ((err = i2c_detach_client(client)))
return err;
}
/* SENSORS ONLY START */
/* HYBRID SENSORS CHIP ONLY START */
if i2c_is_isa_client(client)
release_region(client->addr,LM78_EXTENT);
/* SENSORS ONLY END */
/* HYBRID SENSORS CHIP ONLY END */
kfree(client); /* Frees client data too, if allocated at the same time */
return 0;

View File

@ -2,7 +2,7 @@
----------------------------
H. Peter Anvin <hpa@zytor.com>
Last update 2002-01-01
Last update 2005-09-02
On the i386 platform, the Linux kernel uses a rather complicated boot
convention. This has evolved partially due to historical aspects, as
@ -34,6 +34,8 @@ Protocol 2.02: (Kernel 2.4.0-test3-pre3) New command line protocol.
Protocol 2.03: (Kernel 2.4.18-pre1) Explicitly makes the highest possible
initrd address available to the bootloader.
Protocol 2.04: (Kernel 2.6.14) Extend the syssize field to four bytes.
**** MEMORY LAYOUT
@ -103,10 +105,9 @@ The header looks like:
Offset Proto Name Meaning
/Size
01F1/1 ALL setup_sects The size of the setup in sectors
01F1/1 ALL(1 setup_sects The size of the setup in sectors
01F2/2 ALL root_flags If set, the root is mounted readonly
01F4/2 ALL syssize DO NOT USE - for bootsect.S use only
01F6/2 ALL swap_dev DO NOT USE - obsolete
01F4/4 2.04+(2 syssize The size of the 32-bit code in 16-byte paras
01F8/2 ALL ram_size DO NOT USE - for bootsect.S use only
01FA/2 ALL vid_mode Video mode control
01FC/2 ALL root_dev Default root device number
@ -129,9 +130,13 @@ Offset Proto Name Meaning
0228/4 2.02+ cmd_line_ptr 32-bit pointer to the kernel command line
022C/4 2.03+ initrd_addr_max Highest legal initrd address
For backwards compatibility, if the setup_sects field contains 0, the
(1) For backwards compatibility, if the setup_sects field contains 0, the
real value is 4.
(2) For boot protocol prior to 2.04, the upper two bytes of the syssize
field are unusable, which means the size of a bzImage kernel
cannot be determined.
If the "HdrS" (0x53726448) magic number is not found at offset 0x202,
the boot protocol version is "old". Loading an old kernel, the
following parameters should be assumed:
@ -230,12 +235,16 @@ loader to communicate with the kernel. Some of its options are also
relevant to the boot loader itself, see "special command line options"
below.
The kernel command line is a null-terminated string up to 255
characters long, plus the final null.
The kernel command line is a null-terminated string currently up to
255 characters long, plus the final null. A string that is too long
will be automatically truncated by the kernel, a boot loader may allow
a longer command line to be passed to permit future kernels to extend
this limit.
If the boot protocol version is 2.02 or later, the address of the
kernel command line is given by the header field cmd_line_ptr (see
above.)
above.) This address can be anywhere between the end of the setup
heap and 0xA0000.
If the protocol version is *not* 2.02 or higher, the kernel
command line is entered using the following protocol:
@ -255,7 +264,7 @@ command line is entered using the following protocol:
**** SAMPLE BOOT CONFIGURATION
As a sample configuration, assume the following layout of the real
mode segment:
mode segment (this is a typical, and recommended layout):
0x0000-0x7FFF Real mode kernel
0x8000-0x8FFF Stack and heap
@ -312,9 +321,9 @@ Such a boot loader should enter the following fields in the header:
**** LOADING THE REST OF THE KERNEL
The non-real-mode kernel starts at offset (setup_sects+1)*512 in the
kernel file (again, if setup_sects == 0 the real value is 4.) It
should be loaded at address 0x10000 for Image/zImage kernels and
The 32-bit (non-real-mode) kernel starts at offset (setup_sects+1)*512
in the kernel file (again, if setup_sects == 0 the real value is 4.)
It should be loaded at address 0x10000 for Image/zImage kernels and
0x100000 for bzImage kernels.
The kernel is a bzImage kernel if the protocol >= 2.00 and the 0x01

View File

@ -1,16 +1,16 @@
IBM ThinkPad ACPI Extras Driver
Version 0.8
8 November 2004
Version 0.12
17 August 2005
Borislav Deianov <borislav@users.sf.net>
http://ibm-acpi.sf.net/
This is a Linux ACPI driver for the IBM ThinkPad laptops. It aims to
support various features of these laptops which are accessible through
the ACPI framework but not otherwise supported by the generic Linux
ACPI drivers.
This is a Linux ACPI driver for the IBM ThinkPad laptops. It supports
various features of these laptops which are accessible through the
ACPI framework but not otherwise supported by the generic Linux ACPI
drivers.
Status
@ -25,9 +25,14 @@ detailed description):
- ThinkLight on and off
- limited docking and undocking
- UltraBay eject
- Experimental: CMOS control
- Experimental: LED control
- Experimental: ACPI sounds
- CMOS control
- LED control
- ACPI sounds
- temperature sensors
- Experimental: embedded controller register dump
- Experimental: LCD brightness control
- Experimental: volume control
- Experimental: fan speed, fan enable/disable
A compatibility table by model and feature is maintained on the web
site, http://ibm-acpi.sf.net/. I appreciate any success or failure
@ -91,12 +96,12 @@ driver is still in the alpha stage, the exact proc file format and
commands supported by the various features is guaranteed to change
frequently.
Driver Version -- /proc/acpi/ibm/driver
--------------------------------------
Driver version -- /proc/acpi/ibm/driver
---------------------------------------
The driver name and version. No commands can be written to this file.
Hot Keys -- /proc/acpi/ibm/hotkey
Hot keys -- /proc/acpi/ibm/hotkey
---------------------------------
Without this driver, only the Fn-F4 key (sleep button) generates an
@ -188,7 +193,7 @@ and, on the X40, video corruption. By disabling automatic switching,
the flickering or video corruption can be avoided.
The video_switch command cycles through the available video outputs
(it sumulates the behavior of Fn-F7).
(it simulates the behavior of Fn-F7).
Video expansion can be toggled through this feature. This controls
whether the display is expanded to fill the entire LCD screen when a
@ -201,6 +206,12 @@ Fn-F7 from working. This also disables the video output switching
features of this driver, as it uses the same ACPI methods as
Fn-F7. Video switching on the console should still work.
UPDATE: There's now a patch for the X.org Radeon driver which
addresses this issue. Some people are reporting success with the patch
while others are still having problems. For more information:
https://bugs.freedesktop.org/show_bug.cgi?id=2000
ThinkLight control -- /proc/acpi/ibm/light
------------------------------------------
@ -211,7 +222,7 @@ models which do not make the status available will show it as
echo on > /proc/acpi/ibm/light
echo off > /proc/acpi/ibm/light
Docking / Undocking -- /proc/acpi/ibm/dock
Docking / undocking -- /proc/acpi/ibm/dock
------------------------------------------
Docking and undocking (e.g. with the X4 UltraBase) requires some
@ -228,11 +239,15 @@ NOTE: These events will only be generated if the laptop was docked
when originally booted. This is due to the current lack of support for
hot plugging of devices in the Linux ACPI framework. If the laptop was
booted while not in the dock, the following message is shown in the
logs: "ibm_acpi: dock device not present". No dock-related events are
generated but the dock and undock commands described below still
work. They can be executed manually or triggered by Fn key
combinations (see the example acpid configuration files included in
the driver tarball package available on the web site).
logs:
Mar 17 01:42:34 aero kernel: ibm_acpi: dock device not present
In this case, no dock-related events are generated but the dock and
undock commands described below still work. They can be executed
manually or triggered by Fn key combinations (see the example acpid
configuration files included in the driver tarball package available
on the web site).
When the eject request button on the dock is pressed, the first event
above is generated. The handler for this event should issue the
@ -267,7 +282,7 @@ the only docking stations currently supported are the X-series
UltraBase docks and "dumb" port replicators like the Mini Dock (the
latter don't need any ACPI support, actually).
UltraBay Eject -- /proc/acpi/ibm/bay
UltraBay eject -- /proc/acpi/ibm/bay
------------------------------------
Inserting or ejecting an UltraBay device requires some actions to be
@ -284,8 +299,11 @@ when the laptop was originally booted (on the X series, the UltraBay
is in the dock, so it may not be present if the laptop was undocked).
This is due to the current lack of support for hot plugging of devices
in the Linux ACPI framework. If the laptop was booted without the
UltraBay, the following message is shown in the logs: "ibm_acpi: bay
device not present". No bay-related events are generated but the eject
UltraBay, the following message is shown in the logs:
Mar 17 01:42:34 aero kernel: ibm_acpi: bay device not present
In this case, no bay-related events are generated but the eject
command described below still works. It can be executed manually or
triggered by a hot key combination.
@ -306,22 +324,33 @@ necessary to enable the UltraBay device (e.g. call idectl).
The contents of the /proc/acpi/ibm/bay file shows the current status
of the UltraBay, as provided by the ACPI framework.
Experimental Features
---------------------
EXPERIMENTAL warm eject support on the 600e/x, A22p and A3x (To use
this feature, you need to supply the experimental=1 parameter when
loading the module):
The following features are marked experimental because using them
involves guessing the correct values of some parameters. Guessing
incorrectly may have undesirable effects like crashing your
ThinkPad. USE THESE WITH CAUTION! To activate them, you'll need to
supply the experimental=1 parameter when loading the module.
These models do not have a button near the UltraBay device to request
a hot eject but rather require the laptop to be put to sleep
(suspend-to-ram) before the bay device is ejected or inserted).
The sequence of steps to eject the device is as follows:
Experimental: CMOS control - /proc/acpi/ibm/cmos
------------------------------------------------
echo eject > /proc/acpi/ibm/bay
put the ThinkPad to sleep
remove the drive
resume from sleep
cat /proc/acpi/ibm/bay should show that the drive was removed
On the A3x, both the UltraBay 2000 and UltraBay Plus devices are
supported. Use "eject2" instead of "eject" for the second bay.
Note: the UltraBay eject support on the 600e/x, A22p and A3x is
EXPERIMENTAL and may not work as expected. USE WITH CAUTION!
CMOS control -- /proc/acpi/ibm/cmos
-----------------------------------
This feature is used internally by the ACPI firmware to control the
ThinkLight on most newer ThinkPad models. It appears that it can also
control LCD brightness, sounds volume and more, but only on some
models.
ThinkLight on most newer ThinkPad models. It may also control LCD
brightness, sounds volume and more, but only on some models.
The commands are non-negative integer numbers:
@ -330,10 +359,9 @@ The commands are non-negative integer numbers:
echo 2 >/proc/acpi/ibm/cmos
...
The range of numbers which are used internally by various models is 0
to 21, but it's possible that numbers outside this range have
interesting behavior. Here is the behavior on the X40 (tpb is the
ThinkPad Buttons utility):
The range of valid numbers is 0 to 21, but not all have an effect and
the behavior varies from model to model. Here is the behavior on the
X40 (tpb is the ThinkPad Buttons utility):
0 - no effect but tpb reports "Volume down"
1 - no effect but tpb reports "Volume up"
@ -346,26 +374,18 @@ ThinkPad Buttons utility):
13 - ThinkLight off
14 - no effect but tpb reports ThinkLight status change
If you try this feature, please send me a report similar to the
above. On models which allow control of LCD brightness or sound
volume, I'd like to provide this functionality in an user-friendly
way, but first I need a way to identify the models which this is
possible.
Experimental: LED control - /proc/acpi/ibm/LED
----------------------------------------------
LED control -- /proc/acpi/ibm/led
---------------------------------
Some of the LED indicators can be controlled through this feature. The
available commands are:
echo <led number> on >/proc/acpi/ibm/led
echo <led number> off >/proc/acpi/ibm/led
echo <led number> blink >/proc/acpi/ibm/led
echo '<led number> on' >/proc/acpi/ibm/led
echo '<led number> off' >/proc/acpi/ibm/led
echo '<led number> blink' >/proc/acpi/ibm/led
The <led number> parameter is a non-negative integer. The range of LED
numbers used internally by various models is 0 to 7 but it's possible
that numbers outside this range are also valid. Here is the mapping on
the X40:
The <led number> range is 0 to 7. The set of LEDs that can be
controlled varies from model to model. Here is the mapping on the X40:
0 - power
1 - battery (orange)
@ -376,49 +396,224 @@ the X40:
All of the above can be turned on and off and can be made to blink.
If you try this feature, please send me a report similar to the
above. I'd like to provide this functionality in an user-friendly way,
but first I need to identify the which numbers correspond to which
LEDs on various models.
Experimental: ACPI sounds - /proc/acpi/ibm/beep
-----------------------------------------------
ACPI sounds -- /proc/acpi/ibm/beep
----------------------------------
The BEEP method is used internally by the ACPI firmware to provide
audible alerts in various situtation. This feature allows the same
audible alerts in various situations. This feature allows the same
sounds to be triggered manually.
The commands are non-negative integer numbers:
echo 0 >/proc/acpi/ibm/beep
echo 1 >/proc/acpi/ibm/beep
echo 2 >/proc/acpi/ibm/beep
...
echo <number> >/proc/acpi/ibm/beep
The range of numbers which are used internally by various models is 0
to 17, but it's possible that numbers outside this range are also
valid. Here is the behavior on the X40:
The valid <number> range is 0 to 17. Not all numbers trigger sounds
and the sounds vary from model to model. Here is the behavior on the
X40:
2 - two beeps, pause, third beep
0 - stop a sound in progress (but use 17 to stop 16)
2 - two beeps, pause, third beep ("low battery")
3 - single beep
4 - "unable"
4 - high, followed by low-pitched beep ("unable")
5 - single beep
6 - "AC/DC"
6 - very high, followed by high-pitched beep ("AC/DC")
7 - high-pitched beep
9 - three short beeps
10 - very long beep
12 - low-pitched beep
15 - three high-pitched beeps repeating constantly, stop with 0
16 - one medium-pitched beep repeating constantly, stop with 17
17 - stop 16
(I've only been able to identify a couple of them).
Temperature sensors -- /proc/acpi/ibm/thermal
---------------------------------------------
If you try this feature, please send me a report similar to the
above. I'd like to provide this functionality in an user-friendly way,
but first I need to identify the which numbers correspond to which
sounds on various models.
Most ThinkPads include six or more separate temperature sensors but
only expose the CPU temperature through the standard ACPI methods.
This feature shows readings from up to eight different sensors. Some
readings may not be valid, e.g. may show large negative values. For
example, on the X40, a typical output may be:
temperatures: 42 42 45 41 36 -128 33 -128
Thomas Gruber took his R51 apart and traced all six active sensors in
his laptop (the location of sensors may vary on other models):
1: CPU
2: Mini PCI Module
3: HDD
4: GPU
5: Battery
6: N/A
7: Battery
8: N/A
No commands can be written to this file.
EXPERIMENTAL: Embedded controller reigster dump -- /proc/acpi/ibm/ecdump
------------------------------------------------------------------------
This feature is marked EXPERIMENTAL because the implementation
directly accesses hardware registers and may not work as expected. USE
WITH CAUTION! To use this feature, you need to supply the
experimental=1 parameter when loading the module.
This feature dumps the values of 256 embedded controller
registers. Values which have changed since the last time the registers
were dumped are marked with a star:
[root@x40 ibm-acpi]# cat /proc/acpi/ibm/ecdump
EC +00 +01 +02 +03 +04 +05 +06 +07 +08 +09 +0a +0b +0c +0d +0e +0f
EC 0x00: a7 47 87 01 fe 96 00 08 01 00 cb 00 00 00 40 00
EC 0x10: 00 00 ff ff f4 3c 87 09 01 ff 42 01 ff ff 0d 00
EC 0x20: 00 00 00 00 00 00 00 00 00 00 00 03 43 00 00 80
EC 0x30: 01 07 1a 00 30 04 00 00 *85 00 00 10 00 50 00 00
EC 0x40: 00 00 00 00 00 00 14 01 00 04 00 00 00 00 00 00
EC 0x50: 00 c0 02 0d 00 01 01 02 02 03 03 03 03 *bc *02 *bc
EC 0x60: *02 *bc *02 00 00 00 00 00 00 00 00 00 00 00 00 00
EC 0x70: 00 00 00 00 00 12 30 40 *24 *26 *2c *27 *20 80 *1f 80
EC 0x80: 00 00 00 06 *37 *0e 03 00 00 00 0e 07 00 00 00 00
EC 0x90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
EC 0xa0: *ff 09 ff 09 ff ff *64 00 *00 *00 *a2 41 *ff *ff *e0 00
EC 0xb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
EC 0xc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
EC 0xd0: 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
EC 0xe0: 00 00 00 00 00 00 00 00 11 20 49 04 24 06 55 03
EC 0xf0: 31 55 48 54 35 38 57 57 08 2f 45 73 07 65 6c 1a
This feature can be used to determine the register holding the fan
speed on some models. To do that, do the following:
- make sure the battery is fully charged
- make sure the fan is running
- run 'cat /proc/acpi/ibm/ecdump' several times, once per second or so
The first step makes sure various charging-related values don't
vary. The second ensures that the fan-related values do vary, since
the fan speed fluctuates a bit. The third will (hopefully) mark the
fan register with a star:
[root@x40 ibm-acpi]# cat /proc/acpi/ibm/ecdump
EC +00 +01 +02 +03 +04 +05 +06 +07 +08 +09 +0a +0b +0c +0d +0e +0f
EC 0x00: a7 47 87 01 fe 96 00 08 01 00 cb 00 00 00 40 00
EC 0x10: 00 00 ff ff f4 3c 87 09 01 ff 42 01 ff ff 0d 00
EC 0x20: 00 00 00 00 00 00 00 00 00 00 00 03 43 00 00 80
EC 0x30: 01 07 1a 00 30 04 00 00 85 00 00 10 00 50 00 00
EC 0x40: 00 00 00 00 00 00 14 01 00 04 00 00 00 00 00 00
EC 0x50: 00 c0 02 0d 00 01 01 02 02 03 03 03 03 bc 02 bc
EC 0x60: 02 bc 02 00 00 00 00 00 00 00 00 00 00 00 00 00
EC 0x70: 00 00 00 00 00 12 30 40 24 27 2c 27 21 80 1f 80
EC 0x80: 00 00 00 06 *be 0d 03 00 00 00 0e 07 00 00 00 00
EC 0x90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
EC 0xa0: ff 09 ff 09 ff ff 64 00 00 00 a2 41 ff ff e0 00
EC 0xb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
EC 0xc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
EC 0xd0: 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
EC 0xe0: 00 00 00 00 00 00 00 00 11 20 49 04 24 06 55 03
EC 0xf0: 31 55 48 54 35 38 57 57 08 2f 45 73 07 65 6c 1a
Another set of values that varies often is the temperature
readings. Since temperatures don't change vary fast, you can take
several quick dumps to eliminate them.
You can use a similar method to figure out the meaning of other
embedded controller registers - e.g. make sure nothing else changes
except the charging or discharging battery to determine which
registers contain the current battery capacity, etc. If you experiment
with this, do send me your results (including some complete dumps with
a description of the conditions when they were taken.)
EXPERIMENTAL: LCD brightness control -- /proc/acpi/ibm/brightness
-----------------------------------------------------------------
This feature is marked EXPERIMENTAL because the implementation
directly accesses hardware registers and may not work as expected. USE
WITH CAUTION! To use this feature, you need to supply the
experimental=1 parameter when loading the module.
This feature allows software control of the LCD brightness on ThinkPad
models which don't have a hardware brightness slider. The available
commands are:
echo up >/proc/acpi/ibm/brightness
echo down >/proc/acpi/ibm/brightness
echo 'level <level>' >/proc/acpi/ibm/brightness
The <level> number range is 0 to 7, although not all of them may be
distinct. The current brightness level is shown in the file.
EXPERIMENTAL: Volume control -- /proc/acpi/ibm/volume
-----------------------------------------------------
This feature is marked EXPERIMENTAL because the implementation
directly accesses hardware registers and may not work as expected. USE
WITH CAUTION! To use this feature, you need to supply the
experimental=1 parameter when loading the module.
This feature allows volume control on ThinkPad models which don't have
a hardware volume knob. The available commands are:
echo up >/proc/acpi/ibm/volume
echo down >/proc/acpi/ibm/volume
echo mute >/proc/acpi/ibm/volume
echo 'level <level>' >/proc/acpi/ibm/volume
The <level> number range is 0 to 15 although not all of them may be
distinct. The unmute the volume after the mute command, use either the
up or down command (the level command will not unmute the volume).
The current volume level and mute state is shown in the file.
EXPERIMENTAL: fan speed, fan enable/disable -- /proc/acpi/ibm/fan
-----------------------------------------------------------------
This feature is marked EXPERIMENTAL because the implementation
directly accesses hardware registers and may not work as expected. USE
WITH CAUTION! To use this feature, you need to supply the
experimental=1 parameter when loading the module.
This feature attempts to show the current fan speed. The speed is read
directly from the hardware registers of the embedded controller. This
is known to work on later R, T and X series ThinkPads but may show a
bogus value on other models.
The fan may be enabled or disabled with the following commands:
echo enable >/proc/acpi/ibm/fan
echo disable >/proc/acpi/ibm/fan
WARNING WARNING WARNING: do not leave the fan disabled unless you are
monitoring the temperature sensor readings and you are ready to enable
it if necessary to avoid overheating.
The fan only runs if it's enabled *and* the various temperature
sensors which control it read high enough. On the X40, this seems to
depend on the CPU and HDD temperatures. Specifically, the fan is
turned on when either the CPU temperature climbs to 56 degrees or the
HDD temperature climbs to 46 degrees. The fan is turned off when the
CPU temperature drops to 49 degrees and the HDD temperature drops to
41 degrees. These thresholds cannot currently be controlled.
On the X31 and X40 (and ONLY on those models), the fan speed can be
controlled to a certain degree. Once the fan is running, it can be
forced to run faster or slower with the following command:
echo 'speed <speed>' > /proc/acpi/ibm/thermal
The sustainable range of fan speeds on the X40 appears to be from
about 3700 to about 7350. Values outside this range either do not have
any effect or the fan speed eventually settles somewhere in that
range. The fan cannot be stopped or started with this command.
On the 570, temperature readings are not available through this
feature and the fan control works a little differently. The fan speed
is reported in levels from 0 (off) to 7 (max) and can be controlled
with the following command:
echo 'level <level>' > /proc/acpi/ibm/thermal
Multiple Command, Module Parameters
-----------------------------------
Multiple Commands, Module Parameters
------------------------------------
Multiple commands can be written to the proc files in one shot by
separating them with commas, for example:
@ -451,24 +646,19 @@ scripts (included with ibm-acpi for completeness):
/usr/local/sbin/laptop_mode -- from the Linux kernel source
distribution, see Documentation/laptop-mode.txt
/sbin/service -- comes with Redhat/Fedora distributions
/usr/sbin/hibernate -- from the Software Suspend 2 distribution,
see http://softwaresuspend.berlios.de/
Toan T Nguyen <ntt@control.uchicago.edu> has written a SuSE powersave
script for the X20, included in config/usr/sbin/ibm_hotkeys_X20
Toan T Nguyen <ntt@physics.ucla.edu> notes that Suse uses the
powersave program to suspend ('powersave --suspend-to-ram') or
hibernate ('powersave --suspend-to-disk'). This means that the
hibernate script is not needed on that distribution.
Henrik Brix Andersen <brix@gentoo.org> has written a Gentoo ACPI event
handler script for the X31. You can get the latest version from
http://dev.gentoo.org/~brix/files/x31.sh
David Schweikert <dws@ee.eth.ch> has written an alternative blank.sh
script which works on Debian systems, included in
configs/etc/acpi/actions/blank-debian.sh
TODO
----
I'd like to implement the following features but haven't yet found the
time and/or I don't yet know how to implement them:
- UltraBay floppy drive support
script which works on Debian systems. This scripts has now been
extended to also work on Fedora systems and included as the default
blank.sh in the distribution.

View File

@ -0,0 +1,84 @@
Apple Touchpad Driver (appletouch)
----------------------------------
Copyright (C) 2005 Stelian Pop <stelian@popies.net>
appletouch is a Linux kernel driver for the USB touchpad found on post
February 2005 Apple Alu Powerbooks.
This driver is derived from Johannes Berg's appletrackpad driver[1], but it has
been improved in some areas:
* appletouch is a full kernel driver, no userspace program is necessary
* appletouch can be interfaced with the synaptics X11 driver, in order
to have touchpad acceleration, scrolling, etc.
Credits go to Johannes Berg for reverse-engineering the touchpad protocol,
Frank Arnold for further improvements, and Alex Harper for some additional
information about the inner workings of the touchpad sensors.
Usage:
------
In order to use the touchpad in the basic mode, compile the driver and load
the module. A new input device will be detected and you will be able to read
the mouse data from /dev/input/mice (using gpm, or X11).
In X11, you can configure the touchpad to use the synaptics X11 driver, which
will give additional functionalities, like acceleration, scrolling, 2 finger
tap for middle button mouse emulation, 3 finger tap for right button mouse
emulation, etc. In order to do this, make sure you're using a recent version of
the synaptics driver (tested with 0.14.2, available from [2]), and configure a
new input device in your X11 configuration file (take a look below for an
example). For additional configuration, see the synaptics driver documentation.
Section "InputDevice"
Identifier "Synaptics Touchpad"
Driver "synaptics"
Option "SendCoreEvents" "true"
Option "Device" "/dev/input/mice"
Option "Protocol" "auto-dev"
Option "LeftEdge" "0"
Option "RightEdge" "850"
Option "TopEdge" "0"
Option "BottomEdge" "645"
Option "MinSpeed" "0.4"
Option "MaxSpeed" "1"
Option "AccelFactor" "0.02"
Option "FingerLow" "0"
Option "FingerHigh" "30"
Option "MaxTapMove" "20"
Option "MaxTapTime" "100"
Option "HorizScrollDelta" "0"
Option "VertScrollDelta" "30"
Option "SHMConfig" "on"
EndSection
Section "ServerLayout"
...
InputDevice "Mouse"
InputDevice "Synaptics Touchpad"
...
EndSection
Fuzz problems:
--------------
The touchpad sensors are very sensitive to heat, and will generate a lot of
noise when the temperature changes. This is especially true when you power-on
the laptop for the first time.
The appletouch driver tries to handle this noise and auto adapt itself, but it
is not perfect. If finger movements are not recognized anymore, try reloading
the driver.
You can activate debugging using the 'debug' module parameter. A value of 0
deactivates any debugging, 1 activates tracing of invalid samples, 2 activates
full tracing (each sample is being traced):
modprobe appletouch debug=1
or
echo "1" > /sys/module/appletouch/parameters/debug
Links:
------
[1]: http://johannes.sipsolutions.net/PowerBook/touchpad/
[2]: http://web.telia.com/~u89404340/touchpad/index.html

View File

@ -0,0 +1,203 @@
Driver documentation for yealink usb-p1k phones
0. Status
~~~~~~~~~
The p1k is a relatively cheap usb 1.1 phone with:
- keyboard full support, yealink.ko / input event API
- LCD full support, yealink.ko / sysfs API
- LED full support, yealink.ko / sysfs API
- dialtone full support, yealink.ko / sysfs API
- ringtone full support, yealink.ko / sysfs API
- audio playback full support, snd_usb_audio.ko / alsa API
- audio record full support, snd_usb_audio.ko / alsa API
For vendor documentation see http://www.yealink.com
1. Compilation (stand alone version)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Currently only kernel 2.6.x.y versions are supported.
In order to build the yealink.ko module do:
make
If you encounter problems please check if in the MAKE_OPTS variable in
the Makefile is pointing to the location where your kernel sources
are located, default /usr/src/linux.
2. keyboard features
~~~~~~~~~~~~~~~~~~~~
The current mapping in the kernel is provided by the map_p1k_to_key
function:
Physical USB-P1K button layout input events
up up
IN OUT left, right
down down
pickup C hangup enter, backspace, escape
1 2 3 1, 2, 3
4 5 6 4, 5, 6,
7 8 9 7, 8, 9,
* 0 # *, 0, #,
The "up" and "down" keys, are symbolised by arrows on the button.
The "pickup" and "hangup" keys are symbolised by a green and red phone
on the button.
3. LCD features
~~~~~~~~~~~~~~~
The LCD is divided and organised as a 3 line display:
|[] [][] [][] [][] in |[][]
|[] M [][] D [][] : [][] out |[][]
store
NEW REP SU MO TU WE TH FR SA
[] [] [] [] [] [] [] [] [] [] [] []
[] [] [] [] [] [] [] [] [] [] [] []
Line 1 Format (see below) : 18.e8.M8.88...188
Icon names : M D : IN OUT STORE
Line 2 Format : .........
Icon name : NEW REP SU MO TU WE TH FR SA
Line 3 Format : 888888888888
Format description:
From a user space perspective the world is seperated in "digits" and "icons".
A digit can have a character set, an icon can only be ON or OFF.
Format specifier
'8' : Generic 7 segment digit with individual addressable segments
Reduced capabillity 7 segm digit, when segments are hard wired together.
'1' : 2 segments digit only able to produce a 1.
'e' : Most significant day of the month digit,
able to produce at least 1 2 3.
'M' : Most significant minute digit,
able to produce at least 0 1 2 3 4 5.
Icons or pictograms:
'.' : For example like AM, PM, SU, a 'dot' .. or other single segment
elements.
4. Driver usage
~~~~~~~~~~~~~~~
For userland the following interfaces are available using the sysfs interface:
/sys/.../
line1 Read/Write, lcd line1
line2 Read/Write, lcd line2
line3 Read/Write, lcd line3
get_icons Read, returns a set of available icons.
hide_icon Write, hide the element by writing the icon name.
show_icon Write, display the element by writing the icon name.
map_seg7 Read/Write, the 7 segments char set, common for all
yealink phones. (see map_to_7segment.h)
ringtone Write, upload binary representation of a ringtone,
see yealink.c. status EXPERIMENTAL due to potential
races between async. and sync usb calls.
4.1 lineX
~~~~~~~~~
Reading /sys/../lineX will return the format string with its current value:
Example:
cat ./line3
888888888888
Linux Rocks!
Writing to /sys/../lineX will set the coresponding LCD line.
- Excess characters are ignored.
- If less characters are written than allowed, the remaining digits are
unchanged.
- The tab '\t'and '\n' char does not overwrite the original content.
- Writing a space to an icon will always hide its content.
Example:
date +"%m.%e.%k:%M" | sed 's/^0/ /' > ./line1
Will update the LCD with the current date & time.
4.2 get_icons
~~~~~~~~~~~~~
Reading will return all available icon names and its current settings:
cat ./get_icons
on M
on D
on :
IN
OUT
STORE
NEW
REP
SU
MO
TU
WE
TH
FR
SA
LED
DIALTONE
RINGTONE
4.3 show/hide icons
~~~~~~~~~~~~~~~~~~~
Writing to these files will update the state of the icon.
Only one icon at a time can be updated.
If an icon is also on a ./lineX the corresponding value is
updated with the first letter of the icon.
Example - light up the store icon:
echo -n "STORE" > ./show_icon
cat ./line1
18.e8.M8.88...188
S
Example - sound the ringtone for 10 seconds:
echo -n RINGTONE > /sys/..../show_icon
sleep 10
echo -n RINGTONE > /sys/..../hide_icon
5. Sound features
~~~~~~~~~~~~~~~~~
Sound is supported by the ALSA driver: snd_usb_audio
One 16-bit channel with sample and playback rates of 8000 Hz is the practical
limit of the device.
Example - recording test:
arecord -v -d 10 -r 8000 -f S16_LE -t wav foobar.wav
Example - playback test:
aplay foobar.wav
6. Credits & Acknowledgments
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Olivier Vandorpe, for starting the usbb2k-api project doing much of
the reverse engineering.
- Martin Diehl, for pointing out how to handle USB memory allocation.
- Dmitry Torokhov, for the numerous code reviews and suggestions.

View File

@ -878,7 +878,7 @@ DVD_READ_STRUCT Read structure
error returns:
EINVAL physical.layer_num exceeds number of layers
EIO Recieved invalid response from drive
EIO Received invalid response from drive

View File

@ -31,7 +31,7 @@ This document describes the Linux kernel Makefiles.
=== 6 Architecture Makefiles
--- 6.1 Set variables to tweak the build to the architecture
--- 6.2 Add prerequisites to prepare:
--- 6.2 Add prerequisites to archprepare:
--- 6.3 List directories to visit when descending
--- 6.4 Architecture specific boot images
--- 6.5 Building non-kbuild targets
@ -734,18 +734,18 @@ When kbuild executes the following steps are followed (roughly):
for loadable kernel modules.
--- 6.2 Add prerequisites to prepare:
--- 6.2 Add prerequisites to archprepare:
The prepare: rule is used to list prerequisites that needs to be
The archprepare: rule is used to list prerequisites that needs to be
built before starting to descend down in the subdirectories.
This is usual header files containing assembler constants.
Example:
#arch/s390/Makefile
prepare: include/asm-$(ARCH)/offsets.h
#arch/arm/Makefile
archprepare: maketools
In this example the file include/asm-$(ARCH)/offsets.h will
be built before descending down in the subdirectories.
In this example the file target maketools will be processed
before descending down in the subdirectories.
See also chapter XXX-TODO that describe how kbuild supports
generating offset header files.
@ -872,7 +872,13 @@ When kbuild executes the following steps are followed (roughly):
Assignments to $(targets) are without $(obj)/ prefix.
if_changed may be used in conjunction with custom commands as
defined in 6.7 "Custom kbuild commands".
Note: It is a typical mistake to forget the FORCE prerequisite.
Another common pitfall is that whitespace is sometimes
significant; for instance, the below will fail (note the extra space
after the comma):
target: source(s) FORCE
#WRONG!# $(call if_changed, ld/objcopy/gzip)
ld
Link target. Often LDFLAGS_$@ is used to set specific options to ld.

View File

@ -39,8 +39,7 @@ SETUP
and apply http://lse.sourceforge.net/kdump/patches/kexec-tools-1.101-kdump.patch
and after that build the source.
2) Download and build the appropriate (latest) kexec/kdump (-mm) kernel
patchset and apply it to the vanilla kernel tree.
2) Download and build the appropriate (2.6.13-rc1 onwards) vanilla kernel.
Two kernels need to be built in order to get this feature working.
@ -84,15 +83,16 @@ SETUP
4) Load the second kernel to be booted using:
kexec -p <second-kernel> --crash-dump --args-linux --append="root=<root-dev>
init 1 irqpoll"
kexec -p <second-kernel> --args-linux --elf32-core-headers
--append="root=<root-dev> init 1 irqpoll"
Note: i) <second-kernel> has to be a vmlinux image. bzImage will not work,
as of now.
ii) By default ELF headers are stored in ELF32 format (for i386). This
is sufficient to represent the physical memory up to 4GB. To store
headers in ELF64 format, specifiy "--elf64-core-headers" on the
kexec command line additionally.
ii) By default ELF headers are stored in ELF64 format. Option
--elf32-core-headers forces generation of ELF32 headers. gdb can
not open ELF64 headers on 32 bit systems. So creating ELF32
headers can come handy for users who have got non-PAE systems and
hence have memory less than 4GB.
iii) Specify "irqpoll" as command line parameter. This reduces driver
initialization failures in second kernel due to shared interrupts.

View File

@ -159,6 +159,20 @@ running once the system is up.
acpi_fake_ecdt [HW,ACPI] Workaround failure due to BIOS lacking ECDT
acpi_generic_hotkey [HW,ACPI]
Allow consolidated generic hotkey driver to
over-ride platform specific driver.
See also Documentation/acpi-hotkey.txt.
enable_timer_pin_1 [i386,x86-64]
Enable PIN 1 of APIC timer
Can be useful to work around chipset bugs (in particular on some ATI chipsets)
The kernel tries to set a reasonable default.
disable_timer_pin_1 [i386,x86-64]
Disable PIN 1 of APIC timer
Can be useful to work around chipset bugs.
ad1816= [HW,OSS]
Format: <io>,<irq>,<dma>,<dma2>
See also Documentation/sound/oss/AD1816.
@ -544,6 +558,7 @@ running once the system is up.
keyboard and can not control its state
(Don't attempt to blink the leds)
i8042.noaux [HW] Don't check for auxiliary (== mouse) port
i8042.nokbd [HW] Don't check/create keyboard port
i8042.nomux [HW] Don't check presence of an active multiplexing
controller
i8042.nopnp [HW] Don't use ACPIPnP / PnPBIOS to discover KBD/AUX
@ -1169,6 +1184,11 @@ running once the system is up.
New name for the ramdisk parameter.
See Documentation/ramdisk.txt.
rdinit= [KNL]
Format: <full_path>
Run specified binary instead of /init from the ramdisk,
used for early userspace startup. See initrd.
reboot= [BUGS=IA-32,BUGS=ARM,BUGS=IA-64] Rebooting mode
Format: <reboot_mode>[,<reboot_mode2>[,...]]
See arch/*/kernel/reboot.c.

View File

@ -30,7 +30,7 @@ other program after you have done the following:
Read the file 'binfmt_misc.txt' in this directory to know
more about the configuration process.
3) Add the following enries to /etc/rc.local or similar script
3) Add the following entries to /etc/rc.local or similar script
to be run at system startup:
# Insert BINFMT_MISC module into the kernel

View File

@ -0,0 +1,246 @@
===========================
Intel(R) PRO/Wireless 2100 Network Connection Driver for Linux
README.ipw2100
March 14, 2005
===========================
Index
---------------------------
0. Introduction
1. Release 1.1.0 Current Features
2. Command Line Parameters
3. Sysfs Helper Files
4. Radio Kill Switch
5. Dynamic Firmware
6. Power Management
7. Support
8. License
===========================
0. Introduction
------------ ----- ----- ---- --- -- -
This document provides a brief overview of the features supported by the
IPW2100 driver project. The main project website, where the latest
development version of the driver can be found, is:
http://ipw2100.sourceforge.net
There you can find the not only the latest releases, but also information about
potential fixes and patches, as well as links to the development mailing list
for the driver project.
===========================
1. Release 1.1.0 Current Supported Features
---------------------------
- Managed (BSS) and Ad-Hoc (IBSS)
- WEP (shared key and open)
- Wireless Tools support
- 802.1x (tested with XSupplicant 1.0.1)
Enabled (but not supported) features:
- Monitor/RFMon mode
- WPA/WPA2
The distinction between officially supported and enabled is a reflection
on the amount of validation and interoperability testing that has been
performed on a given feature.
===========================
2. Command Line Parameters
---------------------------
If the driver is built as a module, the following optional parameters are used
by entering them on the command line with the modprobe command using this
syntax:
modprobe ipw2100 [<option>=<VAL1><,VAL2>...]
For example, to disable the radio on driver loading, enter:
modprobe ipw2100 disable=1
The ipw2100 driver supports the following module parameters:
Name Value Example:
debug 0x0-0xffffffff debug=1024
mode 0,1,2 mode=1 /* AdHoc */
channel int channel=3 /* Only valid in AdHoc or Monitor */
associate boolean associate=0 /* Do NOT auto associate */
disable boolean disable=1 /* Do not power the HW */
===========================
3. Sysfs Helper Files
---------------------------
There are several ways to control the behavior of the driver. Many of the
general capabilities are exposed through the Wireless Tools (iwconfig). There
are a few capabilities that are exposed through entries in the Linux Sysfs.
----- Driver Level ------
For the driver level files, look in /sys/bus/pci/drivers/ipw2100/
debug_level
This controls the same global as the 'debug' module parameter. For
information on the various debugging levels available, run the 'dvals'
script found in the driver source directory.
NOTE: 'debug_level' is only enabled if CONFIG_IPW2100_DEBUG is turn
on.
----- Device Level ------
For the device level files look in
/sys/bus/pci/drivers/ipw2100/{PCI-ID}/
For example:
/sys/bus/pci/drivers/ipw2100/0000:02:01.0
For the device level files, see /sys/bus/pci/drivers/ipw2100:
rf_kill
read -
0 = RF kill not enabled (radio on)
1 = SW based RF kill active (radio off)
2 = HW based RF kill active (radio off)
3 = Both HW and SW RF kill active (radio off)
write -
0 = If SW based RF kill active, turn the radio back on
1 = If radio is on, activate SW based RF kill
NOTE: If you enable the SW based RF kill and then toggle the HW
based RF kill from ON -> OFF -> ON, the radio will NOT come back on
===========================
4. Radio Kill Switch
---------------------------
Most laptops provide the ability for the user to physically disable the radio.
Some vendors have implemented this as a physical switch that requires no
software to turn the radio off and on. On other laptops, however, the switch
is controlled through a button being pressed and a software driver then making
calls to turn the radio off and on. This is referred to as a "software based
RF kill switch"
See the Sysfs helper file 'rf_kill' for determining the state of the RF switch
on your system.
===========================
5. Dynamic Firmware
---------------------------
As the firmware is licensed under a restricted use license, it can not be
included within the kernel sources. To enable the IPW2100 you will need a
firmware image to load into the wireless NIC's processors.
You can obtain these images from <http://ipw2100.sf.net/firmware.php>.
See INSTALL for instructions on installing the firmware.
===========================
6. Power Management
---------------------------
The IPW2100 supports the configuration of the Power Save Protocol
through a private wireless extension interface. The IPW2100 supports
the following different modes:
off No power management. Radio is always on.
on Automatic power management
1-5 Different levels of power management. The higher the
number the greater the power savings, but with an impact to
packet latencies.
Power management works by powering down the radio after a certain
interval of time has passed where no packets are passed through the
radio. Once powered down, the radio remains in that state for a given
period of time. For higher power savings, the interval between last
packet processed to sleep is shorter and the sleep period is longer.
When the radio is asleep, the access point sending data to the station
must buffer packets at the AP until the station wakes up and requests
any buffered packets. If you have an AP that does not correctly support
the PSP protocol you may experience packet loss or very poor performance
while power management is enabled. If this is the case, you will need
to try and find a firmware update for your AP, or disable power
management (via `iwconfig eth1 power off`)
To configure the power level on the IPW2100 you use a combination of
iwconfig and iwpriv. iwconfig is used to turn power management on, off,
and set it to auto.
iwconfig eth1 power off Disables radio power down
iwconfig eth1 power on Enables radio power management to
last set level (defaults to AUTO)
iwpriv eth1 set_power 0 Sets power level to AUTO and enables
power management if not previously
enabled.
iwpriv eth1 set_power 1-5 Set the power level as specified,
enabling power management if not
previously enabled.
You can view the current power level setting via:
iwpriv eth1 get_power
It will return the current period or timeout that is configured as a string
in the form of xxxx/yyyy (z) where xxxx is the timeout interval (amount of
time after packet processing), yyyy is the period to sleep (amount of time to
wait before powering the radio and querying the access point for buffered
packets), and z is the 'power level'. If power management is turned off the
xxxx/yyyy will be replaced with 'off' -- the level reported will be the active
level if `iwconfig eth1 power on` is invoked.
===========================
7. Support
---------------------------
For general development information and support,
go to:
http://ipw2100.sf.net/
The ipw2100 1.1.0 driver and firmware can be downloaded from:
http://support.intel.com
For installation support on the ipw2100 1.1.0 driver on Linux kernels
2.6.8 or greater, email support is available from:
http://supportmail.intel.com
===========================
8. License
---------------------------
Copyright(c) 2003 - 2005 Intel Corporation. All rights reserved.
This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License (version 2) as
published by the Free Software Foundation.
This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
more details.
You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc., 59
Temple Place - Suite 330, Boston, MA 02111-1307, USA.
The full GNU General Public License is included in this distribution in the
file called LICENSE.
License Contact Information:
James P. Ketrenos <ipw2100-admin@linux.intel.com>
Intel Corporation, 5200 N.E. Elam Young Parkway, Hillsboro, OR 97124-6497

View File

@ -0,0 +1,300 @@
Intel(R) PRO/Wireless 2915ABG Driver for Linux in support of:
Intel(R) PRO/Wireless 2200BG Network Connection
Intel(R) PRO/Wireless 2915ABG Network Connection
Note: The Intel(R) PRO/Wireless 2915ABG Driver for Linux and Intel(R)
PRO/Wireless 2200BG Driver for Linux is a unified driver that works on
both hardware adapters listed above. In this document the Intel(R)
PRO/Wireless 2915ABG Driver for Linux will be used to reference the
unified driver.
Copyright (C) 2004-2005, Intel Corporation
README.ipw2200
Version: 1.0.0
Date : January 31, 2005
Index
-----------------------------------------------
1. Introduction
1.1. Overview of features
1.2. Module parameters
1.3. Wireless Extension Private Methods
1.4. Sysfs Helper Files
2. About the Version Numbers
3. Support
4. License
1. Introduction
-----------------------------------------------
The following sections attempt to provide a brief introduction to using
the Intel(R) PRO/Wireless 2915ABG Driver for Linux.
This document is not meant to be a comprehensive manual on
understanding or using wireless technologies, but should be sufficient
to get you moving without wires on Linux.
For information on building and installing the driver, see the INSTALL
file.
1.1. Overview of Features
-----------------------------------------------
The current release (1.0.0) supports the following features:
+ BSS mode (Infrastructure, Managed)
+ IBSS mode (Ad-Hoc)
+ WEP (OPEN and SHARED KEY mode)
+ 802.1x EAP via wpa_supplicant and xsupplicant
+ Wireless Extension support
+ Full B and G rate support (2200 and 2915)
+ Full A rate support (2915 only)
+ Transmit power control
+ S state support (ACPI suspend/resume)
+ long/short preamble support
1.2. Command Line Parameters
-----------------------------------------------
Like many modules used in the Linux kernel, the Intel(R) PRO/Wireless
2915ABG Driver for Linux allows certain configuration options to be
provided as module parameters. The most common way to specify a module
parameter is via the command line.
The general form is:
% modprobe ipw2200 parameter=value
Where the supported parameter are:
associate
Set to 0 to disable the auto scan-and-associate functionality of the
driver. If disabled, the driver will not attempt to scan
for and associate to a network until it has been configured with
one or more properties for the target network, for example configuring
the network SSID. Default is 1 (auto-associate)
Example: % modprobe ipw2200 associate=0
auto_create
Set to 0 to disable the auto creation of an Ad-Hoc network
matching the channel and network name parameters provided.
Default is 1.
channel
channel number for association. The normal method for setting
the channel would be to use the standard wireless tools
(i.e. `iwconfig eth1 channel 10`), but it is useful sometimes
to set this while debugging. Channel 0 means 'ANY'
debug
If using a debug build, this is used to control the amount of debug
info is logged. See the 'dval' and 'load' script for more info on
how to use this (the dval and load scripts are provided as part
of the ipw2200 development snapshot releases available from the
SourceForge project at http://ipw2200.sf.net)
mode
Can be used to set the default mode of the adapter.
0 = Managed, 1 = Ad-Hoc
1.3. Wireless Extension Private Methods
-----------------------------------------------
As an interface designed to handle generic hardware, there are certain
capabilities not exposed through the normal Wireless Tool interface. As
such, a provision is provided for a driver to declare custom, or
private, methods. The Intel(R) PRO/Wireless 2915ABG Driver for Linux
defines several of these to configure various settings.
The general form of using the private wireless methods is:
% iwpriv $IFNAME method parameters
Where $IFNAME is the interface name the device is registered with
(typically eth1, customized via one of the various network interface
name managers, such as ifrename)
The supported private methods are:
get_mode
Can be used to report out which IEEE mode the driver is
configured to support. Example:
% iwpriv eth1 get_mode
eth1 get_mode:802.11bg (6)
set_mode
Can be used to configure which IEEE mode the driver will
support.
Usage:
% iwpriv eth1 set_mode {mode}
Where {mode} is a number in the range 1-7:
1 802.11a (2915 only)
2 802.11b
3 802.11ab (2915 only)
4 802.11g
5 802.11ag (2915 only)
6 802.11bg
7 802.11abg (2915 only)
get_preamble
Can be used to report configuration of preamble length.
set_preamble
Can be used to set the configuration of preamble length:
Usage:
% iwpriv eth1 set_preamble {mode}
Where {mode} is one of:
1 Long preamble only
0 Auto (long or short based on connection)
1.4. Sysfs Helper Files:
-----------------------------------------------
The Linux kernel provides a pseudo file system that can be used to
access various components of the operating system. The Intel(R)
PRO/Wireless 2915ABG Driver for Linux exposes several configuration
parameters through this mechanism.
An entry in the sysfs can support reading and/or writing. You can
typically query the contents of a sysfs entry through the use of cat,
and can set the contents via echo. For example:
% cat /sys/bus/pci/drivers/ipw2200/debug_level
Will report the current debug level of the driver's logging subsystem
(only available if CONFIG_IPW_DEBUG was configured when the driver was
built).
You can set the debug level via:
% echo $VALUE > /sys/bus/pci/drivers/ipw2200/debug_level
Where $VALUE would be a number in the case of this sysfs entry. The
input to sysfs files does not have to be a number. For example, the
firmware loader used by hotplug utilizes sysfs entries for transferring
the firmware image from user space into the driver.
The Intel(R) PRO/Wireless 2915ABG Driver for Linux exposes sysfs entries
at two levels -- driver level, which apply to all instances of the
driver (in the event that there are more than one device installed) and
device level, which applies only to the single specific instance.
1.4.1 Driver Level Sysfs Helper Files
-----------------------------------------------
For the driver level files, look in /sys/bus/pci/drivers/ipw2200/
debug_level
This controls the same global as the 'debug' module parameter
1.4.2 Device Level Sysfs Helper Files
-----------------------------------------------
For the device level files, look in
/sys/bus/pci/drivers/ipw2200/{PCI-ID}/
For example:
/sys/bus/pci/drivers/ipw2200/0000:02:01.0
For the device level files, see /sys/bus/pci/[drivers/ipw2200:
rf_kill
read -
0 = RF kill not enabled (radio on)
1 = SW based RF kill active (radio off)
2 = HW based RF kill active (radio off)
3 = Both HW and SW RF kill active (radio off)
write -
0 = If SW based RF kill active, turn the radio back on
1 = If radio is on, activate SW based RF kill
NOTE: If you enable the SW based RF kill and then toggle the HW
based RF kill from ON -> OFF -> ON, the radio will NOT come back on
ucode
read-only access to the ucode version number
2. About the Version Numbers
-----------------------------------------------
Due to the nature of open source development projects, there are
frequently changes being incorporated that have not gone through
a complete validation process. These changes are incorporated into
development snapshot releases.
Releases are numbered with a three level scheme:
major.minor.development
Any version where the 'development' portion is 0 (for example
1.0.0, 1.1.0, etc.) indicates a stable version that will be made
available for kernel inclusion.
Any version where the 'development' portion is not a 0 (for
example 1.0.1, 1.1.5, etc.) indicates a development version that is
being made available for testing and cutting edge users. The stability
and functionality of the development releases are not know. We make
efforts to try and keep all snapshots reasonably stable, but due to the
frequency of their release, and the desire to get those releases
available as quickly as possible, unknown anomalies should be expected.
The major version number will be incremented when significant changes
are made to the driver. Currently, there are no major changes planned.
3. Support
-----------------------------------------------
For installation support of the 1.0.0 version, you can contact
http://supportmail.intel.com, or you can use the open source project
support.
For general information and support, go to:
http://ipw2200.sf.net/
4. License
-----------------------------------------------
Copyright(c) 2003 - 2005 Intel Corporation. All rights reserved.
This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License version 2 as
published by the Free Software Foundation.
This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
more details.
You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc., 59
Temple Place - Suite 330, Boston, MA 02111-1307, USA.
The full GNU General Public License is included in this distribution in the
file called LICENSE.
Contact Information:
James P. Ketrenos <ipw2100-admin@linux.intel.com>
Intel Corporation, 5200 N.E. Elam Young Parkway, Hillsboro, OR 97124-6497

View File

@ -1241,7 +1241,7 @@ traffic while still maintaining carrier on.
If running SNMP agents, the bonding driver should be loaded
before any network drivers participating in a bond. This requirement
is due to the the interface index (ipAdEntIfIndex) being associated to
is due to the interface index (ipAdEntIfIndex) being associated to
the first interface found with a given IP address. That is, there is
only one ipAdEntIfIndex for each IP address. For example, if eth0 and
eth1 are slaves of bond0 and the driver for eth0 is loaded before the
@ -1937,7 +1937,7 @@ switches currently available support 802.3ad.
If not explicitly configured (with ifconfig or ip link), the
MAC address of the bonding device is taken from its first slave
device. This MAC address is then passed to all following slaves and
remains persistent (even if the the first slave is removed) until the
remains persistent (even if the first slave is removed) until the
bonding device is brought down or reconfigured.
If you wish to change the MAC address, you can set it with

View File

@ -0,0 +1,352 @@
Chelsio N210 10Gb Ethernet Network Controller
Driver Release Notes for Linux
Version 2.1.1
June 20, 2005
CONTENTS
========
INTRODUCTION
FEATURES
PERFORMANCE
DRIVER MESSAGES
KNOWN ISSUES
SUPPORT
INTRODUCTION
============
This document describes the Linux driver for Chelsio 10Gb Ethernet Network
Controller. This driver supports the Chelsio N210 NIC and is backward
compatible with the Chelsio N110 model 10Gb NICs.
FEATURES
========
Adaptive Interrupts (adaptive-rx)
---------------------------------
This feature provides an adaptive algorithm that adjusts the interrupt
coalescing parameters, allowing the driver to dynamically adapt the latency
settings to achieve the highest performance during various types of network
load.
The interface used to control this feature is ethtool. Please see the
ethtool manpage for additional usage information.
By default, adaptive-rx is disabled.
To enable adaptive-rx:
ethtool -C <interface> adaptive-rx on
To disable adaptive-rx, use ethtool:
ethtool -C <interface> adaptive-rx off
After disabling adaptive-rx, the timer latency value will be set to 50us.
You may set the timer latency after disabling adaptive-rx:
ethtool -C <interface> rx-usecs <microseconds>
An example to set the timer latency value to 100us on eth0:
ethtool -C eth0 rx-usecs 100
You may also provide a timer latency value while disabling adpative-rx:
ethtool -C <interface> adaptive-rx off rx-usecs <microseconds>
If adaptive-rx is disabled and a timer latency value is specified, the timer
will be set to the specified value until changed by the user or until
adaptive-rx is enabled.
To view the status of the adaptive-rx and timer latency values:
ethtool -c <interface>
TCP Segmentation Offloading (TSO) Support
-----------------------------------------
This feature, also known as "large send", enables a system's protocol stack
to offload portions of outbound TCP processing to a network interface card
thereby reducing system CPU utilization and enhancing performance.
The interface used to control this feature is ethtool version 1.8 or higher.
Please see the ethtool manpage for additional usage information.
By default, TSO is enabled.
To disable TSO:
ethtool -K <interface> tso off
To enable TSO:
ethtool -K <interface> tso on
To view the status of TSO:
ethtool -k <interface>
PERFORMANCE
===========
The following information is provided as an example of how to change system
parameters for "performance tuning" an what value to use. You may or may not
want to change these system parameters, depending on your server/workstation
application. Doing so is not warranted in any way by Chelsio Communications,
and is done at "YOUR OWN RISK". Chelsio will not be held responsible for loss
of data or damage to equipment.
Your distribution may have a different way of doing things, or you may prefer
a different method. These commands are shown only to provide an example of
what to do and are by no means definitive.
Making any of the following system changes will only last until you reboot
your system. You may want to write a script that runs at boot-up which
includes the optimal settings for your system.
Setting PCI Latency Timer:
setpci -d 1425:* 0x0c.l=0x0000F800
Disabling TCP timestamp:
sysctl -w net.ipv4.tcp_timestamps=0
Disabling SACK:
sysctl -w net.ipv4.tcp_sack=0
Setting large number of incoming connection requests:
sysctl -w net.ipv4.tcp_max_syn_backlog=3000
Setting maximum receive socket buffer size:
sysctl -w net.core.rmem_max=1024000
Setting maximum send socket buffer size:
sysctl -w net.core.wmem_max=1024000
Set smp_affinity (on a multiprocessor system) to a single CPU:
echo 1 > /proc/irq/<interrupt_number>/smp_affinity
Setting default receive socket buffer size:
sysctl -w net.core.rmem_default=524287
Setting default send socket buffer size:
sysctl -w net.core.wmem_default=524287
Setting maximum option memory buffers:
sysctl -w net.core.optmem_max=524287
Setting maximum backlog (# of unprocessed packets before kernel drops):
sysctl -w net.core.netdev_max_backlog=300000
Setting TCP read buffers (min/default/max):
sysctl -w net.ipv4.tcp_rmem="10000000 10000000 10000000"
Setting TCP write buffers (min/pressure/max):
sysctl -w net.ipv4.tcp_wmem="10000000 10000000 10000000"
Setting TCP buffer space (min/pressure/max):
sysctl -w net.ipv4.tcp_mem="10000000 10000000 10000000"
TCP window size for single connections:
The receive buffer (RX_WINDOW) size must be at least as large as the
Bandwidth-Delay Product of the communication link between the sender and
receiver. Due to the variations of RTT, you may want to increase the buffer
size up to 2 times the Bandwidth-Delay Product. Reference page 289 of
"TCP/IP Illustrated, Volume 1, The Protocols" by W. Richard Stevens.
At 10Gb speeds, use the following formula:
RX_WINDOW >= 1.25MBytes * RTT(in milliseconds)
Example for RTT with 100us: RX_WINDOW = (1,250,000 * 0.1) = 125,000
RX_WINDOW sizes of 256KB - 512KB should be sufficient.
Setting the min, max, and default receive buffer (RX_WINDOW) size:
sysctl -w net.ipv4.tcp_rmem="<min> <default> <max>"
TCP window size for multiple connections:
The receive buffer (RX_WINDOW) size may be calculated the same as single
connections, but should be divided by the number of connections. The
smaller window prevents congestion and facilitates better pacing,
especially if/when MAC level flow control does not work well or when it is
not supported on the machine. Experimentation may be necessary to attain
the correct value. This method is provided as a starting point fot the
correct receive buffer size.
Setting the min, max, and default receive buffer (RX_WINDOW) size is
performed in the same manner as single connection.
DRIVER MESSAGES
===============
The following messages are the most common messages logged by syslog. These
may be found in /var/log/messages.
Driver up:
Chelsio Network Driver - version 2.1.1
NIC detected:
eth#: Chelsio N210 1x10GBaseX NIC (rev #), PCIX 133MHz/64-bit
Link up:
eth#: link is up at 10 Gbps, full duplex
Link down:
eth#: link is down
KNOWN ISSUES
============
These issues have been identified during testing. The following information
is provided as a workaround to the problem. In some cases, this problem is
inherent to Linux or to a particular Linux Distribution and/or hardware
platform.
1. Large number of TCP retransmits on a multiprocessor (SMP) system.
On a system with multiple CPUs, the interrupt (IRQ) for the network
controller may be bound to more than one CPU. This will cause TCP
retransmits if the packet data were to be split across different CPUs
and re-assembled in a different order than expected.
To eliminate the TCP retransmits, set smp_affinity on the particular
interrupt to a single CPU. You can locate the interrupt (IRQ) used on
the N110/N210 by using ifconfig:
ifconfig <dev_name> | grep Interrupt
Set the smp_affinity to a single CPU:
echo 1 > /proc/irq/<interrupt_number>/smp_affinity
It is highly suggested that you do not run the irqbalance daemon on your
system, as this will change any smp_affinity setting you have applied.
The irqbalance daemon runs on a 10 second interval and binds interrupts
to the least loaded CPU determined by the daemon. To disable this daemon:
chkconfig --level 2345 irqbalance off
By default, some Linux distributions enable the kernel feature,
irqbalance, which performs the same function as the daemon. To disable
this feature, add the following line to your bootloader:
noirqbalance
Example using the Grub bootloader:
title Red Hat Enterprise Linux AS (2.4.21-27.ELsmp)
root (hd0,0)
kernel /vmlinuz-2.4.21-27.ELsmp ro root=/dev/hda3 noirqbalance
initrd /initrd-2.4.21-27.ELsmp.img
2. After running insmod, the driver is loaded and the incorrect network
interface is brought up without running ifup.
When using 2.4.x kernels, including RHEL kernels, the Linux kernel
invokes a script named "hotplug". This script is primarily used to
automatically bring up USB devices when they are plugged in, however,
the script also attempts to automatically bring up a network interface
after loading the kernel module. The hotplug script does this by scanning
the ifcfg-eth# config files in /etc/sysconfig/network-scripts, looking
for HWADDR=<mac_address>.
If the hotplug script does not find the HWADDRR within any of the
ifcfg-eth# files, it will bring up the device with the next available
interface name. If this interface is already configured for a different
network card, your new interface will have incorrect IP address and
network settings.
To solve this issue, you can add the HWADDR=<mac_address> key to the
interface config file of your network controller.
To disable this "hotplug" feature, you may add the driver (module name)
to the "blacklist" file located in /etc/hotplug. It has been noted that
this does not work for network devices because the net.agent script
does not use the blacklist file. Simply remove, or rename, the net.agent
script located in /etc/hotplug to disable this feature.
3. Transport Protocol (TP) hangs when running heavy multi-connection traffic
on an AMD Opteron system with HyperTransport PCI-X Tunnel chipset.
If your AMD Opteron system uses the AMD-8131 HyperTransport PCI-X Tunnel
chipset, you may experience the "133-Mhz Mode Split Completion Data
Corruption" bug identified by AMD while using a 133Mhz PCI-X card on the
bus PCI-X bus.
AMD states, "Under highly specific conditions, the AMD-8131 PCI-X Tunnel
can provide stale data via split completion cycles to a PCI-X card that
is operating at 133 Mhz", causing data corruption.
AMD's provides three workarounds for this problem, however, Chelsio
recommends the first option for best performance with this bug:
For 133Mhz secondary bus operation, limit the transaction length and
the number of outstanding transactions, via BIOS configuration
programming of the PCI-X card, to the following:
Data Length (bytes): 1k
Total allowed outstanding transactions: 2
Please refer to AMD 8131-HT/PCI-X Errata 26310 Rev 3.08 August 2004,
section 56, "133-MHz Mode Split Completion Data Corruption" for more
details with this bug and workarounds suggested by AMD.
It may be possible to work outside AMD's recommended PCI-X settings, try
increasing the Data Length to 2k bytes for increased performance. If you
have issues with these settings, please revert to the "safe" settings
and duplicate the problem before submitting a bug or asking for support.
NOTE: The default setting on most systems is 8 outstanding transactions
and 2k bytes data length.
4. On multiprocessor systems, it has been noted that an application which
is handling 10Gb networking can switch between CPUs causing degraded
and/or unstable performance.
If running on an SMP system and taking performance measurements, it
is suggested you either run the latest netperf-2.4.0+ or use a binding
tool such as Tim Hockin's procstate utilities (runon)
<http://www.hockin.org/~thockin/procstate/>.
Binding netserver and netperf (or other applications) to particular
CPUs will have a significant difference in performance measurements.
You may need to experiment which CPU to bind the application to in
order to achieve the best performance for your system.
If you are developing an application designed for 10Gb networking,
please keep in mind you may want to look at kernel functions
sched_setaffinity & sched_getaffinity to bind your application.
If you are just running user-space applications such as ftp, telnet,
etc., you may want to try the runon tool provided by Tim Hockin's
procstate utility. You could also try binding the interface to a
particular CPU: runon 0 ifup eth0
SUPPORT
=======
If you have problems with the software or hardware, please contact our
customer support team via email at support@chelsio.com or check our website
at http://www.chelsio.com
===============================================================================
Chelsio Communications
370 San Aleso Ave.
Suite 100
Sunnyvale, CA 94085
http://www.chelsio.com
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License, version 2, as
published by the Free Software Foundation.
You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
Copyright (c) 2003-2005 Chelsio Communications. All rights reserved.
===============================================================================

View File

@ -0,0 +1,288 @@
-------
PHY Abstraction Layer
(Updated 2005-07-21)
Purpose
Most network devices consist of set of registers which provide an interface
to a MAC layer, which communicates with the physical connection through a
PHY. The PHY concerns itself with negotiating link parameters with the link
partner on the other side of the network connection (typically, an ethernet
cable), and provides a register interface to allow drivers to determine what
settings were chosen, and to configure what settings are allowed.
While these devices are distinct from the network devices, and conform to a
standard layout for the registers, it has been common practice to integrate
the PHY management code with the network driver. This has resulted in large
amounts of redundant code. Also, on embedded systems with multiple (and
sometimes quite different) ethernet controllers connected to the same
management bus, it is difficult to ensure safe use of the bus.
Since the PHYs are devices, and the management busses through which they are
accessed are, in fact, busses, the PHY Abstraction Layer treats them as such.
In doing so, it has these goals:
1) Increase code-reuse
2) Increase overall code-maintainability
3) Speed development time for new network drivers, and for new systems
Basically, this layer is meant to provide an interface to PHY devices which
allows network driver writers to write as little code as possible, while
still providing a full feature set.
The MDIO bus
Most network devices are connected to a PHY by means of a management bus.
Different devices use different busses (though some share common interfaces).
In order to take advantage of the PAL, each bus interface needs to be
registered as a distinct device.
1) read and write functions must be implemented. Their prototypes are:
int write(struct mii_bus *bus, int mii_id, int regnum, u16 value);
int read(struct mii_bus *bus, int mii_id, int regnum);
mii_id is the address on the bus for the PHY, and regnum is the register
number. These functions are guaranteed not to be called from interrupt
time, so it is safe for them to block, waiting for an interrupt to signal
the operation is complete
2) A reset function is necessary. This is used to return the bus to an
initialized state.
3) A probe function is needed. This function should set up anything the bus
driver needs, setup the mii_bus structure, and register with the PAL using
mdiobus_register. Similarly, there's a remove function to undo all of
that (use mdiobus_unregister).
4) Like any driver, the device_driver structure must be configured, and init
exit functions are used to register the driver.
5) The bus must also be declared somewhere as a device, and registered.
As an example for how one driver implemented an mdio bus driver, see
drivers/net/gianfar_mii.c and arch/ppc/syslib/mpc85xx_devices.c
Connecting to a PHY
Sometime during startup, the network driver needs to establish a connection
between the PHY device, and the network device. At this time, the PHY's bus
and drivers need to all have been loaded, so it is ready for the connection.
At this point, there are several ways to connect to the PHY:
1) The PAL handles everything, and only calls the network driver when
the link state changes, so it can react.
2) The PAL handles everything except interrupts (usually because the
controller has the interrupt registers).
3) The PAL handles everything, but checks in with the driver every second,
allowing the network driver to react first to any changes before the PAL
does.
4) The PAL serves only as a library of functions, with the network device
manually calling functions to update status, and configure the PHY
Letting the PHY Abstraction Layer do Everything
If you choose option 1 (The hope is that every driver can, but to still be
useful to drivers that can't), connecting to the PHY is simple:
First, you need a function to react to changes in the link state. This
function follows this protocol:
static void adjust_link(struct net_device *dev);
Next, you need to know the device name of the PHY connected to this device.
The name will look something like, "phy0:0", where the first number is the
bus id, and the second is the PHY's address on that bus.
Now, to connect, just call this function:
phydev = phy_connect(dev, phy_name, &adjust_link, flags);
phydev is a pointer to the phy_device structure which represents the PHY. If
phy_connect is successful, it will return the pointer. dev, here, is the
pointer to your net_device. Once done, this function will have started the
PHY's software state machine, and registered for the PHY's interrupt, if it
has one. The phydev structure will be populated with information about the
current state, though the PHY will not yet be truly operational at this
point.
flags is a u32 which can optionally contain phy-specific flags.
This is useful if the system has put hardware restrictions on
the PHY/controller, of which the PHY needs to be aware.
Now just make sure that phydev->supported and phydev->advertising have any
values pruned from them which don't make sense for your controller (a 10/100
controller may be connected to a gigabit capable PHY, so you would need to
mask off SUPPORTED_1000baseT*). See include/linux/ethtool.h for definitions
for these bitfields. Note that you should not SET any bits, or the PHY may
get put into an unsupported state.
Lastly, once the controller is ready to handle network traffic, you call
phy_start(phydev). This tells the PAL that you are ready, and configures the
PHY to connect to the network. If you want to handle your own interrupts,
just set phydev->irq to PHY_IGNORE_INTERRUPT before you call phy_start.
Similarly, if you don't want to use interrupts, set phydev->irq to PHY_POLL.
When you want to disconnect from the network (even if just briefly), you call
phy_stop(phydev).
Keeping Close Tabs on the PAL
It is possible that the PAL's built-in state machine needs a little help to
keep your network device and the PHY properly in sync. If so, you can
register a helper function when connecting to the PHY, which will be called
every second before the state machine reacts to any changes. To do this, you
need to manually call phy_attach() and phy_prepare_link(), and then call
phy_start_machine() with the second argument set to point to your special
handler.
Currently there are no examples of how to use this functionality, and testing
on it has been limited because the author does not have any drivers which use
it (they all use option 1). So Caveat Emptor.
Doing it all yourself
There's a remote chance that the PAL's built-in state machine cannot track
the complex interactions between the PHY and your network device. If this is
so, you can simply call phy_attach(), and not call phy_start_machine or
phy_prepare_link(). This will mean that phydev->state is entirely yours to
handle (phy_start and phy_stop toggle between some of the states, so you
might need to avoid them).
An effort has been made to make sure that useful functionality can be
accessed without the state-machine running, and most of these functions are
descended from functions which did not interact with a complex state-machine.
However, again, no effort has been made so far to test running without the
state machine, so tryer beware.
Here is a brief rundown of the functions:
int phy_read(struct phy_device *phydev, u16 regnum);
int phy_write(struct phy_device *phydev, u16 regnum, u16 val);
Simple read/write primitives. They invoke the bus's read/write function
pointers.
void phy_print_status(struct phy_device *phydev);
A convenience function to print out the PHY status neatly.
int phy_clear_interrupt(struct phy_device *phydev);
int phy_config_interrupt(struct phy_device *phydev, u32 interrupts);
Clear the PHY's interrupt, and configure which ones are allowed,
respectively. Currently only supports all on, or all off.
int phy_enable_interrupts(struct phy_device *phydev);
int phy_disable_interrupts(struct phy_device *phydev);
Functions which enable/disable PHY interrupts, clearing them
before and after, respectively.
int phy_start_interrupts(struct phy_device *phydev);
int phy_stop_interrupts(struct phy_device *phydev);
Requests the IRQ for the PHY interrupts, then enables them for
start, or disables then frees them for stop.
struct phy_device * phy_attach(struct net_device *dev, const char *phy_id,
u32 flags);
Attaches a network device to a particular PHY, binding the PHY to a generic
driver if none was found during bus initialization. Passes in
any phy-specific flags as needed.
int phy_start_aneg(struct phy_device *phydev);
Using variables inside the phydev structure, either configures advertising
and resets autonegotiation, or disables autonegotiation, and configures
forced settings.
static inline int phy_read_status(struct phy_device *phydev);
Fills the phydev structure with up-to-date information about the current
settings in the PHY.
void phy_sanitize_settings(struct phy_device *phydev)
Resolves differences between currently desired settings, and
supported settings for the given PHY device. Does not make
the changes in the hardware, though.
int phy_ethtool_sset(struct phy_device *phydev, struct ethtool_cmd *cmd);
int phy_ethtool_gset(struct phy_device *phydev, struct ethtool_cmd *cmd);
Ethtool convenience functions.
int phy_mii_ioctl(struct phy_device *phydev,
struct mii_ioctl_data *mii_data, int cmd);
The MII ioctl. Note that this function will completely screw up the state
machine if you write registers like BMCR, BMSR, ADVERTISE, etc. Best to
use this only to write registers which are not standard, and don't set off
a renegotiation.
PHY Device Drivers
With the PHY Abstraction Layer, adding support for new PHYs is
quite easy. In some cases, no work is required at all! However,
many PHYs require a little hand-holding to get up-and-running.
Generic PHY driver
If the desired PHY doesn't have any errata, quirks, or special
features you want to support, then it may be best to not add
support, and let the PHY Abstraction Layer's Generic PHY Driver
do all of the work.
Writing a PHY driver
If you do need to write a PHY driver, the first thing to do is
make sure it can be matched with an appropriate PHY device.
This is done during bus initialization by reading the device's
UID (stored in registers 2 and 3), then comparing it to each
driver's phy_id field by ANDing it with each driver's
phy_id_mask field. Also, it needs a name. Here's an example:
static struct phy_driver dm9161_driver = {
.phy_id = 0x0181b880,
.name = "Davicom DM9161E",
.phy_id_mask = 0x0ffffff0,
...
}
Next, you need to specify what features (speed, duplex, autoneg,
etc) your PHY device and driver support. Most PHYs support
PHY_BASIC_FEATURES, but you can look in include/mii.h for other
features.
Each driver consists of a number of function pointers:
config_init: configures PHY into a sane state after a reset.
For instance, a Davicom PHY requires descrambling disabled.
probe: Does any setup needed by the driver
suspend/resume: power management
config_aneg: Changes the speed/duplex/negotiation settings
read_status: Reads the current speed/duplex/negotiation settings
ack_interrupt: Clear a pending interrupt
config_intr: Enable or disable interrupts
remove: Does any driver take-down
Of these, only config_aneg and read_status are required to be
assigned by the driver code. The rest are optional. Also, it is
preferred to use the generic phy driver's versions of these two
functions if at all possible: genphy_read_status and
genphy_config_aneg. If this is not possible, it is likely that
you only need to perform some actions before and after invoking
these functions, and so your functions will wrap the generic
ones.
Feel free to look at the Marvell, Cicada, and Davicom drivers in
drivers/net/phy/ for examples (the lxt and qsemi drivers have
not been tested as of this writing)

View File

@ -355,7 +355,7 @@ REVISION HISTORY
There is no functional difference between the two packages
2.0.7 Aug 26, 1999 o Merged X25API code into WANPIPE.
o Fixed a memeory leak for X25API
o Fixed a memory leak for X25API
o Updated the X25API code for 2.2.X kernels.
o Improved NEM handling.
@ -514,7 +514,7 @@ beta2-2.2.0 Jan 8 2001
o Patches for 2.4.0 kernel
o Patches for 2.2.18 kernel
o Minor updates to PPP and CHLDC drivers.
Note: No functinal difference.
Note: No functional difference.
beta3-2.2.9 Jan 10 2001
o I missed the 2.2.18 kernel patches in beta2-2.2.0

View File

@ -84,7 +84,7 @@ Each entry consists of:
Most drivers don't need to use the driver_data field. Best practice
for use of driver_data is to use it as an index into a static list of
equivalant device types, not to use it as a pointer.
equivalent device types, not to use it as a pointer.
Have a table entry {PCI_ANY_ID, PCI_ANY_ID, PCI_ANY_ID, PCI_ANY_ID}
to have probe() called for every PCI device known to the system.
@ -266,20 +266,6 @@ port an old driver to the new PCI interface. They are no longer present
in the kernel as they aren't compatible with hotplug or PCI domains or
having sane locking.
pcibios_present() and Since ages, you don't need to test presence
pci_present() of PCI subsystem when trying to talk to it.
If it's not there, the list of PCI devices
is empty and all functions for searching for
devices just return NULL.
pcibios_(read|write)_* Superseded by their pci_(read|write)_*
counterparts.
pcibios_find_* Superseded by their pci_get_* counterparts.
pci_for_each_dev() Superseded by pci_get_device()
pci_for_each_dev_reverse() Superseded by pci_find_device_reverse()
pci_for_each_bus() Superseded by pci_find_next_bus()
pci_find_device() Superseded by pci_get_device()
pci_find_subsys() Superseded by pci_get_subsys()
pci_find_slot() Superseded by pci_get_slot()
pcibios_find_class() Superseded by pci_get_class()
pci_find_class() Superseded by pci_get_class()
pci_(read|write)_*_nodev() Superseded by pci_bus_(read|write)_*()

View File

@ -0,0 +1,138 @@
Author: Andreas Steinmetz <ast@domdv.de>
How to use dm-crypt and swsusp together:
========================================
Some prerequisites:
You know how dm-crypt works. If not, visit the following web page:
http://www.saout.de/misc/dm-crypt/
You have read Documentation/power/swsusp.txt and understand it.
You did read Documentation/initrd.txt and know how an initrd works.
You know how to create or how to modify an initrd.
Now your system is properly set up, your disk is encrypted except for
the swap device(s) and the boot partition which may contain a mini
system for crypto setup and/or rescue purposes. You may even have
an initrd that does your current crypto setup already.
At this point you want to encrypt your swap, too. Still you want to
be able to suspend using swsusp. This, however, means that you
have to be able to either enter a passphrase or that you read
the key(s) from an external device like a pcmcia flash disk
or an usb stick prior to resume. So you need an initrd, that sets
up dm-crypt and then asks swsusp to resume from the encrypted
swap device.
The most important thing is that you set up dm-crypt in such
a way that the swap device you suspend to/resume from has
always the same major/minor within the initrd as well as
within your running system. The easiest way to achieve this is
to always set up this swap device first with dmsetup, so that
it will always look like the following:
brw------- 1 root root 254, 0 Jul 28 13:37 /dev/mapper/swap0
Now set up your kernel to use /dev/mapper/swap0 as the default
resume partition, so your kernel .config contains:
CONFIG_PM_STD_PARTITION="/dev/mapper/swap0"
Prepare your boot loader to use the initrd you will create or
modify. For lilo the simplest setup looks like the following
lines:
image=/boot/vmlinuz
initrd=/boot/initrd.gz
label=linux
append="root=/dev/ram0 init=/linuxrc rw"
Finally you need to create or modify your initrd. Lets assume
you create an initrd that reads the required dm-crypt setup
from a pcmcia flash disk card. The card is formatted with an ext2
fs which resides on /dev/hde1 when the card is inserted. The
card contains at least the encrypted swap setup in a file
named "swapkey". /etc/fstab of your initrd contains something
like the following:
/dev/hda1 /mnt ext3 ro 0 0
none /proc proc defaults,noatime,nodiratime 0 0
none /sys sysfs defaults,noatime,nodiratime 0 0
/dev/hda1 contains an unencrypted mini system that sets up all
of your crypto devices, again by reading the setup from the
pcmcia flash disk. What follows now is a /linuxrc for your
initrd that allows you to resume from encrypted swap and that
continues boot with your mini system on /dev/hda1 if resume
does not happen:
#!/bin/sh
PATH=/sbin:/bin:/usr/sbin:/usr/bin
mount /proc
mount /sys
mapped=0
noresume=`grep -c noresume /proc/cmdline`
if [ "$*" != "" ]
then
noresume=1
fi
dmesg -n 1
/sbin/cardmgr -q
for i in 1 2 3 4 5 6 7 8 9 0
do
if [ -f /proc/ide/hde/media ]
then
usleep 500000
mount -t ext2 -o ro /dev/hde1 /mnt
if [ -f /mnt/swapkey ]
then
dmsetup create swap0 /mnt/swapkey > /dev/null 2>&1 && mapped=1
fi
umount /mnt
break
fi
usleep 500000
done
killproc /sbin/cardmgr
dmesg -n 6
if [ $mapped = 1 ]
then
if [ $noresume != 0 ]
then
mkswap /dev/mapper/swap0 > /dev/null 2>&1
fi
echo 254:0 > /sys/power/resume
dmsetup remove swap0
fi
umount /sys
mount /mnt
umount /proc
cd /mnt
pivot_root . mnt
mount /proc
umount -l /mnt
umount /proc
exec chroot . /sbin/init $* < dev/console > dev/console 2>&1
Please don't mind the weird loop above, busybox's msh doesn't know
the let statement. Now, what is happening in the script?
First we have to decide if we want to try to resume, or not.
We will not resume if booting with "noresume" or any parameters
for init like "single" or "emergency" as boot parameters.
Then we need to set up dmcrypt with the setup data from the
pcmcia flash disk. If this succeeds we need to reset the swap
device if we don't want to resume. The line "echo 254:0 > /sys/power/resume"
then attempts to resume from the first device mapper device.
Note that it is important to set the device in /sys/power/resume,
regardless if resuming or not, otherwise later suspend will fail.
If resume starts, script execution terminates here.
Otherwise we just remove the encrypted swap device and leave it to the
mini system on /dev/hda1 to set the whole crypto up (it is up to
you to modify this to your taste).
What then follows is the well known process to change the root
file system and continue booting from there. I prefer to unmount
the initrd prior to continue booting but it is up to you to modify
this.

View File

@ -1,22 +1,20 @@
From kernel/suspend.c:
Some warnings, first.
* BIG FAT WARNING *********************************************************
*
* If you have unsupported (*) devices using DMA...
* ...say goodbye to your data.
*
* If you touch anything on disk between suspend and resume...
* ...kiss your data goodbye.
*
* If your disk driver does not support suspend... (IDE does)
* ...you'd better find out how to get along
* without your data.
* If you do resume from initrd after your filesystems are mounted...
* ...bye bye root partition.
* [this is actually same case as above]
*
* If you change kernel command line between suspend and resume...
* ...prepare for nasty fsck or worse.
*
* If you change your hardware while system is suspended...
* ...well, it was not good idea.
* If you have unsupported (*) devices using DMA, you may have some
* problems. If your disk driver does not support suspend... (IDE does),
* it may cause some problems, too. If you change kernel command line
* between suspend and resume, it may do something wrong. If you change
* your hardware while system is suspended... well, it was not good idea;
* but it will probably only crash.
*
* (*) suspend/resume support is needed to make it safe.
@ -30,6 +28,13 @@ echo shutdown > /sys/power/disk; echo disk > /sys/power/state
echo platform > /sys/power/disk; echo disk > /sys/power/state
Encrypted suspend image:
------------------------
If you want to store your suspend image encrypted with a temporary
key to prevent data gathering after resume you must compile
crypto and the aes algorithm into the kernel - modules won't work
as they cannot be loaded at resume time.
Article about goals and implementation of Software Suspend for Linux
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -85,11 +90,6 @@ resume.
You have your server on UPS. Power died, and UPS is indicating 30
seconds to failure. What do you do? Suspend to disk.
Ethernet card in your server died. You want to replace it. Your
server is not hotplug capable. What do you do? Suspend to disk,
replace ethernet card, resume. If you are fast your users will not
even see broken connections.
Q: Maybe I'm missing something, but why don't the regular I/O paths work?
@ -117,31 +117,6 @@ Q: Does linux support ACPI S4?
A: Yes. That's what echo platform > /sys/power/disk does.
Q: My machine doesn't work with ACPI. How can I use swsusp than ?
A: Do a reboot() syscall with right parameters. Warning: glibc gets in
its way, so check with strace:
reboot(LINUX_REBOOT_MAGIC1, LINUX_REBOOT_MAGIC2, 0xd000fce2)
(Thanks to Peter Osterlund:)
#include <unistd.h>
#include <syscall.h>
#define LINUX_REBOOT_MAGIC1 0xfee1dead
#define LINUX_REBOOT_MAGIC2 672274793
#define LINUX_REBOOT_CMD_SW_SUSPEND 0xD000FCE2
int main()
{
syscall(SYS_reboot, LINUX_REBOOT_MAGIC1, LINUX_REBOOT_MAGIC2,
LINUX_REBOOT_CMD_SW_SUSPEND, 0);
return 0;
}
Also /sys/ interface should be still present.
Q: What is 'suspend2'?
A: suspend2 is 'Software Suspend 2', a forked implementation of
@ -311,3 +286,46 @@ As a rule of thumb use encrypted swap to protect your data while your
system is shut down or suspended. Additionally use the encrypted
suspend image to prevent sensitive data from being stolen after
resume.
Q: Why can't we suspend to a swap file?
A: Because accessing swap file needs the filesystem mounted, and
filesystem might do something wrong (like replaying the journal)
during mount.
There are few ways to get that fixed:
1) Probably could be solved by modifying every filesystem to support
some kind of "really read-only!" option. Patches welcome.
2) suspend2 gets around that by storing absolute positions in on-disk
image (and blocksize), with resume parameter pointing directly to
suspend header.
Q: Is there a maximum system RAM size that is supported by swsusp?
A: It should work okay with highmem.
Q: Does swsusp (to disk) use only one swap partition or can it use
multiple swap partitions (aggregate them into one logical space)?
A: Only one swap partition, sorry.
Q: If my application(s) causes lots of memory & swap space to be used
(over half of the total system RAM), is it correct that it is likely
to be useless to try to suspend to disk while that app is running?
A: No, it should work okay, as long as your app does not mlock()
it. Just prepare big enough swap partition.
Q: What information is usefull for debugging suspend-to-disk problems?
A: Well, last messages on the screen are always useful. If something
is broken, it is usually some kernel driver, therefore trying with as
little as possible modules loaded helps a lot. I also prefer people to
suspend from console, preferably without X running. Booting with
init=/bin/bash, then swapon and starting suspend sequence manually
usually does the trick. Then it is good idea to try with latest
vanilla kernel.

View File

@ -46,6 +46,12 @@ There are a few types of systems where video works after S3 resume:
POSTing bios works. Ole Rohne has patch to do just that at
http://dev.gentoo.org/~marineam/patch-radeonfb-2.6.11-rc2-mm2.
(8) on some systems, you can use the video_post utility mentioned here:
http://bugzilla.kernel.org/show_bug.cgi?id=3670. Do echo 3 > /sys/power/state
&& /usr/sbin/video_post - which will initialize the display in console mode.
If you are in X, you can switch to a virtual terminal and back to X using
CTRL+ALT+F1 - CTRL+ALT+F7 to get the display working in graphical mode again.
Now, if you pass acpi_sleep=something, and it does not work with your
bios, you'll get a hard crash during resume. Be careful. Also it is
safest to do your experiments with plain old VGA console. The vesafb
@ -64,7 +70,8 @@ Model hack (or "how to do it")
------------------------------------------------------------------------------
Acer Aspire 1406LC ole's late BIOS init (7), turn off DRI
Acer TM 242FX vbetool (6)
Acer TM C300 vga=normal (only suspend on console, not in X), vbetool (6)
Acer TM C110 video_post (8)
Acer TM C300 vga=normal (only suspend on console, not in X), vbetool (6) or video_post (8)
Acer TM 4052LCi s3_bios (2)
Acer TM 636Lci s3_bios vga=normal (2)
Acer TM 650 (Radeon M7) vga=normal plus boot-radeon (5) gets text console back
@ -113,6 +120,7 @@ IBM ThinkPad T42p (2373-GTG) s3_bios (2)
IBM TP X20 ??? (*)
IBM TP X30 s3_bios (2)
IBM TP X31 / Type 2672-XXH none (1), use radeontool (http://fdd.com/software/radeon/) to turn off backlight.
IBM TP X32 none (1), but backlight is on and video is trashed after long suspend
IBM Thinkpad X40 Type 2371-7JG s3_bios,s3_mode (4)
Medion MD4220 ??? (*)
Samsung P35 vbetool needed (6)

View File

@ -134,7 +134,7 @@ pci_get_device_by_addr() will find the pci device associated
with that address (if any).
The default include/asm-ppc64/io.h macros readb(), inb(), insb(),
etc. include a check to see if the the i/o read returned all-0xff's.
etc. include a check to see if the i/o read returned all-0xff's.
If so, these make a call to eeh_dn_check_failure(), which in turn
asks the firmware if the all-ff's value is the sign of a true EEH
error. If it is not, processing continues as normal. The grand

View File

@ -468,7 +468,7 @@ The hex_ascii view shows the data field in hex and ascii representation
The raw view returns a bytestream as the debug areas are stored in memory.
The sprintf view formats the debug entries in the same way as the sprintf
function would do. The sprintf event/expection fuctions write to the
function would do. The sprintf event/expection functions write to the
debug entry a pointer to the format string (size = sizeof(long))
and for each vararg a long value. So e.g. for a debug entry with a format
string plus two varargs one would need to allocate a (3 * sizeof(long))

View File

@ -60,6 +60,8 @@ scsi.txt
- short blurb on using SCSI support as a module.
scsi_mid_low_api.txt
- info on API between SCSI layer and low level drivers
scsi_eh.txt
- info on SCSI midlayer error handling infrastructure
st.txt
- info on scsi tape driver
sym53c500_cs.txt

View File

@ -1,5 +1,5 @@
====================================================================
= Adaptec Aic7xxx Fast -> Ultra160 Family Manager Set v6.2.28 =
= Adaptec Aic7xxx Fast -> Ultra160 Family Manager Set v7.0 =
= README for =
= The Linux Operating System =
====================================================================
@ -131,6 +131,10 @@ The following information is available in this file:
SCSI "stub" effects.
2. Version History
7.0 (4th August, 2005)
- Updated driver to use SCSI transport class infrastructure
- Upported sequencer and core fixes from last adaptec released
version of the driver.
6.2.36 (June 3rd, 2003)
- Correct code that disables PCI parity error checking.
- Correct and simplify handling of the ignore wide residue

View File

@ -344,7 +344,7 @@
/proc/scsi/ibmmca/<host_no>. ibmmca_proc_info() provides this information.
This table is quite informative for interested users. It shows the load
of commands on the subsystem and wether you are running the bypassed
of commands on the subsystem and whether you are running the bypassed
(software) or integrated (hardware) SCSI-command set (see below). The
amount of accesses is shown. Read, write, modeselect is shown separately
in order to help debugging problems with CD-ROMs or tapedrives.

View File

@ -0,0 +1,479 @@
SCSI EH
======================================
This document describes SCSI midlayer error handling infrastructure.
Please refer to Documentation/scsi/scsi_mid_low_api.txt for more
information regarding SCSI midlayer.
TABLE OF CONTENTS
[1] How SCSI commands travel through the midlayer and to EH
[1-1] struct scsi_cmnd
[1-2] How do scmd's get completed?
[1-2-1] Completing a scmd w/ scsi_done
[1-2-2] Completing a scmd w/ timeout
[1-3] How EH takes over
[2] How SCSI EH works
[2-1] EH through fine-grained callbacks
[2-1-1] Overview
[2-1-2] Flow of scmds through EH
[2-1-3] Flow of control
[2-2] EH through hostt->eh_strategy_handler()
[2-2-1] Pre hostt->eh_strategy_handler() SCSI midlayer conditions
[2-2-2] Post hostt->eh_strategy_handler() SCSI midlayer conditions
[2-2-3] Things to consider
[1] How SCSI commands travel through the midlayer and to EH
[1-1] struct scsi_cmnd
Each SCSI command is represented with struct scsi_cmnd (== scmd). A
scmd has two list_head's to link itself into lists. The two are
scmd->list and scmd->eh_entry. The former is used for free list or
per-device allocated scmd list and not of much interest to this EH
discussion. The latter is used for completion and EH lists and unless
otherwise stated scmds are always linked using scmd->eh_entry in this
discussion.
[1-2] How do scmd's get completed?
Once LLDD gets hold of a scmd, either the LLDD will complete the
command by calling scsi_done callback passed from midlayer when
invoking hostt->queuecommand() or SCSI midlayer will time it out.
[1-2-1] Completing a scmd w/ scsi_done
For all non-EH commands, scsi_done() is the completion callback. It
does the following.
1. Delete timeout timer. If it fails, it means that timeout timer
has expired and is going to finish the command. Just return.
2. Link scmd to per-cpu scsi_done_q using scmd->en_entry
3. Raise SCSI_SOFTIRQ
SCSI_SOFTIRQ handler scsi_softirq calls scsi_decide_disposition() to
determine what to do with the command. scsi_decide_disposition()
looks at the scmd->result value and sense data to determine what to do
with the command.
- SUCCESS
scsi_finish_command() is invoked for the command. The
function does some maintenance choirs and notify completion by
calling scmd->done() callback, which, for fs requests, would
be HLD completion callback - sd:sd_rw_intr, sr:rw_intr,
st:st_intr.
- NEEDS_RETRY
- ADD_TO_MLQUEUE
scmd is requeued to blk queue.
- otherwise
scsi_eh_scmd_add(scmd, 0) is invoked for the command. See
[1-3] for details of this funciton.
[1-2-2] Completing a scmd w/ timeout
The timeout handler is scsi_times_out(). When a timeout occurs, this
function
1. invokes optional hostt->eh_timedout() callback. Return value can
be one of
- EH_HANDLED
This indicates that eh_timedout() dealt with the timeout. The
scmd is passed to __scsi_done() and thus linked into per-cpu
scsi_done_q. Normal command completion described in [1-2-1]
follows.
- EH_RESET_TIMER
This indicates that more time is required to finish the
command. Timer is restarted. This action is counted as a
retry and only allowed scmd->allowed + 1(!) times. Once the
limit is reached, action for EH_NOT_HANDLED is taken instead.
*NOTE* This action is racy as the LLDD could finish the scmd
after the timeout has expired but before it's added back. In
such cases, scsi_done() would think that timeout has occurred
and return without doing anything. We lose completion and the
command will time out again.
- EH_NOT_HANDLED
This is the same as when eh_timedout() callback doesn't exist.
Step #2 is taken.
2. scsi_eh_scmd_add(scmd, SCSI_EH_CANCEL_CMD) is invoked for the
command. See [1-3] for more information.
[1-3] How EH takes over
scmds enter EH via scsi_eh_scmd_add(), which does the following.
1. Turns on scmd->eh_eflags as requested. It's 0 for error
completions and SCSI_EH_CANCEL_CMD for timeouts.
2. Links scmd->eh_entry to shost->eh_cmd_q
3. Sets SHOST_RECOVERY bit in shost->shost_state
4. Increments shost->host_failed
5. Wakes up SCSI EH thread if shost->host_busy == shost->host_failed
As can be seen above, once any scmd is added to shost->eh_cmd_q,
SHOST_RECOVERY shost_state bit is turned on. This prevents any new
scmd to be issued from blk queue to the host; eventually, all scmds on
the host either complete normally, fail and get added to eh_cmd_q, or
time out and get added to shost->eh_cmd_q.
If all scmds either complete or fail, the number of in-flight scmds
becomes equal to the number of failed scmds - i.e. shost->host_busy ==
shost->host_failed. This wakes up SCSI EH thread. So, once woken up,
SCSI EH thread can expect that all in-flight commands have failed and
are linked on shost->eh_cmd_q.
Note that this does not mean lower layers are quiescent. If a LLDD
completed a scmd with error status, the LLDD and lower layers are
assumed to forget about the scmd at that point. However, if a scmd
has timed out, unless hostt->eh_timedout() made lower layers forget
about the scmd, which currently no LLDD does, the command is still
active as long as lower layers are concerned and completion could
occur at any time. Of course, all such completions are ignored as the
timer has already expired.
We'll talk about how SCSI EH takes actions to abort - make LLDD
forget about - timed out scmds later.
[2] How SCSI EH works
LLDD's can implement SCSI EH actions in one of the following two
ways.
- Fine-grained EH callbacks
LLDD can implement fine-grained EH callbacks and let SCSI
midlayer drive error handling and call appropriate callbacks.
This will be dicussed further in [2-1].
- eh_strategy_handler() callback
This is one big callback which should perform whole error
handling. As such, it should do all choirs SCSI midlayer
performs during recovery. This will be discussed in [2-2].
Once recovery is complete, SCSI EH resumes normal operation by
calling scsi_restart_operations(), which
1. Checks if door locking is needed and locks door.
2. Clears SHOST_RECOVERY shost_state bit
3. Wakes up waiters on shost->host_wait. This occurs if someone
calls scsi_block_when_processing_errors() on the host.
(*QUESTION* why is it needed? All operations will be blocked
anyway after it reaches blk queue.)
4. Kicks queues in all devices on the host in the asses
[2-1] EH through fine-grained callbacks
[2-1-1] Overview
If eh_strategy_handler() is not present, SCSI midlayer takes charge
of driving error handling. EH's goals are two - make LLDD, host and
device forget about timed out scmds and make them ready for new
commands. A scmd is said to be recovered if the scmd is forgotten by
lower layers and lower layers are ready to process or fail the scmd
again.
To achieve these goals, EH performs recovery actions with increasing
severity. Some actions are performed by issueing SCSI commands and
others are performed by invoking one of the following fine-grained
hostt EH callbacks. Callbacks may be omitted and omitted ones are
considered to fail always.
int (* eh_abort_handler)(struct scsi_cmnd *);
int (* eh_device_reset_handler)(struct scsi_cmnd *);
int (* eh_bus_reset_handler)(struct scsi_cmnd *);
int (* eh_host_reset_handler)(struct scsi_cmnd *);
Higher-severity actions are taken only when lower-severity actions
cannot recover some of failed scmds. Also, note that failure of the
highest-severity action means EH failure and results in offlining of
all unrecovered devices.
During recovery, the following rules are followed
- Recovery actions are performed on failed scmds on the to do list,
eh_work_q. If a recovery action succeeds for a scmd, recovered
scmds are removed from eh_work_q.
Note that single recovery action on a scmd can recover multiple
scmds. e.g. resetting a device recovers all failed scmds on the
device.
- Higher severity actions are taken iff eh_work_q is not empty after
lower severity actions are complete.
- EH reuses failed scmds to issue commands for recovery. For
timed-out scmds, SCSI EH ensures that LLDD forgets about a scmd
before reusing it for EH commands.
When a scmd is recovered, the scmd is moved from eh_work_q to EH
local eh_done_q using scsi_eh_finish_cmd(). After all scmds are
recovered (eh_work_q is empty), scsi_eh_flush_done_q() is invoked to
either retry or error-finish (notify upper layer of failure) recovered
scmds.
scmds are retried iff its sdev is still online (not offlined during
EH), REQ_FAILFAST is not set and ++scmd->retries is less than
scmd->allowed.
[2-1-2] Flow of scmds through EH
1. Error completion / time out
ACTION: scsi_eh_scmd_add() is invoked for scmd
- set scmd->eh_eflags
- add scmd to shost->eh_cmd_q
- set SHOST_RECOVERY
- shost->host_failed++
LOCKING: shost->host_lock
2. EH starts
ACTION: move all scmds to EH's local eh_work_q. shost->eh_cmd_q
is cleared.
LOCKING: shost->host_lock (not strictly necessary, just for
consistency)
3. scmd recovered
ACTION: scsi_eh_finish_cmd() is invoked to EH-finish scmd
- shost->host_failed--
- clear scmd->eh_eflags
- scsi_setup_cmd_retry()
- move from local eh_work_q to local eh_done_q
LOCKING: none
4. EH completes
ACTION: scsi_eh_flush_done_q() retries scmds or notifies upper
layer of failure.
- scmd is removed from eh_done_q and scmd->eh_entry is cleared
- if retry is necessary, scmd is requeued using
scsi_queue_insert()
- otherwise, scsi_finish_command() is invoked for scmd
LOCKING: queue or finish function performs appropriate locking
[2-1-3] Flow of control
EH through fine-grained callbacks start from scsi_unjam_host().
<<scsi_unjam_host>>
1. Lock shost->host_lock, splice_init shost->eh_cmd_q into local
eh_work_q and unlock host_lock. Note that shost->eh_cmd_q is
cleared by this action.
2. Invoke scsi_eh_get_sense.
<<scsi_eh_get_sense>>
This action is taken for each error-completed
(!SCSI_EH_CANCEL_CMD) commands without valid sense data. Most
SCSI transports/LLDDs automatically acquire sense data on
command failures (autosense). Autosense is recommended for
performance reasons and as sense information could get out of
sync inbetween occurrence of CHECK CONDITION and this action.
Note that if autosense is not supported, scmd->sense_buffer
contains invalid sense data when error-completing the scmd
with scsi_done(). scsi_decide_disposition() always returns
FAILED in such cases thus invoking SCSI EH. When the scmd
reaches here, sense data is acquired and
scsi_decide_disposition() is called again.
1. Invoke scsi_request_sense() which issues REQUEST_SENSE
command. If fails, no action. Note that taking no action
causes higher-severity recovery to be taken for the scmd.
2. Invoke scsi_decide_disposition() on the scmd
- SUCCESS
scmd->retries is set to scmd->allowed preventing
scsi_eh_flush_done_q() from retrying the scmd and
scsi_eh_finish_cmd() is invoked.
- NEEDS_RETRY
scsi_eh_finish_cmd() invoked
- otherwise
No action.
3. If !list_empty(&eh_work_q), invoke scsi_eh_abort_cmds().
<<scsi_eh_abort_cmds>>
This action is taken for each timed out command.
hostt->eh_abort_handler() is invoked for each scmd. The
handler returns SUCCESS if it has succeeded to make LLDD and
all related hardware forget about the scmd.
If a timedout scmd is successfully aborted and the sdev is
either offline or ready, scsi_eh_finish_cmd() is invoked for
the scmd. Otherwise, the scmd is left in eh_work_q for
higher-severity actions.
Note that both offline and ready status mean that the sdev is
ready to process new scmds, where processing also implies
immediate failing; thus, if a sdev is in one of the two
states, no further recovery action is needed.
Device readiness is tested using scsi_eh_tur() which issues
TEST_UNIT_READY command. Note that the scmd must have been
aborted successfully before reusing it for TEST_UNIT_READY.
4. If !list_empty(&eh_work_q), invoke scsi_eh_ready_devs()
<<scsi_eh_ready_devs>>
This function takes four increasingly more severe measures to
make failed sdevs ready for new commands.
1. Invoke scsi_eh_stu()
<<scsi_eh_stu>>
For each sdev which has failed scmds with valid sense data
of which scsi_check_sense()'s verdict is FAILED,
START_STOP_UNIT command is issued w/ start=1. Note that
as we explicitly choose error-completed scmds, it is known
that lower layers have forgotten about the scmd and we can
reuse it for STU.
If STU succeeds and the sdev is either offline or ready,
all failed scmds on the sdev are EH-finished with
scsi_eh_finish_cmd().
*NOTE* If hostt->eh_abort_handler() isn't implemented or
failed, we may still have timed out scmds at this point
and STU doesn't make lower layers forget about those
scmds. Yet, this function EH-finish all scmds on the sdev
if STU succeeds leaving lower layers in an inconsistent
state. It seems that STU action should be taken only when
a sdev has no timed out scmd.
2. If !list_empty(&eh_work_q), invoke scsi_eh_bus_device_reset().
<<scsi_eh_bus_device_reset>>
This action is very similar to scsi_eh_stu() except that,
instead of issuing STU, hostt->eh_device_reset_handler()
is used. Also, as we're not issuing SCSI commands and
resetting clears all scmds on the sdev, there is no need
to choose error-completed scmds.
3. If !list_empty(&eh_work_q), invoke scsi_eh_bus_reset()
<<scsi_eh_bus_reset>>
hostt->eh_bus_reset_handler() is invoked for each channel
with failed scmds. If bus reset succeeds, all failed
scmds on all ready or offline sdevs on the channel are
EH-finished.
4. If !list_empty(&eh_work_q), invoke scsi_eh_host_reset()
<<scsi_eh_host_reset>>
This is the last resort. hostt->eh_host_reset_handler()
is invoked. If host reset succeeds, all failed scmds on
all ready or offline sdevs on the host are EH-finished.
5. If !list_empty(&eh_work_q), invoke scsi_eh_offline_sdevs()
<<scsi_eh_offline_sdevs>>
Take all sdevs which still have unrecovered scmds offline
and EH-finish the scmds.
5. Invoke scsi_eh_flush_done_q().
<<scsi_eh_flush_done_q>>
At this point all scmds are recovered (or given up) and
put on eh_done_q by scsi_eh_finish_cmd(). This function
flushes eh_done_q by either retrying or notifying upper
layer of failure of the scmds.
[2-2] EH through hostt->eh_strategy_handler()
hostt->eh_strategy_handler() is invoked in the place of
scsi_unjam_host() and it is responsible for whole recovery process.
On completion, the handler should have made lower layers forget about
all failed scmds and either ready for new commands or offline. Also,
it should perform SCSI EH maintenance choirs to maintain integrity of
SCSI midlayer. IOW, of the steps described in [2-1-2], all steps
except for #1 must be implemented by eh_strategy_handler().
[2-2-1] Pre hostt->eh_strategy_handler() SCSI midlayer conditions
The following conditions are true on entry to the handler.
- Each failed scmd's eh_flags field is set appropriately.
- Each failed scmd is linked on scmd->eh_cmd_q by scmd->eh_entry.
- SHOST_RECOVERY is set.
- shost->host_failed == shost->host_busy
[2-2-2] Post hostt->eh_strategy_handler() SCSI midlayer conditions
The following conditions must be true on exit from the handler.
- shost->host_failed is zero.
- Each scmd's eh_eflags field is cleared.
- Each scmd is in such a state that scsi_setup_cmd_retry() on the
scmd doesn't make any difference.
- shost->eh_cmd_q is cleared.
- Each scmd->eh_entry is cleared.
- Either scsi_queue_insert() or scsi_finish_command() is called on
each scmd. Note that the handler is free to use scmd->retries and
->allowed to limit the number of retries.
[2-2-3] Things to consider
- Know that timed out scmds are still active on lower layers. Make
lower layers forget about them before doing anything else with
those scmds.
- For consistency, when accessing/modifying shost data structure,
grab shost->host_lock.
- On completion, each failed sdev must have forgotten about all
active scmds.
- On completion, each failed sdev must be ready for new commands or
offline.
--
Tejun Heo
htejun@gmail.com
11th September 2005

View File

@ -373,13 +373,11 @@ Summary:
scsi_activate_tcq - turn on tag command queueing
scsi_add_device - creates new scsi device (lu) instance
scsi_add_host - perform sysfs registration and SCSI bus scan.
scsi_add_timer - (re-)start timer on a SCSI command.
scsi_adjust_queue_depth - change the queue depth on a SCSI device
scsi_assign_lock - replace default host_lock with given lock
scsi_bios_ptable - return copy of block device's partition table
scsi_block_requests - prevent further commands being queued to given host
scsi_deactivate_tcq - turn off tag command queueing
scsi_delete_timer - cancel timer on a SCSI command.
scsi_host_alloc - return a new scsi_host instance whose refcount==1
scsi_host_get - increments Scsi_Host instance's refcount
scsi_host_put - decrements Scsi_Host instance's refcount (free if 0)
@ -457,27 +455,6 @@ struct scsi_device * scsi_add_device(struct Scsi_Host *shost,
int scsi_add_host(struct Scsi_Host *shost, struct device * dev)
/**
* scsi_add_timer - (re-)start timer on a SCSI command.
* @scmd: pointer to scsi command instance
* @timeout: duration of timeout in "jiffies"
* @complete: pointer to function to call if timeout expires
*
* Returns nothing
*
* Might block: no
*
* Notes: Each scsi command has its own timer, and as it is added
* to the queue, we set up the timer. When the command completes,
* we cancel the timer. An LLD can use this function to change
* the existing timeout value.
*
* Defined in: drivers/scsi/scsi_error.c
**/
void scsi_add_timer(struct scsi_cmnd *scmd, int timeout,
void (*complete)(struct scsi_cmnd *))
/**
* scsi_adjust_queue_depth - allow LLD to change queue depth on a SCSI device
* @sdev: pointer to SCSI device to change queue depth on
@ -565,24 +542,6 @@ void scsi_block_requests(struct Scsi_Host * shost)
void scsi_deactivate_tcq(struct scsi_device *sdev, int depth)
/**
* scsi_delete_timer - cancel timer on a SCSI command.
* @scmd: pointer to scsi command instance
*
* Returns 1 if able to cancel timer else 0 (i.e. too late or already
* cancelled).
*
* Might block: no [may in the future if it invokes del_timer_sync()]
*
* Notes: All commands issued by upper levels already have a timeout
* associated with them. An LLD can use this function to cancel the
* timer.
*
* Defined in: drivers/scsi/scsi_error.c
**/
int scsi_delete_timer(struct scsi_cmnd *scmd)
/**
* scsi_host_alloc - create a scsi host adapter instance and perform basic
* initialization.

View File

@ -111,24 +111,17 @@ hardware.
Interrupts: locally disabled.
This call must not sleep
stop_tx(port,tty_stop)
stop_tx(port)
Stop transmitting characters. This might be due to the CTS
line becoming inactive or the tty layer indicating we want
to stop transmission.
tty_stop: 1 if this call is due to the TTY layer issuing a
TTY stop to the driver (equiv to rs_stop).
to stop transmission due to an XOFF character.
Locking: port->lock taken.
Interrupts: locally disabled.
This call must not sleep
start_tx(port,tty_start)
start transmitting characters. (incidentally, nonempty will
always be nonzero, and shouldn't be used - it will be dropped).
tty_start: 1 if this call was due to the TTY layer issuing
a TTY start to the driver (equiv to rs_start)
start_tx(port)
start transmitting characters.
Locking: port->lock taken.
Interrupts: locally disabled.

View File

@ -99,6 +99,7 @@ statically linked into the kernel). Those options are:
SONYPI_MEYE_MASK 0x0400
SONYPI_MEMORYSTICK_MASK 0x0800
SONYPI_BATTERY_MASK 0x1000
SONYPI_WIRELESS_MASK 0x2000
useinput: if set (which is the default) two input devices are
created, one which interprets the jogdial events as
@ -137,6 +138,15 @@ Bugs:
speed handling etc). Use ACPI instead of APM if it works on your
laptop.
- sonypi lacks the ability to distinguish between certain key
events on some models.
- some models with the nvidia card (geforce go 6200 tc) uses a
different way to adjust the backlighting of the screen. There
is a userspace utility to adjust the brightness on those models,
which can be downloaded from
http://www.acc.umu.se/~erikw/program/smartdimmer-0.1.tar.bz2
- since all development was done by reverse engineering, there is
_absolutely no guarantee_ that this driver will not crash your
laptop. Permanently.

View File

@ -132,6 +132,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed.
mpu_irq - IRQ # for MPU-401 UART (PnP setup)
dma1 - first DMA # for AD1816A chip (PnP setup)
dma2 - second DMA # for AD1816A chip (PnP setup)
clockfreq - Clock frequency for AD1816A chip (default = 0, 33000Hz)
Module supports up to 8 cards, autoprobe and PnP.
@ -1458,7 +1459,7 @@ devices where %i is sound card number from zero to seven.
To auto-load an ALSA driver for OSS services, define the string
'sound-slot-%i' where %i means the slot number for OSS, which
corresponds to the card index of ALSA. Usually, define this
as the the same card module.
as the same card module.
An example configuration for a single emu10k1 card is like below:
----- /etc/modprobe.conf

View File

@ -3422,10 +3422,17 @@ struct _snd_pcm_runtime {
<para>
The <structfield>iface</structfield> field specifies the type of
the control,
<constant>SNDRV_CTL_ELEM_IFACE_XXX</constant>. There are
<constant>MIXER</constant>, <constant>PCM</constant>,
<constant>CARD</constant>, etc.
the control, <constant>SNDRV_CTL_ELEM_IFACE_XXX</constant>, which
is usually <constant>MIXER</constant>.
Use <constant>CARD</constant> for global controls that are not
logically part of the mixer.
If the control is closely associated with some specific device on
the sound card, use <constant>HWDEP</constant>,
<constant>PCM</constant>, <constant>RAWMIDI</constant>,
<constant>TIMER</constant>, or <constant>SEQUENCER</constant>, and
specify the device number with the
<structfield>device</structfield> and
<structfield>subdevice</structfield> fields.
</para>
<para>

View File

@ -57,7 +57,7 @@ With BK, you can just get it from
and DaveJ has tar-balls at
http://www.codemonkey.org.uk/projects/bitkeeper/sparse/
http://www.codemonkey.org.uk/projects/git-snapshots/sparse/
Once you have it, just do

View File

@ -171,7 +171,7 @@ the header 'include/linux/sysrq.h', this will define everything else you need.
Next, you must create a sysrq_key_op struct, and populate it with A) the key
handler function you will use, B) a help_msg string, that will print when SysRQ
prints help, and C) an action_msg string, that will print right before your
handler is called. Your handler must conform to the protoype in 'sysrq.h'.
handler is called. Your handler must conform to the prototype in 'sysrq.h'.
After the sysrq_key_op is created, you can call the macro
register_sysrq_key(int key, struct sysrq_key_op *op_p) that is defined in

View File

@ -2176,7 +2176,7 @@
If you want to access files on the host machine from inside UML, you
can treat it as a separate machine and either nfs mount directories
from the host or copy files into the virtual machine with scp or rcp.
However, since UML is running on the the host, it can access those
However, since UML is running on the host, it can access those
files just like any other process and make them available inside the
virtual machine without needing to use the network.

View File

@ -20,7 +20,7 @@ License along with this program; if not, write to the Free
Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
MA 02111-1307 USA.
This document and the the gadget serial driver itself are
This document and the gadget serial driver itself are
Copyright (C) 2004 by Al Borchers (alborchers@steinerpoint.com).
If you have questions, problems, or suggestions for this driver

View File

@ -297,18 +297,21 @@ S: SerialNumber=dce0
C:* #Ifs= 1 Cfg#= 1 Atr=40 MxPwr= 0mA
I: If#= 0 Alt= 0 #EPs= 1 Cls=09(hub ) Sub=00 Prot=00 Driver=hub
E: Ad=81(I) Atr=03(Int.) MxPS= 8 Ivl=255ms
T: Bus=00 Lev=01 Prnt=01 Port=00 Cnt=01 Dev#= 2 Spd=12 MxCh= 4
D: Ver= 1.00 Cls=09(hub ) Sub=00 Prot=00 MxPS= 8 #Cfgs= 1
P: Vendor=0451 ProdID=1446 Rev= 1.00
C:* #Ifs= 1 Cfg#= 1 Atr=e0 MxPwr=100mA
I: If#= 0 Alt= 0 #EPs= 1 Cls=09(hub ) Sub=00 Prot=00 Driver=hub
E: Ad=81(I) Atr=03(Int.) MxPS= 1 Ivl=255ms
T: Bus=00 Lev=02 Prnt=02 Port=00 Cnt=01 Dev#= 3 Spd=1.5 MxCh= 0
D: Ver= 1.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS= 8 #Cfgs= 1
P: Vendor=04b4 ProdID=0001 Rev= 0.00
C:* #Ifs= 1 Cfg#= 1 Atr=80 MxPwr=100mA
I: If#= 0 Alt= 0 #EPs= 1 Cls=03(HID ) Sub=01 Prot=02 Driver=mouse
E: Ad=81(I) Atr=03(Int.) MxPS= 3 Ivl= 10ms
T: Bus=00 Lev=02 Prnt=02 Port=02 Cnt=02 Dev#= 4 Spd=12 MxCh= 0
D: Ver= 1.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS= 8 #Cfgs= 1
P: Vendor=0565 ProdID=0001 Rev= 1.08

View File

@ -126,10 +126,12 @@ card=124 - AverMedia AverTV DVB-T 761
card=125 - MATRIX Vision Sigma-SQ
card=126 - MATRIX Vision Sigma-SLC
card=127 - APAC Viewcomp 878(AMAX)
card=128 - DVICO FusionHDTV DVB-T Lite
card=128 - DViCO FusionHDTV DVB-T Lite
card=129 - V-Gear MyVCD
card=130 - Super TV Tuner
card=131 - Tibet Systems 'Progress DVR' CS16
card=132 - Kodicom 4400R (master)
card=133 - Kodicom 4400R (slave)
card=134 - Adlink RTV24
card=135 - DViCO FusionHDTV 5 Lite
card=136 - Acorp Y878F

View File

@ -62,3 +62,6 @@
61 -> Philips TOUGH DVB-T reference design [1131:2004]
62 -> Compro VideoMate TV Gold+II
63 -> Kworld Xpert TV PVR7134
64 -> FlyTV mini Asus Digimatrix [1043:0210,1043:0210]
65 -> V-Stream Studio TV Terminator
66 -> Yuan TUN-900 (saa7135)

View File

@ -64,3 +64,4 @@ tuner=62 - Philips TEA5767HN FM Radio
tuner=63 - Philips FMD1216ME MK3 Hybrid Tuner
tuner=64 - LG TDVS-H062F/TUA6034
tuner=65 - Ymec TVF66T5-B/DFF
tuner=66 - LG NTSC (TALN mini series)

View File

@ -222,7 +222,7 @@ was introduced in 1991, is used in the DC10 old
can generate: PAL , NTSC , SECAM
The adv717x, should be able to produce PAL N. But you find nothing PAL N
specific in the the registers. Seem that you have to reuse a other standard
specific in the registers. Seem that you have to reuse a other standard
to generate PAL N, maybe it would work if you use the PAL M settings.
==========================

View File

@ -83,19 +83,18 @@ single address space optimization, so that the zap_page_range (from
vmtruncate) does not lose sending ipi's to cloned threads that might
be spawned underneath it and go to user mode to drag in pte's into tlbs.
swap_list_lock/swap_device_lock
-------------------------------
swap_lock
--------------
The swap devices are chained in priority order from the "swap_list" header.
The "swap_list" is used for the round-robin swaphandle allocation strategy.
The #free swaphandles is maintained in "nr_swap_pages". These two together
are protected by the swap_list_lock.
are protected by the swap_lock.
The swap_device_lock, which is per swap device, protects the reference
counts on the corresponding swaphandles, maintained in the "swap_map"
array, and the "highest_bit" and "lowest_bit" fields.
The swap_lock also protects all the device reference counts on the
corresponding swaphandles, maintained in the "swap_map" array, and the
"highest_bit" and "lowest_bit" fields.
Both of these are spinlocks, and are never acquired from intr level. The
locking hierarchy is swap_list_lock -> swap_device_lock.
The swap_lock is a spinlock, and is never acquired from intr level.
To prevent races between swap space deletion or async readahead swapins
deciding whether a swap handle is being used, ie worthy of being read in

Some files were not shown because too many files have changed in this diff Show More