59 lines
2.3 KiB
Plaintext
59 lines
2.3 KiB
Plaintext
|
Using RCU's CPU Stall Detector
|
||
|
|
||
|
The CONFIG_RCU_CPU_STALL_DETECTOR kernel config parameter enables
|
||
|
RCU's CPU stall detector, which detects conditions that unduly delay
|
||
|
RCU grace periods. The stall detector's idea of what constitutes
|
||
|
"unduly delayed" is controlled by a pair of C preprocessor macros:
|
||
|
|
||
|
RCU_SECONDS_TILL_STALL_CHECK
|
||
|
|
||
|
This macro defines the period of time that RCU will wait from
|
||
|
the beginning of a grace period until it issues an RCU CPU
|
||
|
stall warning. It is normally ten seconds.
|
||
|
|
||
|
RCU_SECONDS_TILL_STALL_RECHECK
|
||
|
|
||
|
This macro defines the period of time that RCU will wait after
|
||
|
issuing a stall warning until it issues another stall warning.
|
||
|
It is normally set to thirty seconds.
|
||
|
|
||
|
RCU_STALL_RAT_DELAY
|
||
|
|
||
|
The CPU stall detector tries to make the offending CPU rat on itself,
|
||
|
as this often gives better-quality stack traces. However, if
|
||
|
the offending CPU does not detect its own stall in the number
|
||
|
of jiffies specified by RCU_STALL_RAT_DELAY, then other CPUs will
|
||
|
complain. This is normally set to two jiffies.
|
||
|
|
||
|
The following problems can result in an RCU CPU stall warning:
|
||
|
|
||
|
o A CPU looping in an RCU read-side critical section.
|
||
|
|
||
|
o A CPU looping with interrupts disabled.
|
||
|
|
||
|
o A CPU looping with preemption disabled.
|
||
|
|
||
|
o For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the kernel
|
||
|
without invoking schedule().
|
||
|
|
||
|
o A bug in the RCU implementation.
|
||
|
|
||
|
o A hardware failure. This is quite unlikely, but has occurred
|
||
|
at least once in a former life. A CPU failed in a running system,
|
||
|
becoming unresponsive, but not causing an immediate crash.
|
||
|
This resulted in a series of RCU CPU stall warnings, eventually
|
||
|
leading the realization that the CPU had failed.
|
||
|
|
||
|
The RCU, RCU-sched, and RCU-bh implementations have CPU stall warning.
|
||
|
SRCU does not do so directly, but its calls to synchronize_sched() will
|
||
|
result in RCU-sched detecting any CPU stalls that might be occurring.
|
||
|
|
||
|
To diagnose the cause of the stall, inspect the stack traces. The offending
|
||
|
function will usually be near the top of the stack. If you have a series
|
||
|
of stall warnings from a single extended stall, comparing the stack traces
|
||
|
can often help determine where the stall is occurring, which will usually
|
||
|
be in the function nearest the top of the stack that stays the same from
|
||
|
trace to trace.
|
||
|
|
||
|
RCU bugs can often be debugged with the help of CONFIG_RCU_TRACE.
|