This introduces a "register private expedited" membarrier command which
allows eventual removal of important memory barrier constraints on the
scheduler fast-paths. It changes how the "private expedited" membarrier
command (new to 4.14) is used from user-space.
The new command lets processes register their intent to use the
private expedited command. Because this changes how a command
introduced in 4.14-rc is meant to be used, it should be merged
before 4.14 final.
Processes are now required to register before using
MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.
This fixes a problem that arose when designing requested extensions to
sys_membarrier() to allow JITs to efficiently flush old code from
instruction caches. Several potential algorithms are much less painful
if the user registers intent to use this functionality early on, for
example, before the process spawns the second thread. Registering at
this time removes the need to interrupt each and every thread in that
process at the first expedited sys_membarrier() system call.
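The register-then-use sequence described above could look roughly like
this from user-space. This is a minimal sketch, not taken from the
patch: the MB_CMD_* names and the mb_*() helpers are illustrative, the
command values mirror the uapi header, and the call goes through
syscall(2) directly.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Local mirrors of the uapi command values (names are illustrative). */
enum {
	MB_CMD_QUERY                      = 0,
	MB_CMD_PRIVATE_EXPEDITED          = 1 << 3,
	MB_CMD_REGISTER_PRIVATE_EXPEDITED = 1 << 4,
};

/* Thin wrapper around the raw system call. */
static int mb_call(int cmd, int flags)
{
#ifdef __NR_membarrier
	return (int)syscall(__NR_membarrier, cmd, flags);
#else
	(void)cmd;
	(void)flags;
	errno = ENOSYS;
	return -1;
#endif
}

/* Pure helper: does a CMD_QUERY mask advertise both commands? */
static int mb_mask_has_private_expedited(int mask)
{
	int need = MB_CMD_PRIVATE_EXPEDITED | MB_CMD_REGISTER_PRIVATE_EXPEDITED;

	return mask >= 0 && (mask & need) == need;
}

/*
 * Register early (ideally while still single-threaded), then issue the
 * barrier. Returns 0 on success, -1 if unsupported or on failure.
 */
static int mb_register_then_barrier(void)
{
	int mask = mb_call(MB_CMD_QUERY, 0);

	if (!mb_mask_has_private_expedited(mask))
		return -1;
	if (mb_call(MB_CMD_REGISTER_PRIVATE_EXPEDITED, 0) != 0)
		return -1;
	/* Without the registration above, this call would fail with EPERM. */
	return mb_call(MB_CMD_PRIVATE_EXPEDITED, 0) == 0 ? 0 : -1;
}
```

A JIT would typically call the registration step once at startup and
then only the final mb_call() on each code flush.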
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs, using a cpumask
built from all runqueues whose current thread's mm is the same as that of
the thread calling sys_membarrier(). It executes faster than the non-expedited
variant (no blocking). It also works on NOHZ_FULL configurations.
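The mask construction can be modelled in user-space as follows. This is
a sketch under stated assumptions rather than the kernel code: the
rq_curr_mm[] array stands in for the mm of each runqueue's current
task, build_ipi_mask() is a hypothetical name, and skipping the
caller's own CPU is an assumption not spelled out above.

```c
#include <stdint.h>

/*
 * User-space model, not kernel code: rq_curr_mm[cpu] stands in for the
 * mm of the task currently running on that CPU, with 0 meaning
 * "kernel thread, no mm".
 */
static uint64_t build_ipi_mask(const int *rq_curr_mm, int ncpus,
			       int self_cpu, int caller_mm)
{
	uint64_t mask = 0;

	for (int cpu = 0; cpu < ncpus && cpu < 64; cpu++) {
		/* Assumption: the caller orders its own CPU locally. */
		if (cpu == self_cpu)
			continue;
		/* Only CPUs running a thread of the caller's mm get an IPI. */
		if (rq_curr_mm[cpu] != 0 && rq_curr_mm[cpu] == caller_mm)
			mask |= UINT64_C(1) << cpu;
	}
	return mask;
}
```

With per-CPU mm values {7, 7, 3, 7} and a caller on CPU 0 whose mm is
7, the model selects CPUs 1 and 3 (mask 0xA); CPU 2 runs another
process and is left alone.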
Scheduler-wise, it requires a memory barrier before and after context
switching between processes (which have different mm). The memory
barrier before context switch is already present. For the barrier after
context switch:
* Our TSO archs can do RELEASE without being a full barrier. Look at
x86 spin_unlock() being a regular STORE for example. But for those
archs, all atomics imply smp_mb and all of them have atomic ops in
switch_mm() for mm_cpumask(), and on x86 the CR3 load acts as a full
barrier.
* Of all weakly ordered machines, only ARM64 and PPC can do RELEASE
without a full barrier; the rest do indeed do smp_mb(), so there the
spin_unlock() is a full barrier and we're good.
* ARM64 has a very heavy barrier in switch_to(), which suffices.
* PPC just removed its barrier from switch_to(), but appears to be
talking about adding something to switch_mm(). So add a
smp_mb__after_unlock_lock() for now, until this is settled on the PPC
side.
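The ordering requirement discussed in the list above can be illustrated
with C11 atomics. This is only a user-space analogy, assuming a
RELEASE-only unlock: struct model_rq and model_finish_switch() are
hypothetical names, and the unconditional seq_cst fence stands in for
smp_mb__after_unlock_lock(), which in the kernel only costs something
on architectures that need it.

```c
#include <stdatomic.h>

struct model_rq {
	atomic_flag lock;	/* stands in for rq->lock */
	_Atomic int curr;	/* stands in for rq->curr (a task id here) */
};

static void model_finish_switch(struct model_rq *rq, int next)
{
	/* The store to rq->curr happens with the lock held. */
	atomic_store_explicit(&rq->curr, next, memory_order_relaxed);

	/* RELEASE-only unlock: not a full barrier on every architecture. */
	atomic_flag_clear_explicit(&rq->lock, memory_order_release);

	/* Stand-in for smp_mb__after_unlock_lock(): force a full barrier. */
	atomic_thread_fence(memory_order_seq_cst);
}
```

After the fence, the store to curr is ordered before any later
user-space memory access on this CPU, which is the property the
expedited command relies on.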
Changes since v3:
- Properly document the memory barriers provided by each architecture.
Changes since v2:
- Address comments from Peter Zijlstra.
- Add smp_mb__after_unlock_lock() after finish_lock_switch() in
finish_task_switch() to add the memory barrier we need after storing
to rq->curr. This is much simpler than the previous approach relying
on atomic_dec_and_test() in mmdrop(), which actually added a memory
barrier in the common case of switching between userspace processes.
- Return -EINVAL when MEMBARRIER_CMD_SHARED is used on a nohz_full
kernel, rather than having the whole membarrier system call returning
-ENOSYS. Indeed, CMD_PRIVATE_EXPEDITED is compatible with nohz_full.
Adapt the CMD_QUERY mask accordingly.
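The return-value behaviour described in the v2 notes can be summarised
by a small model. This is a sketch for illustration, not the kernel
implementation: model_membarrier() and the MODEL_CMD_* names are
hypothetical, the nohz_full state is passed in as a parameter, and
rejecting non-zero flags is an assumption.

```c
#include <errno.h>
#include <stdbool.h>

enum {
	MODEL_CMD_QUERY             = 0,
	MODEL_CMD_SHARED            = 1 << 0,
	MODEL_CMD_PRIVATE_EXPEDITED = 1 << 3,
};

static int model_membarrier(int cmd, int flags, bool nohz_full)
{
	if (flags)
		return -EINVAL;	/* assumption: no flags are defined */

	switch (cmd) {
	case MODEL_CMD_QUERY: {
		/* The private expedited command is always advertised. */
		int mask = MODEL_CMD_PRIVATE_EXPEDITED;

		/* CMD_SHARED is only advertised without nohz_full. */
		if (!nohz_full)
			mask |= MODEL_CMD_SHARED;
		return mask;
	}
	case MODEL_CMD_SHARED:
		/* Rejected per command; the syscall itself stays available. */
		return nohz_full ? -EINVAL : 0;
	case MODEL_CMD_PRIVATE_EXPEDITED:
		return 0;
	default:
		return -EINVAL;
	}
}
```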
Changes since v1:
- move membarrier code under kernel/sched/ because it uses the
scheduler runqueue,
- only add the barrier when we switch from a kernel thread. The case
where we switch from a user-space thread is already handled by
the atomic_dec_and_test() in mmdrop().
- add a comment to mmdrop() documenting the requirement on the implicit
memory barrier.
CC: Peter Zijlstra <peterz@infradead.org>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Andrew Hunter <ahh@google.com>
CC: Maged Michael <maged.michael@gmail.com>
CC: gromer@google.com
CC: Avi Kivity <avi@scylladb.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Dave Watson <davejwatson@fb.com>