We want to centralize the isolation features, to be done by the housekeeping
subsystem and scheduler domain isolation is a significant part of it.
No intended behaviour change, we just reuse the housekeeping cpumask
and core code.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1509072159-31808-11-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We want to centralize the isolation management, done by the housekeeping
subsystem. Therefore we need to handle the nohz_full= parameter from
there.
Since nohz_full= so far has involved unbound timers, watchdog, RCU
and tilegx NAPI isolation, we keep that default behaviour.
nohz_full= will be deprecated in the future. We want to control
the isolation features from the isolcpus= parameter.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1509072159-31808-10-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Before we implement isolcpus under housekeeping, we need the isolation
features to be more finegrained. For example some people want NOHZ_FULL
without the full scheduler isolation, others want full scheduler
isolation without NOHZ_FULL.
So let's cut all these isolation features piecewise, at the risk of
overcutting it right now. We can still merge some flags later if they
always make sense together.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1509072159-31808-9-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Split the housekeeping config from CONFIG_NO_HZ_FULL. This way we finally
separate the isolation code from NOHZ.
Although a dependency to CONFIG_NO_HZ_FULL remains for now, while the
housekeeping code still deals with NOHZ internals.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1509072159-31808-8-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fit it into the housekeeping_*() namespace.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1509072159-31808-7-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Housekeeping code still depends on the nohz_full static key. Since we want
to decouple housekeeping from NOHZ, let's create a housekeeping specific
static key.
It's mostly relevant for calls to is_housekeeping_cpu() from the scheduler.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1509072159-31808-6-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Nobody needs to access this detail. housekeeping_cpumask() already
takes care of it.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1509072159-31808-5-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While trying to disable the watchog on nohz_full CPUs, the watchdog
implements an ad-hoc version of housekeeping_cpumask(). Lets replace
those re-invented lines.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1509072159-31808-3-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The housekeeping code is currently tied to the NOHZ code. As we are
planning to make housekeeping independent from it, start with moving
the relevant code to its own file.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1509072159-31808-2-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
cpulist_parse() uses nr_cpumask_bits as a limit to parse the
passed buffer from kernel commandline. What nr_cpumask_bits
represents varies depending upon the CONFIG_CPUMASK_OFFSTACK option:
- If CONFIG_CPUMASK_OFFSTACK=n, then nr_cpumask_bits is the same as
NR_CPUS, which might not represent the # of CPUs that really exist
(default 64). So, there's a chance of a gap between nr_cpu_ids
and NR_CPUS, which ultimately lead towards invalid cpulist_parse()
operation. For example, if isolcpus=9 is passed on an 8 cpu
system (CONFIG_CPUMASK_OFFSTACK=n) it doesn't show the error
that it's supposed to.
This patch fixes this bug by finding the last CPU of the passed
isolcpus= list and checking it against nr_cpu_ids.
It also fixes the error message where the nr_cpu_ids should be
nr_cpu_ids-1, since CPU numbering starts from 0.
Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: adobriyan@gmail.com
Cc: akpm@linux-foundation.org
Cc: longman@redhat.com
Cc: mka@chromium.org
Cc: tj@kernel.org
Link: http://lkml.kernel.org/r/20171023130154.9050-1-rakib.mullick@gmail.com
[ Enhanced the changelog and the kernel message. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
include/linux/cpumask.h | 16 ++++++++++++++++
kernel/sched/topology.c | 4 ++--
2 files changed, 18 insertions(+), 2 deletions(-)
When a CPU lowers its priority (schedules out a high priority task for a
lower priority one), a check is made to see if any other CPU has overloaded
RT tasks (more than one). It checks the rto_mask to determine this and if so
it will request to pull one of those tasks to itself if the non running RT
task is of higher priority than the new priority of the next task to run on
the current CPU.
When we deal with large number of CPUs, the original pull logic suffered
from large lock contention on a single CPU run queue, which caused a huge
latency across all CPUs. This was caused by only having one CPU having
overloaded RT tasks and a bunch of other CPUs lowering their priority. To
solve this issue, commit:
b6366f048e ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")
changed the way to request a pull. Instead of grabbing the lock of the
overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.
Although the IPI logic worked very well in removing the large latency build
up, it still could suffer from a large number of IPIs being sent to a single
CPU. On a 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
when I tested this on a 120 CPU box, with a stress test that had lots of
RT tasks scheduling on all CPUs, it actually triggered the hard lockup
detector! One CPU had so many IPIs sent to it, and due to the restart
mechanism that is triggered when the source run queue has a priority status
change, the CPU spent minutes! processing the IPIs.
Thinking about this further, I realized there's no reason for each run queue
to send its own IPI. As all CPUs with overloaded tasks must be scanned
regardless if there's one or many CPUs lowering their priority, because
there's no current way to find the CPU with the highest priority task that
can schedule to one of these CPUs, there really only needs to be one IPI
being sent around at a time.
This greatly simplifies the code!
The new approach is to have each root domain have its own irq work, as the
rto_mask is per root domain. The root domain has the following fields
attached to it:
rto_push_work - the irq work to process each CPU set in rto_mask
rto_lock - the lock to protect some of the other rto fields
rto_loop_start - an atomic that keeps contention down on rto_lock
the first CPU scheduling in a lower priority task
is the one to kick off the process.
rto_loop_next - an atomic that gets incremented for each CPU that
schedules in a lower priority task.
rto_loop - a variable protected by rto_lock that is used to
compare against rto_loop_next
rto_cpu - The cpu to send the next IPI to, also protected by
the rto_lock.
When a CPU schedules in a lower priority task and wants to make sure
overloaded CPUs know about it. It increments the rto_loop_next. Then it
atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
then it is done, as another CPU is kicking off the IPI loop. If the old
value is "0", then it will take the rto_lock to synchronize with a possible
IPI being sent around to the overloaded CPUs.
If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
IPI being sent around, or one is about to finish. Then rto_cpu is set to the
first CPU in rto_mask and an IPI is sent to that CPU. If there's no CPUs set
in rto_mask, then there's nothing to be done.
When the CPU receives the IPI, it will first try to push any RT tasks that is
queued on the CPU but can't run because a higher priority RT task is
currently running on that CPU.
Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
finds one, it simply sends an IPI to that CPU and the process continues.
If there's no more CPUs in the rto_mask, then rto_loop is compared with
rto_loop_next. If they match, everything is done and the process is over. If
they do not match, then a CPU scheduled in a lower priority task as the IPI
was being passed around, and the process needs to start again. The first CPU
in rto_mask is sent the IPI.
This change removes this duplication of work in the IPI logic, and greatly
lowers the latency caused by the IPIs. This removed the lockup happening on
the 120 CPU machine. It also simplifies the code tremendously. What else
could anyone ask for?
Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
supplying me with the rto_start_trylock() and rto_start_unlock() helper
functions.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Clark Williams <williams@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
find_idlest_group() returns NULL when the local group is idlest. The
caller then continues the find_idlest_group() search at a lower level
of the current CPU's sched_domain hierarchy. find_idlest_group_cpu() is
not consulted and, crucially, @new_cpu is not updated. This means the
search is pointless and we return @prev_cpu from select_task_rq_fair().
This is fixed by initialising @new_cpu to @cpu instead of @prev_cpu.
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171005114516.18617-6-brendan.jackman@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When 'p' is not allowed on any of the CPUs in the sched_domain, we
currently return NULL from find_idlest_group(), and pointlessly
continue the search on lower sched_domain levels (where 'p' is also not
allowed) before returning prev_cpu regardless (as we have not updated
new_cpu).
Add an explicit check for this case, and add a comment to
find_idlest_group(). Now when find_idlest_group() returns NULL, it always
means that the local group is allowed and idlest.
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171005114516.18617-5-brendan.jackman@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When the local group is not allowed we do not modify this_*_load from
their initial value of 0. That means that the load checks at the end
of find_idlest_group cause us to incorrectly return NULL. Fixing the
initial values to ULONG_MAX means we will instead return the idlest
remote group in that case.
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171005114516.18617-4-brendan.jackman@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since commit:
83a0a96a5f ("sched/fair: Leverage the idle state info when choosing the "idlest" cpu")
find_idlest_group_cpu() (formerly find_idlest_cpu) no longer returns -1,
so we can simplify the checking of the return value in find_idlest_cpu().
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171005114516.18617-3-brendan.jackman@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In preparation for changes that would otherwise require adding a new
level of indentation to the while(sd) loop, create a new function
find_idlest_cpu() which contains this loop, and rename the existing
find_idlest_cpu() to find_idlest_group_cpu().
Code inside the while(sd) loop is unchanged. @new_cpu is added as a
variable in the new function, with the same initial value as the
@new_cpu in select_task_rq_fair().
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171005114516.18617-2-brendan.jackman@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The "goto force_balance" here is intended to mitigate the fact that
avg_load calculations can result in bad placement decisions when
priority is asymmetrical.
The original commit that adds it:
fab476228b ("sched: Force balancing on newidle balance if local group has capacity")
explains:
Under certain situations, such as a niced down task (i.e. nice =
-15) in the presence of nr_cpus NICE0 tasks, the niced task lands
on a sched group and kicks away other tasks because of its large
weight. This leads to sub-optimal utilization of the
machine. Even though the sched group has capacity, it does not
pull tasks because sds.this_load >> sds.max_load, and f_b_g()
returns NULL.
A similar but inverted issue also affects ARM big.LITTLE (asymmetrical CPU
capacity) systems - consider 8 always-running, same-priority tasks on a
system with 4 "big" and 4 "little" CPUs. Suppose that 5 of them end up on
the "big" CPUs (which will be represented by one sched_group in the DIE
sched_domain) and 3 on the "little" (the other sched_group in DIE), leaving
one CPU unused. Because the "big" group has a higher group_capacity its
avg_load may not present an imbalance that would cause migrating a
task to the idle "little".
The force_balance case here solves the problem but currently only for
CPU_NEWLY_IDLE balances, which in theory might never happen on the
unused CPU. Including CPU_IDLE in the force_balance case means
there's an upper bound on the time before we can attempt to solve the
underutilization: after DIE's sd->balance_interval has passed the
next nohz balance kick will help us out.
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170807163900.25180-1-brendan.jackman@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We use task_util() in find_idlest_group() via capacity_spare_wake().
This task_util() updated in wake_cap(). However wake_cap() is not the
only reason for ending up in find_idlest_group() - we could have been sent
there by wake_wide(). So explicitly sync the task util with prev_cpu
when we are about to head to find_idlest_group().
We could simply do this at the beginning of
select_task_rq_fair() (i.e. irrespective of whether we're heading to
select_idle_sibling() or find_idlest_group() & co), but I didn't want to
slow down the select_idle_sibling() path more than necessary.
Don't do this during fork balancing, we won't need the task_util and
we'd just clobber the last_update_time, which is supposed to be 0.
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andres Oportus <andresoportus@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20170808095519.10077-1-brendan.jackman@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As a first step this patch makes cfs_tasks list as MRU one.
It means, that when a next task is picked to run on physical
CPU it is moved to the front of the list.
Therefore, the cfs_tasks list is more or less sorted (except
woken tasks) starting from recently given CPU time tasks toward
tasks with max wait time in a run-queue, i.e. MRU list.
Second, as part of the load balance operation, this approach
starts detach_tasks()/detach_one_task() from the tail of the
queue instead of the head, giving some advantages:
- tends to pick a task with highest wait time;
- tasks located in the tail are less likely cache-hot,
therefore the can_migrate_task() decision is higher.
hackbench illustrates slightly better performance. For example
doing 1000 samples and 40 groups on i5-3320M CPU, it shows below
figures:
default: 0.657 avg
patched: 0.646 avg
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Link: http://lkml.kernel.org/r/20170913102430.8985-2-urezki@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On AMD Family17h-based (EPYC) system, a logical NUMA node can contain
upto 8 cores (16 threads) with the following topology.
----------------------------
C0 | T0 T1 | || | T0 T1 | C4
--------| || |--------
C1 | T0 T1 | L3 || L3 | T0 T1 | C5
--------| || |--------
C2 | T0 T1 | #0 || #1 | T0 T1 | C6
--------| || |--------
C3 | T0 T1 | || | T0 T1 | C7
----------------------------
Here, there are 2 last-level (L3) caches per logical NUMA node.
A socket can contain upto 4 NUMA nodes, and a system can support
upto 2 sockets. With full system configuration, current scheduler
creates 4 sched domains:
domain0 SMT (span a core)
domain1 MC (span a last-level-cache)
domain2 NUMA (span a socket: 4 nodes)
domain3 NUMA (span a system: 8 nodes)
Note that there is no domain to represent cpus spaning a logical
NUMA node. With this hierarchy of sched domains, the scheduler does
not balance properly in the following cases:
Case1:
When running 8 tasks, a properly balanced system should
schedule a task per logical NUMA node. This is not the case for
the current scheduler.
Case2:
In some cases, threads are scheduled on the same cpu, while other
cpus are idle. This results in run-to-run inconsistency. For example:
taskset -c 0-7 sysbench --num-threads=8 --test=cpu \
--cpu-max-prime=100000 run
Total execution time ranges from 25.1s to 33.5s depending on threads
placement, where 25.1s is when all 8 threads are balanced properly
on 8 cpus.
Introducing NUMA identity node sched domain, which is based on how
SRAT/SLIT table define a logical NUMA node. This results in the following
hierarchy of sched domains on the same system described above.
domain0 SMT (span a core)
domain1 MC (span a last-level-cache)
domain2 NODE (span a logical NUMA node)
domain3 NUMA (span a socket: 4 nodes)
domain4 NUMA (span a system: 8 nodes)
This fixes the improper load balancing cases mentioned above.
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@suse.de
Link: http://lkml.kernel.org/r/1504768805-46716-1-git-send-email-suravee.suthikulpanit@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The normal x86_topology on NHM+ machines degenerates because the MC
and CPU domains are of the same size, therefore MC inherits
SD_PREFER_SIBLING from CPU (which then gets taken out). The result is
that we'll spread tasks across the first NUMA level in order to
maximize cache utilization.
However, for the x86_numa_in_package_topology we loose the CPU domain,
and we'll not have SD_PREFER_SIBLING set anywhere, giving a distinct
difference in behaviour.
Commit:
8e7fbcbc22 ("sched: Remove stale power aware scheduling remnants and dysfunctional knobs")
made a fail by not preserving the SD_PREFER_SIBLING for the !power_saving
case on both CPU and MC.
Then commit:
6956dc568f ("sched/numa: Add SD_PERFER_SIBLING to CPU domain")
adds it back to the CPU but not MC.
Restore that now, such that we get consistent spreading behaviour wrt
L3 and NUMA.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
__dl_sub() is more meaningful as a name, and is more consistent
with the naming of the dual function (__dl_add()).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1504778971-13573-4-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix a bug introduced in:
72f9f3fdc9 ("sched/deadline: Remove dl_new from struct sched_dl_entity")
After that commit, when switching to -deadline if the scheduling
deadline of a task is in the past then switched_to_dl() calls
setup_new_entity() to properly initialize the scheduling deadline
and runtime.
The problem is that the task is enqueued _before_ having its parameters
initialized by setup_new_entity(), and this can cause problems.
For example, a task with its out-of-date deadline in the past will
potentially be enqueued as the highest priority one; however, its
adjusted deadline may not be the earliest one.
This patch fixes the problem by initializing the task's parameters before
enqueuing it.
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1504778971-13573-3-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Steve requested better names for the new task-state helper functions.
So introduce the concept of task-state index for the printing and
rename __get_task_state() to task_state_index() and
__task_state_to_char() to task_index_to_char().
Requested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170929115016.pzlqc7ss3ccystyg@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
quiet_vmstat() is an expensive function that only makes sense when we
go into NOHZ.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: aubrey.li@linux.intel.com
Cc: cl@linux.com
Cc: fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While load_balance() masks the source CPUs against active_mask, it had
a hole against the destination CPU. Ensure the destination CPU is also
part of the 'domain-mask & active-mask' set.
Reported-by: Levin, Alexander (Sasha Levin) <alexander.levin@verizon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 77d1dfda0e ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The trivial wake_affine_idle() implementation is very good for a
number of workloads, but it comes apart at the moment there are no
idle CPUs left, IOW. the overloaded case.
hackbench:
NO_WA_WEIGHT WA_WEIGHT
hackbench-20 : 7.362717561 seconds 6.450509391 seconds
(win)
netperf:
NO_WA_WEIGHT WA_WEIGHT
TCP_SENDFILE-1 : Avg: 54524.6 Avg: 52224.3
TCP_SENDFILE-10 : Avg: 48185.2 Avg: 46504.3
TCP_SENDFILE-20 : Avg: 29031.2 Avg: 28610.3
TCP_SENDFILE-40 : Avg: 9819.72 Avg: 9253.12
TCP_SENDFILE-80 : Avg: 5355.3 Avg: 4687.4
TCP_STREAM-1 : Avg: 41448.3 Avg: 42254
TCP_STREAM-10 : Avg: 24123.2 Avg: 25847.9
TCP_STREAM-20 : Avg: 15834.5 Avg: 18374.4
TCP_STREAM-40 : Avg: 5583.91 Avg: 5599.57
TCP_STREAM-80 : Avg: 2329.66 Avg: 2726.41
TCP_RR-1 : Avg: 80473.5 Avg: 82638.8
TCP_RR-10 : Avg: 72660.5 Avg: 73265.1
TCP_RR-20 : Avg: 52607.1 Avg: 52634.5
TCP_RR-40 : Avg: 57199.2 Avg: 56302.3
TCP_RR-80 : Avg: 25330.3 Avg: 26867.9
UDP_RR-1 : Avg: 108266 Avg: 107844
UDP_RR-10 : Avg: 95480 Avg: 95245.2
UDP_RR-20 : Avg: 68770.8 Avg: 68673.7
UDP_RR-40 : Avg: 76231 Avg: 75419.1
UDP_RR-80 : Avg: 34578.3 Avg: 35639.1
UDP_STREAM-1 : Avg: 64684.3 Avg: 66606
UDP_STREAM-10 : Avg: 52701.2 Avg: 52959.5
UDP_STREAM-20 : Avg: 30376.4 Avg: 29704
UDP_STREAM-40 : Avg: 15685.8 Avg: 15266.5
UDP_STREAM-80 : Avg: 8415.13 Avg: 7388.97
(wins and losses)
sysbench:
NO_WA_WEIGHT WA_WEIGHT
sysbench-mysql-2 : 2135.17 per sec. 2142.51 per sec.
sysbench-mysql-5 : 4809.68 per sec. 4800.19 per sec.
sysbench-mysql-10 : 9158.59 per sec. 9157.05 per sec.
sysbench-mysql-20 : 14570.70 per sec. 14543.55 per sec.
sysbench-mysql-40 : 22130.56 per sec. 22184.82 per sec.
sysbench-mysql-80 : 20995.56 per sec. 21904.18 per sec.
sysbench-psql-2 : 1679.58 per sec. 1705.06 per sec.
sysbench-psql-5 : 3797.69 per sec. 3879.93 per sec.
sysbench-psql-10 : 7253.22 per sec. 7258.06 per sec.
sysbench-psql-20 : 11166.75 per sec. 11220.00 per sec.
sysbench-psql-40 : 17277.28 per sec. 17359.78 per sec.
sysbench-psql-80 : 17112.44 per sec. 17221.16 per sec.
(increase on the top end)
tbench:
NO_WA_WEIGHT
Throughput 685.211 MB/sec 2 clients 2 procs max_latency=0.123 ms
Throughput 1596.64 MB/sec 5 clients 5 procs max_latency=0.119 ms
Throughput 2985.47 MB/sec 10 clients 10 procs max_latency=0.262 ms
Throughput 4521.15 MB/sec 20 clients 20 procs max_latency=0.506 ms
Throughput 9438.1 MB/sec 40 clients 40 procs max_latency=2.052 ms
Throughput 8210.5 MB/sec 80 clients 80 procs max_latency=8.310 ms
WA_WEIGHT
Throughput 697.292 MB/sec 2 clients 2 procs max_latency=0.127 ms
Throughput 1596.48 MB/sec 5 clients 5 procs max_latency=0.080 ms
Throughput 2975.22 MB/sec 10 clients 10 procs max_latency=0.254 ms
Throughput 4575.14 MB/sec 20 clients 20 procs max_latency=0.502 ms
Throughput 9468.65 MB/sec 40 clients 40 procs max_latency=2.069 ms
Throughput 8631.73 MB/sec 80 clients 80 procs max_latency=8.605 ms
(increase on the top end)
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Eric reported a sysbench regression against commit:
3fed382b46 ("sched/numa: Implement NUMA node level wake_affine()")
Similarly, Rik was looking at the NAS-lu.C benchmark, which regressed
against his v3.10 enterprise kernel.
PRE (current tip/master):
ivb-ep sysbench:
2: [30 secs] transactions: 64110 (2136.94 per sec.)
5: [30 secs] transactions: 143644 (4787.99 per sec.)
10: [30 secs] transactions: 274298 (9142.93 per sec.)
20: [30 secs] transactions: 418683 (13955.45 per sec.)
40: [30 secs] transactions: 320731 (10690.15 per sec.)
80: [30 secs] transactions: 355096 (11834.28 per sec.)
hsw-ex NAS:
OMP_PROC_BIND/lu.C.x_threads_144_run_1.log: Time in seconds = 18.01
OMP_PROC_BIND/lu.C.x_threads_144_run_2.log: Time in seconds = 17.89
OMP_PROC_BIND/lu.C.x_threads_144_run_3.log: Time in seconds = 17.93
lu.C.x_threads_144_run_1.log: Time in seconds = 434.68
lu.C.x_threads_144_run_2.log: Time in seconds = 405.36
lu.C.x_threads_144_run_3.log: Time in seconds = 433.83
POST (+patch):
ivb-ep sysbench:
2: [30 secs] transactions: 64494 (2149.75 per sec.)
5: [30 secs] transactions: 145114 (4836.99 per sec.)
10: [30 secs] transactions: 278311 (9276.69 per sec.)
20: [30 secs] transactions: 437169 (14571.60 per sec.)
40: [30 secs] transactions: 669837 (22326.73 per sec.)
80: [30 secs] transactions: 631739 (21055.88 per sec.)
hsw-ex NAS:
lu.C.x_threads_144_run_1.log: Time in seconds = 23.36
lu.C.x_threads_144_run_2.log: Time in seconds = 22.96
lu.C.x_threads_144_run_3.log: Time in seconds = 22.52
This patch takes out all the shiny wake_affine() stuff and goes back to
utter basics. Between the two CPUs involved with the wakeup (the CPU
doing the wakeup and the CPU we ran on previously) pick the CPU we can
run on _now_.
This restores much of the regressions against the older kernels,
but leaves some ground in the overloaded case. The default-enabled
WA_WEIGHT (which will be introduced in the next patch) is an attempt
to address the overloaded situation.
Reported-by: Eric Farman <farman@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Rosato <mjrosato@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jinpuwang@gmail.com
Cc: vcaputo@pengaru.com
Fixes: 3fed382b46 ("sched/numa: Implement NUMA node level wake_affine()")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull networking fixes from David Miller:
1) Fix object leak on IPSEC offload failure, from Steffen Klassert.
2) Fix range checks in ipset address range addition operations, from
Jozsef Kadlecsik.
3) Fix pernet ops unregistration order in ipset, from Florian Westphal.
4) Add missing netlink attribute policy for nl80211 packet pattern
attrs, from Peng Xu.
5) Fix PPP device destruction race, from Guillaume Nault.
6) Write marks get lost when BPF verifier processes R1=R2 register
assignments, causing incorrect liveness information and less state
pruning. Fix from Alexei Starovoitov.
7) Fix blockhole routes so that they are marked dead and therefore not
cached in sockets, otherwise IPSEC stops working. From Steffen
Klassert.
8) Fix broadcast handling of UDP socket early demux, from Paolo Abeni.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (37 commits)
cdc_ether: flag the u-blox TOBY-L2 and SARA-U2 as wwan
net: thunderx: mark expected switch fall-throughs in nicvf_main()
udp: fix bcast packet reception
netlink: do not set cb_running if dump's start() errs
ipv4: Fix traffic triggered IPsec connections.
ipv6: Fix traffic triggered IPsec connections.
ixgbe: incorrect XDP ring accounting in ethtool tx_frame param
net: ixgbe: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag
Revert commit 1a8b6d76dc ("net:add one common config...")
ixgbe: fix masking of bits read from IXGBE_VXLANCTRL register
ixgbe: Return error when getting PHY address if PHY access is not supported
netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'
netfilter: SYNPROXY: skip non-tcp packet in {ipv4, ipv6}_synproxy_hook
tipc: Unclone message at secondary destination lookup
tipc: correct initialization of skb list
gso: fix payload length when gso_size is zero
mlxsw: spectrum_router: Avoid expensive lookup during route removal
bpf: fix liveness marking
doc: Fix typo "8023.ad" in bonding documentation
ipv6: fix net.ipv6.conf.all.accept_dad behaviour for real
...
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains Netfilter/IPVS fixes for your net tree,
they are:
1) Fix packet drops due to incorrect ECN handling in IPVS, from Vadim
Fedorenko.
2) Fix splat with mark restoration in xt_socket with non-full-sock,
patch from Subash Abhinov Kasiviswanathan.
3) ipset bogusly bails out when adding IPv4 range containing more than
2^31 addresses, from Jozsef Kadlecsik.
4) Incorrect pernet unregistration order in ipset, from Florian Westphal.
5) Races between dump and swap in ipset results in BUG_ON splats, from
Ross Lagerwall.
6) Fix chain renames in nf_tables, from JingPiao Chen.
7) Fix race in pernet codepath with ebtables table registration, from
Artem Savkov.
8) Memory leak in error path in set name allocation in nf_tables, patch
from Arvind Yadav.
9) Don't dump chain counters if they are not available, this fixes a
crash when listing the ruleset.
10) Fix out of bound memory read in strlcpy() in x_tables compat code,
from Eric Dumazet.
11) Make sure we only process TCP packets in SYNPROXY hooks, patch from
Lin Zhang.
12) Cannot load rules incrementally anymore after xt_bpf with pinned
objects, added in revision 1. From Shmulik Ladkani.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 2c16d60332 ("netfilter: xt_bpf: support ebpf") introduced
support for attaching an eBPF object by an fd, with the
'bpf_mt_check_v1' ABI expecting the '.fd' to be specified upon each
IPT_SO_SET_REPLACE call.
However this breaks subsequent iptables calls:
# iptables -A INPUT -m bpf --object-pinned /sys/fs/bpf/xxx -j ACCEPT
# iptables -A INPUT -s 5.6.7.8 -j ACCEPT
iptables: Invalid argument. Run `dmesg' for more information.
That's because iptables works by loading existing rules using
IPT_SO_GET_ENTRIES to userspace, then issuing IPT_SO_SET_REPLACE with
the replacement set.
However, the loaded 'xt_bpf_info_v1' has an arbitrary '.fd' number
(from the initial "iptables -m bpf" invocation) - so when 2nd invocation
occurs, userspace passes a bogus fd number, which leads to
'bpf_mt_check_v1' to fail.
One suggested solution [1] was to hack iptables userspace, to perform a
"entries fixup" immediatley after IPT_SO_GET_ENTRIES, by opening a new,
process-local fd per every 'xt_bpf_info_v1' entry seen.
However, in [2] both Pablo Neira Ayuso and Willem de Bruijn suggested to
depricate the xt_bpf_info_v1 ABI dealing with pinned ebpf objects.
This fix changes the XT_BPF_MODE_FD_PINNED behavior to ignore the given
'.fd' and instead perform an in-kernel lookup for the bpf object given
the provided '.path'.
It also defines an alias for the XT_BPF_MODE_FD_PINNED mode, named
XT_BPF_MODE_PATH_PINNED, to better reflect the fact that the user is
expected to provide the path of the pinned object.
Existing XT_BPF_MODE_FD_ELF behavior (non-pinned fd mode) is preserved.
References: [1] https://marc.info/?l=netfilter-devel&m=150564724607440&w=2
[2] https://marc.info/?l=netfilter-devel&m=150575727129880&w=2
Reported-by: Rafael Buchbinder <rafi@rbk.ms>
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
while processing Rx = Ry instruction the verifier does
regs[insn->dst_reg] = regs[insn->src_reg]
which often clears write mark (when Ry doesn't have it)
that was just set by check_reg_arg(Rx) prior to the assignment.
That causes mark_reg_read() to keep marking Rx in this block as
REG_LIVE_READ (since the logic incorrectly misses that it's
screened by the write) and in many of its parents (until lucky
write into the same Rx or beginning of the program).
That causes is_state_visited() logic to miss many pruning opportunities.
Furthermore mark_reg_read() logic propagates the read mark
for BPF_REG_FP as well (though it's readonly) which causes
harmless but unnecssary work during is_state_visited().
Note that do_propagate_liveness() skips FP correctly,
so do the same in mark_reg_read() as well.
It saves 0.2 seconds for the test below
program before after
bpf_lb-DLB_L3.o 2604 2304
bpf_lb-DLB_L4.o 11159 3723
bpf_lb-DUNKNOWN.o 1116 1110
bpf_lxc-DDROP_ALL.o 34566 28004
bpf_lxc-DUNKNOWN.o 53267 39026
bpf_netdev.o 17843 16943
bpf_overlay.o 8672 7929
time ~11 sec ~4 sec
Fixes: dc503a8ad9 ("bpf/verifier: track liveness for pruning")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Edward Cree <ecree@solarflare.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull watchddog clean-up and fixes from Thomas Gleixner:
"The watchdog (hard/softlockup detector) code is pretty much broken in
its current state. The patch series addresses this by removing all
duct tape and refactoring it into a workable state.
The reasons why I ask for inclusion that late in the cycle are:
1) The code causes lockdep splats vs. hotplug locking which get
reported over and over. Unfortunately there is no easy fix.
2) The risk of breakage is minimal because it's already broken
3) As 4.14 is a long term stable kernel, I prefer to have working
watchdog code in that and the lockdep issues resolved. I wouldn't
ask you to pull if 4.14 wouldn't be a LTS kernel or if the
solution would be easy to backport.
4) The series was around before the merge window opened, but then got
delayed due to the UP failure caused by the for_each_cpu()
surprise which we discussed recently.
Changes vs. V1:
- Addressed your review points
- Addressed the warning in the powerpc code which was discovered late
- Changed two function names which made sense up to a certain point
in the series. Now they match what they do in the end.
- Fixed a 'unused variable' warning, which got not detected by the
intel robot. I triggered it when trying all possible related config
combinations manually. Randconfig testing seems not random enough.
The changes have been tested by and reviewed by Don Zickus and tested
and acked by Micheal Ellerman for powerpc"
* 'core-watchdog-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
watchdog/core: Put softlockup_threads_initialized under ifdef guard
watchdog/core: Rename some softlockup_* functions
powerpc/watchdog: Make use of watchdog_nmi_probe()
watchdog/core, powerpc: Lock cpus across reconfiguration
watchdog/core, powerpc: Replace watchdog_nmi_reconfigure()
watchdog/hardlockup/perf: Fix spelling mistake: "permanetely" -> "permanently"
watchdog/hardlockup/perf: Cure UP damage
watchdog/hardlockup: Clean up hotplug locking mess
watchdog/hardlockup/perf: Simplify deferred event destroy
watchdog/hardlockup/perf: Use new perf CPU enable mechanism
watchdog/hardlockup/perf: Implement CPU enable replacement
watchdog/hardlockup/perf: Implement init time detection of perf
watchdog/hardlockup/perf: Implement init time perf validation
watchdog/core: Get rid of the racy update loop
watchdog/core, powerpc: Make watchdog_nmi_reconfigure() two stage
watchdog/sysctl: Clean up sysctl variable name space
watchdog/sysctl: Get rid of the #ifdeffery
watchdog/core: Clean up header mess
watchdog/core: Further simplify sysctl handling
watchdog/core: Get rid of the thread teardown/setup dance
...
This fixes a code ordering issue in the main suspend-to-idle loop
that causes some "low power S0 idle" conditions to be incorrectly
reported as unmet with suspend/resume debug messages enabled.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJZ1rGKAAoJEILEb/54YlRxONQP/0NzL79PrxFgtbsnhZZfDZ+U
os6pcNYl9J+n2YVq5NAP9nnqEhN3E2gctJwjYMRIuJC/g6uU8Z6Ym6h7D+QfZrv1
yzEsbQgLh8N9nR+lUGRi+meoF8BolOLeXnNgB18uP1ZShZLikvAELxMkmJUx+TjW
CVQaOkXe4I/Ey5O4Jjur+tgVn+ik+xl40akw7+wHAnY+I7KwzLffP7nwHBGavHsd
dCtbcRogWzpcihgpLpgJMaixjZXakJ2n/Zmg+IdpnYt9WRIMy2ztTu3+bPRPEVVJ
hcP9p93r1BZclkyyYNyM0QMv4Ac96xBOe8qigXP/9EdtWYHLeV+N+N+EFbeutHgn
LsiuO4h9FCrv4ltn7jm88sTyna0IpuKSva9pWk9nopI8e5r/Yplvi00ZJW1lz/B7
xrPWHdm6ozuK7UGQQoeeb5CVOuClTCBC3PFTw/XTD0emogjI+xJi802zIG00eRPm
VBAvMilLeS1ezegVWJte6wPptF3wUW2Ss6jYzh4zVgnJ5XKgfxbANhLeQizAhGFj
lxxW9/MS1APxTrV9VluY7zyqq/s9H1iWUMyf2H06zjetBfKPiqEn+oL4/k4rnYN6
SB4WrT8CwWxsDusMxN0aVjfftRwnGQ1+kPX3EP7mYQFP+zwiK44iilL7D37Rdyoe
Z8Hdmdf/sCORK3zXZWd8
=6lm4
-----END PGP SIGNATURE-----
Merge tag 'pm-4.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fix from Rafael Wysocki:
"This fixes a code ordering issue in the main suspend-to-idle loop that
causes some "low power S0 idle" conditions to be incorrectly reported
as unmet with suspend/resume debug messages enabled"
* tag 'pm-4.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM / s2idle: Invoke the ->wake() platform callback earlier
Pull networking fixes from David Miller:
1) Check iwlwifi 9000 reorder buffer out-of-space condition properly,
from Sara Sharon.
2) Fix RCU splat in qualcomm rmnet driver, from Subash Abhinov
Kasiviswanathan.
3) Fix session and tunnel release races in l2tp, from Guillaume Nault
and Sabrina Dubroca.
4) Fix endian bug in sctp_diag_dump(), from Dan Carpenter.
5) Several mlx5 driver fixes from the Mellanox folks (max flow counters
cap check, invalid memory access in IPoIB support, etc.)
6) tun_get_user() should bail if skb->len is zero, from Alexander
Potapenko.
7) Fix RCU lookups in inetpeer, from Eric Dumazet.
8) Fix locking in packet_do_bund().
9) Handle cb->start() error properly in netlink dump code, from Jason
A. Donenfeld.
10) Handle multicast properly in UDP socket early demux code. From Paolo
Abeni.
11) Several erspan bug fixes in ip_gre, from Xin Long.
12) Fix use-after-free in socket filter code, in order to handle the
fact that listener lock is no longer taken during the three-way TCP
handshake. From Eric Dumazet.
13) Fix infoleak in RTM_GETSTATS, from Nikolay Aleksandrov.
14) Fix tail call generation in x86-64 BPF JIT, from Alexei Starovoitov.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (77 commits)
net: 8021q: skip packets if the vlan is down
bpf: fix bpf_tail_call() x64 JIT
net: stmmac: dwmac-rk: Add RK3128 GMAC support
rndis_host: support Novatel Verizon USB730L
net: rtnetlink: fix info leak in RTM_GETSTATS call
socket, bpf: fix possible use after free
mlxsw: spectrum_router: Track RIF of IPIP next hops
mlxsw: spectrum_router: Move VRF refcounting
net: hns3: Fix an error handling path in 'hclge_rss_init_hw()'
net: mvpp2: Fix clock resource by adding an optional bus clock
r8152: add Linksys USB3GIGV1 id
l2tp: fix l2tp_eth module loading
ip_gre: erspan device should keep dst
ip_gre: set tunnel hlen properly in erspan_tunnel_init
ip_gre: check packet length and mtu correctly in erspan_xmit
ip_gre: get key from session_id correctly in erspan_rcv
tipc: use only positive error codes in messages
ppp: fix __percpu annotation
udp: perform source validation for mcast early demux
IPv4: early demux can return an error code
...
Merge misc fixes from Andrew Morton:
"A lot of stuff, sorry about that. A week on a beach, then a bunch of
time catching up then more time letting it bake in -next. Shan't do
that again!"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (51 commits)
include/linux/fs.h: fix comment about struct address_space
checkpatch: fix ignoring cover-letter logic
m32r: fix build failure
lib/ratelimit.c: use deferred printk() version
kernel/params.c: improve STANDARD_PARAM_DEF readability
kernel/params.c: fix an overflow in param_attr_show
kernel/params.c: fix the maximum length in param_get_string
mm/memory_hotplug: define find_{smallest|biggest}_section_pfn as unsigned long
mm/memory_hotplug: change pfn_to_section_nr/section_nr_to_pfn macro to inline function
kernel/kcmp.c: drop branch leftover typo
memremap: add scheduling point to devm_memremap_pages
mm, page_alloc: add scheduling point to memmap_init_zone
mm, memory_hotplug: add scheduling point to __add_pages
lib/idr.c: fix comment for idr_replace()
mm: memcontrol: use vmalloc fallback for large kmem memcg arrays
kernel/sysctl.c: remove duplicate UINT_MAX check on do_proc_douintvec_conv()
include/linux/bitfield.h: remove 32bit from FIELD_GET comment block
lib/lz4: make arrays static const, reduces object code size
exec: binfmt_misc: kill the onstack iname[BINPRM_BUF_SIZE] array
exec: binfmt_misc: fix race between load_misc_binary() and kill_node()
...
- A memory fix with left over code from spliting out ftrace_ops
and function graph tracer, where the function graph tracer could
reset the trampoline pointer, leaving the old trampoline not to
be freed (memory leak).
- The update to Paul's patch that added the unnecessary READ_ONCE().
This removes the unnecessary READ_ONCE() instead of having to rebase
the branch to update the patch that added it.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEEQEw9Eu0DdyUUkuUUybkF8mrZjcsFAlnU++sUHHJvc3RlZHRA
Z29vZG1pcy5vcmcACgkQybkF8mrZjcujzgf/ebIzGKe5vQKNrL4ITAcIz0T7Hvzl
pWw4uJp8kqO9x9EHMnztAkltQigvjvgDKZozJpUGgtNsFLuvdgQSBMK24YV8vLHs
UmXEnQ2tSB/2Sg2ccEnpjVXaMzL9aqlbeTmACbdd9UgZnvPiUYPejq2jFfECFQjb
k/gZT911ukBtx4mXYKzGFbTEZHdc/YUs6Y/wzB1ox5BBIUh71ZDZXxQTUHfXHlwS
Cst69/9dKl4nBEGDGas6/95iR+ORVv85osI/pqPtjSj4EkRnWfVRotaH1kNuSQil
gDIHSoy35NfXJx77/5IFHfrjFBAkr0IYRNL/jZaWazwM7rdqfAN8TwMQuA==
=4CtF
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.14-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixlets from Steven Rostedt:
"Two updates:
- A memory fix with left over code from spliting out ftrace_ops and
function graph tracer, where the function graph tracer could reset
the trampoline pointer, leaving the old trampoline not to be freed
(memory leak).
- The update to Paul's patch that added the unnecessary READ_ONCE().
This removes the unnecessary READ_ONCE() instead of having to
rebase the branch to update the patch that added it"
* tag 'trace-v4.14-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
rcu: Remove extraneous READ_ONCE()s from rcu_irq_{enter,exit}()
ftrace: Fix kmemleak in unregister_ftrace_graph
The function names made sense up to the point where the watchdog
(re)configuration was unified to use softlockup_reconfigure_threads() for
all configuration purposes. But that includes scenarios which solely
configure the nmi watchdog.
Rename softlockup_reconfigure_threads() and softlockup_init_threads() so
the function names match the functionality.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linuxfoundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Don Zickus <dzickus@redhat.com>
The rework of the core hotplug code triggers the WARN_ON in start_wd_cpu()
on powerpc because it is called multiple times for the boot CPU.
The first call is via:
start_wd_on_cpu+0x80/0x2f0
watchdog_nmi_reconfigure+0x124/0x170
softlockup_reconfigure_threads+0x110/0x130
lockup_detector_init+0xbc/0xe0
kernel_init_freeable+0x18c/0x37c
kernel_init+0x2c/0x160
ret_from_kernel_thread+0x5c/0xbc
And then again via the CPU hotplug registration:
start_wd_on_cpu+0x80/0x2f0
cpuhp_invoke_callback+0x194/0x620
cpuhp_thread_fun+0x7c/0x1b0
smpboot_thread_fn+0x290/0x2a0
kthread+0x168/0x1b0
ret_from_kernel_thread+0x5c/0xbc
This can be avoided by setting up the cpu hotplug state with nocalls and
move the initialization to the watchdog_nmi_probe() function. That
initializes the hotplug callbacks without invoking the callback and the
following core initialization function then configures the watchdog for the
online CPUs (in this case CPU0) via softlockup_reconfigure_threads().
Reported-and-tested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: linuxppc-dev@lists.ozlabs.org
Instead of dropping the cpu hotplug lock after stopping NMI watchdog and
threads and reaquiring for restart, the code and the protection rules
become more obvious when holding cpu hotplug lock across the full
reconfiguration.
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1710022105570.2114@nanos
The recent cleanup of the watchdog code split watchdog_nmi_reconfigure()
into two stages. One to stop the NMI and one to restart it after
reconfiguration. That was done by adding a boolean 'run' argument to the
code, which is functionally correct but not necessarily a piece of art.
Replace it by two explicit functions: watchdog_nmi_stop() and
watchdog_nmi_start().
Fixes: 6592ad2fcc ("watchdog/core, powerpc: Make watchdog_nmi_reconfigure() two stage")
Requested-by: Linus 'Nursing his pet-peeve' Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Thomas 'Mopping up garbage' Gleixner <tglx@linutronix.de>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1710021957480.2114@nanos
Function param_attr_show could overflow the buffer it is operating on.
The buffer size is PAGE_SIZE, and the string returned by
attribute->param->ops->get is generated by scnprintf(buffer, PAGE_SIZE,
...) so it could be PAGE_SIZE - 1 long, with the terminating '\0' at the
very end of the buffer. Calling strcat(..., "\n") on this isn't safe, as
the '\0' will be replaced by '\n' (OK) and then another '\0' will be added
past the end of the buffer (not OK.)
Simply add the trailing '\n' when writing the attribute contents to the
buffer originally. This is safe, and also faster.
Credits to Teradata for discovering this issue.
Link: http://lkml.kernel.org/r/20170928162602.60c379c7@endymion
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The length parameter of strlcpy() is supposed to reflect the size of the
target buffer, not of the source string. Harmless in this case as the
buffer is PAGE_SIZE long and the source string is always much shorter than
this, but conceptually wrong, so let's fix it.
Link: http://lkml.kernel.org/r/20170928162515.24846b4f@endymion
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The else branch been left over and escaped the source code refresh. Not
a problem but better clean it up.
Fixes: 0791e3644e ("kcmp: add KCMP_EPOLL_TFD mode to compare epoll target files")
Link: http://lkml.kernel.org/r/20170917165838.GA1887@uranus.lan
Reported-by: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>