Commit Graph

949021 Commits

Author SHA1 Message Date
Vincent Guittot 5a7f555904 sched/fair: Relax constraint on task's load during load balance
Some use cases, such as 9 always-running tasks on 8 CPUs, can never be
balanced, and the load balancer currently migrates the waiting task
between the CPUs in an almost random manner. Whether an rq succeeds in
pulling a task depends on the value of nr_balance_failed in its domains
and on its ability to detach the task faster than the others. This
behavior results in an unfair distribution of running time between
tasks, because some CPUs will run the same task most of the time, if
not always, whereas others will share their time between several tasks.

Instead of using nr_balance_failed as a boolean to relax the condition
for detaching a task, the load balancer now uses nr_balance_failed to
relax the threshold between the task's load and the imbalance. This
mechanism prevents the same rq or domain from always winning the
load-balance fight.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20200921072424.14813-2-vincent.guittot@linaro.org
2020-09-25 14:23:25 +02:00
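
A minimal sketch of the relaxed detach condition described above,
assuming it lives in detach_tasks()'s load-based path; the exact
condition shown is illustrative rather than the verbatim upstream diff:

  /*
   * Don't migrate a task whose load greatly exceeds the measured
   * imbalance -- unless this domain has repeatedly failed to balance,
   * in which case the constraint is relaxed so that another rq finally
   * gets a chance to pull the waiting task.
   */
  if (load / 2 > env->imbalance &&
      env->sd->nr_balance_failed <= env->sd->cache_nice_tries)
          goto next;

Since every failed balance attempt bumps nr_balance_failed, no single
rq or domain can keep winning the detach race indefinitely.
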
Xianting Tian fe7491580d sched/fair: Remove the force parameter of update_tg_load_avg()
In fair.c, update_tg_load_avg() is sometimes called as
update_tg_load_avg(cfs_rq, 0) and sometimes as
update_tg_load_avg(cfs_rq, false). The function has a force parameter,
but the current code never passes 1 or true for it, so remove the force
parameter.

Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200924014755.36253-1-tian.xianting@h3c.com
2020-09-25 14:23:25 +02:00
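
A sketch of the resulting signature change (assumed shape; every call
site simply drops its constant 0/false argument):

  /* Before: the force parameter existed but was always 0 or false. */
  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force);

  /* After: */
  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq);
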
Xunlei Pang df3cb4ea1f sched/fair: Fix wrong cpu selecting from isolated domain
In our production environment we have occasionally seen tasks with a
full cpumask (e.g. tasks placed into a cpuset, or with their affinity
set to all CPUs) migrated onto our isolated CPUs.

After some analysis, we found that it is due to the current
select_idle_smt() not considering the sched_domain mask.

Steps to reproduce on my 32-CPU hyper-threaded machine:
1. with boot parameter: "isolcpus=domain,2-31"
   (thread lists: 0,16 and 1,17)
2. cgcreate -g cpu:test; cgexec -g cpu:test "test_threads"
3. some threads will be migrated to the isolated CPUs 16-17.

Fix it by checking the valid domain mask in select_idle_smt().

Fixes: 10e2f1acd0 ("sched/core: Rewrite and improve select_idle_siblings()")
Reported-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/1600930127-76857-1-git-send-email-xlpang@linux.alibaba.com
2020-09-25 14:23:25 +02:00
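
A minimal sketch of the fix, assuming the LLC sched_domain is passed
down into select_idle_smt() so sibling candidates can be filtered
against its span (illustrative shape, not the verbatim diff):

  static int select_idle_smt(struct task_struct *p, struct sched_domain *sd,
                             int target)
  {
          int cpu;

          if (!static_branch_likely(&sched_smt_present))
                  return -1;

          for_each_cpu(cpu, cpu_smt_mask(target)) {
                  /*
                   * Skip CPUs outside the task's affinity, and -- the
                   * fix -- siblings outside the domain span, e.g. CPUs
                   * carved out with isolcpus=domain,...
                   */
                  if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
                      !cpumask_test_cpu(cpu, sched_domain_span(sd)))
                          continue;
                  if (available_idle_cpu(cpu))
                          return cpu;
          }

          return -1;
  }
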
YueHaibing 51bd5121c4 sched: Remove unused inline function uclamp_bucket_base_value()
There are no callers in the tree, so it can be removed.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/20200922132410.48440-1-yuehaibing@huawei.com
2020-09-25 14:23:25 +02:00
Daniel Bristot de Oliveira 2586af1ac1 sched/rt: Disable RT_RUNTIME_SHARE by default
The RT_RUNTIME_SHARE sched feature enables the sharing of rt_runtime
between CPUs, allowing a CPU to run a real-time task up to 100% of the
time while leaving more room for non-real-time tasks to run on the CPUs
that lent their rt_runtime.

The problem is that a CPU can easily borrow enough rt_runtime to allow
a spinning rt-task to run forever, starving per-cpu tasks like kworkers,
which are non-real-time by design.

This patch disables RT_RUNTIME_SHARE by default, avoiding this problem.
The feature will still be present for users who want to enable it,
though.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Wei Wang <wvw@google.com>
Link: https://lkml.kernel.org/r/b776ab46817e3db5d8ef79175fa0d71073c051c7.1600697903.git.bristot@redhat.com
2020-09-25 14:23:24 +02:00
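
The change plausibly amounts to flipping the feature's default in
kernel/sched/features.h (sketch; the surrounding entries are elided):

  SCHED_FEAT(RT_RUNTIME_SHARE, false)   /* previously: true */

Users who still want runtime sharing can flip it back at run time via
the sched_features debugfs interface (available with CONFIG_SCHED_DEBUG):

  echo RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features
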
Lucas Stach 46fcc4b00c sched/deadline: Fix stale throttling on de-/boosted tasks
When a boosted task gets throttled, what normally happens is that it's
immediately enqueued again with ENQUEUE_REPLENISH, which replenishes the
runtime and clears the dl_throttled flag. There is a special case however:
if the throttling happened on sched-out and the task has been deboosted in
the meantime, the replenish is skipped as the task will return to its
normal scheduling class. This leaves the task with the dl_throttled flag
set.

Now if the task gets boosted up to the deadline scheduling class again
while it is sleeping, it's still in the throttled state. The normal
wakeup, however, will enqueue the task without ENQUEUE_REPLENISH set, so
we don't actually place it on the rq. Thus we end up with a task that is
runnable but not actually on the rq; neither an immediate replenishment
happens, nor is the replenishment timer set up, so the task is stuck in
forever-throttled limbo.

Clear the dl_throttled flag before dropping back to the normal scheduling
class to fix this issue.

Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20200831110719.2126930-1-l.stach@pengutronix.de
2020-09-25 14:23:24 +02:00
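
A sketch of the fix's shape, assuming it lands in the deboost special
case of enqueue_task_dl() (illustrative, not the verbatim diff):

  } else if (!dl_prio(p->normal_prio)) {
          /*
           * The task is about to drop back to its normal scheduling
           * class: clear the stale throttled state now, so that a
           * later re-boost does not find dl_throttled still set.
           */
          p->dl.dl_throttled = 0;
          return;
  }
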
Vincent Guittot 8e0e0eda6a sched/numa: Use runnable_avg to classify node
Use runnable_avg to classify NUMA node state, similarly to what is done
for the normal load balancer. This helps ensure that the NUMA and normal
balancers use the same view of the state of the system.

Large arm64 system: 2 nodes / 224 CPUs:

  hackbench -l (256000/#grp) -g #grp

  grp    tip/sched/core         +patchset              improvement
  1      14.008(+/- 4.99 %)     13.800(+/- 3.88 %)     1.48 %
  4       4.340(+/- 5.35 %)      4.283(+/- 4.85 %)     1.33 %
  16      3.357(+/- 0.55 %)      3.359(+/- 0.54 %)    -0.06 %
  32      3.050(+/- 0.94 %)      3.039(+/- 1.06 %)     0.38 %
  64      2.968(+/- 1.85 %)      3.006(+/- 2.92 %)    -1.27 %
  128     3.290(+/-12.61 %)      3.108(+/- 5.97 %)     5.51 %
  256     3.235(+/- 3.95 %)      3.188(+/- 2.83 %)     1.45 %

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mel Gorman <mgorman@suse.de>
Link: https://lkml.kernel.org/r/20200921072959.16317-1-vincent.guittot@linaro.org
2020-09-25 14:23:24 +02:00
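
A hypothetical illustration of the classification idea (not the
upstream diff): mirror the regular load balancer's group_is_overloaded()
test by comparing a node's summed runnable_avg against its capacity
scaled by imbalance_pct. The helper name below is made up:

  static inline bool numa_node_overloaded(unsigned long runnable,
                                          unsigned long capacity,
                                          unsigned int imbalance_pct)
  {
          /* Same shape as the LB check: runnable * 100 vs capacity * pct. */
          return runnable * 100 > capacity * imbalance_pct;
  }
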
Valentin Schneider 848785df48 sched/topology: Move sd_flag_debug out of #ifdef CONFIG_SYSCTL
The last sd_flag_debug shuffle inadvertently moved its definition within
an #ifdef CONFIG_SYSCTL region. While CONFIG_SYSCTL is indeed required to
produce the sched domain ctl interface (which uses sd_flag_debug to output
flag names), it isn't required to run any assertion on the sched_domain
hierarchy itself.

Move the definition of sd_flag_debug to a CONFIG_SCHED_DEBUG region of
topology.c.

Now at long last we have:

- sd_flag_debug declared in include/linux/sched/topology.h iff
  CONFIG_SCHED_DEBUG=y
- sd_flag_debug defined in kernel/sched/topology.c, conditioned by:
  - CONFIG_SCHED_DEBUG, with an explicit #ifdef block
  - CONFIG_SMP, as a requirement to compile topology.c

With this change, all symbols pertaining to SD flag metadata (with the
exception of __SD_FLAG_CNT) are now defined exclusively within
topology.c.

Fixes: 8fca9494d4 ("sched/topology: Move sd_flag_debug out of linux/sched/topology.h")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200908184956.23369-1-valentin.schneider@arm.com
2020-09-09 10:09:03 +02:00
Daniel Bristot de Oliveira 153908ebc8 MAINTAINERS: Add myself as SCHED_DEADLINE reviewer
As discussed with Juri and Peter.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/4476a6da70949913a59dab9aacfbd12162c1fbd7.1599146667.git.bristot@redhat.com
2020-09-04 14:35:40 +02:00
Valentin Schneider 4fc472f121 sched/topology: Move SD_DEGENERATE_GROUPS_MASK out of linux/sched/topology.h
SD_DEGENERATE_GROUPS_MASK is only useful for sched/topology.c, but still
gets defined for anyone who imports topology.h, leading to a flurry of
unused variable warnings.

Move it out of the header and place it next to the SD degeneration
functions in sched/topology.c.

Fixes: 4ee4ea443a ("sched/topology: Introduce SD metaflag for flags needing > 1 groups")
Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200825133216.9163-2-valentin.schneider@arm.com
2020-08-26 12:41:59 +02:00
Valentin Schneider 8fca9494d4 sched/topology: Move sd_flag_debug out of linux/sched/topology.h
Defining an array in a header imported all over the place is clearly a
daft idea; that still didn't stop me from doing it.

Leave a declaration of sd_flag_debug in topology.h and move its definition
to sched/debug.c.

Fixes: b6e862f386 ("sched/topology: Define and assign sched_domain flag metadata")
Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200825133216.9163-1-valentin.schneider@arm.com
2020-08-26 12:41:59 +02:00
Sebastian Andrzej Siewior c1cecf884a sched: Cache task_struct::flags in sched_submit_work()
sched_submit_work() is considered to be a hot path. The preempt_disable()
call acts as a compiler barrier and forces the compiler to reload
task_struct::flags for the second comparison.
By using a local variable, the compiler can load the value once and keep it in
a register for the second comparison.

Verified on x86-64 with gcc-10.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200819200025.lqvmyefqnbok5i4f@linutronix.de
2020-08-26 12:41:58 +02:00
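
Roughly what the function looks like after the change (sketch; the blk
plug handling at the end of the function is elided):

  static inline void sched_submit_work(struct task_struct *tsk)
  {
          unsigned int task_flags;

          if (!tsk->state)
                  return;

          /* Load ->flags once; the preempt_disable() barrier below
           * would otherwise force a second load for the inner test. */
          task_flags = tsk->flags;
          if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
                  preempt_disable();
                  if (task_flags & PF_WQ_WORKER)
                          wq_worker_sleeping(tsk);
                  else
                          io_wq_worker_sleeping(tsk);
                  preempt_enable_no_resched();
          }
          /* ... blk_schedule_flush_plug() handling elided ... */
  }
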
Sebastian Andrzej Siewior 01ccf59236 sched: Bring the PF_IO_WORKER and PF_WQ_WORKER bits closer together
The bits PF_IO_WORKER and PF_WQ_WORKER are tested together in
sched_submit_work(), which is considered to be a hot path.
If the two bits cross an 8- or 16-bit boundary then most architectures
require multiple load instructions in order to create the constant
value. Also, such a value cannot be encoded within the compare opcode.

By moving the bit definitions within the same block, the compiler can
create/use one immediate value.

For some reason gcc-10 on ARM64 requires both bits to be next to each
other in order to issue "tst reg, val; bne label". Otherwise the result
is "mov reg1, val; tst reg, reg1; bne label".

Move PF_VCPU out of the way so that PF_IO_WORKER can be next to
PF_WQ_WORKER.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200819195505.y3fxk72sotnrkczi@linutronix.de
2020-08-26 12:41:58 +02:00
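
A sketch of the resulting flag layout (the constants match the 5.9-era
tree as far as I can tell, but treat the exact values as illustrative):

  #define PF_VCPU          0x00000001   /* I am a virtual CPU */
  #define PF_IDLE          0x00000002   /* I am an IDLE thread */
  #define PF_EXITING       0x00000004   /* Getting shut down */
  #define PF_IO_WORKER     0x00000010   /* Task is an IO worker */
  #define PF_WQ_WORKER     0x00000020   /* I'm a workqueue worker */

With the two worker bits adjacent, (PF_WQ_WORKER | PF_IO_WORKER) is the
single small constant 0x30, which fits in one immediate of the arm64
"tst" instruction.
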
Jiang Biao 1724b95b92 sched/fair: Simplify the work when reweighting entity
The code in reweight_entity() can be simplified.

For a sched entity that is on the rq, the full entity accounting can be
replaced by the cfs_rq instantaneous load updates that are currently
called from within that accounting.

An entity on the rq can never represent a task in reweight_entity() (a
task is always dequeued before this function is called), so the NUMA
task accounting and the rq->cfs_tasks list management of the entity
accounting are never invoked anyway; the change still avoids the
redundant cfs_rq->nr_running decrement/increment.

Signed-off-by: Jiang Biao <benbjiang@tencent.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20200811113209.34057-1-benbjiang@tencent.com
2020-08-26 12:41:58 +02:00
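
A sketch of the simplified function, assuming account_entity_dequeue()
and account_entity_enqueue() are replaced by direct load updates (the
PELT load_avg rescaling in the middle is elided):

  static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
                              unsigned long weight)
  {
          if (se->on_rq) {
                  /* commit outstanding execution time */
                  if (cfs_rq->curr == se)
                          update_curr(cfs_rq);
                  update_load_sub(&cfs_rq->load, se->load.weight);
          }
          dequeue_load_avg(cfs_rq, se);

          update_load_set(&se->load, weight);

          /* ... se->avg.load_avg rescaling elided ... */

          enqueue_load_avg(cfs_rq, se);
          if (se->on_rq)
                  update_load_add(&cfs_rq->load, se->load.weight);
  }
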
Lukasz Luba da0777d35f sched/fair: Fix wrong negative conversion in find_energy_efficient_cpu()
In find_energy_efficient_cpu() 'cpu_cap' could be less than 'util'.
This can happen because of RT or DL tasks (i.e. sched classes above
CFS), IRQ time, or the thermal pressure signal, all of which reduce the
capacity value.
In such a situation the result of 'cpu_cap - util' would be negative,
but it is stored in an unsigned long and therefore wraps to a huge
value. That value might then be compared with other unsigned longs once
uclamp_rq_util_with() has reduced 'util' enough that it passes the
fits_capacity() check.

Prevent this situation and make the arithmetic safer.

Fixes: 1d42509e47 ("sched/fair: Make EAS wakeup placement consider uclamp restrictions")
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20200810083004.26420-1-lukasz.luba@arm.com
2020-08-26 12:41:57 +02:00
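
An illustration of the hazard with hypothetical values, together with
the clamp-at-zero helper that kernel/sched/fair.c provides for exactly
this kind of subtraction:

  unsigned long cpu_cap = 400, util = 512;
  unsigned long spare = cpu_cap - util;   /* wraps to a huge value, not -112 */

  /* Subtract, but never drop below zero: */
  #define lsub_positive(_ptr, _val) do {                          \
          typeof(_ptr) ptr = (_ptr);                              \
          *ptr -= min_t(typeof(*ptr), *ptr, _val);                \
  } while (0)

  lsub_positive(&cpu_cap, util);          /* cpu_cap is now 0 */
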
Josh Don ec73240b16 sched/fair: Ignore cache hotness for SMT migration
SMT siblings share caches, so cache hotness should be irrelevant for
cross-sibling migration.

Signed-off-by: Josh Don <joshdon@google.com>
Proposed-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200804193413.510651-1-joshdon@google.com
2020-08-26 12:41:57 +02:00
Valentin Schneider 5f4a1c4ea4 sched/topology: Mark SD_NUMA as SDF_NEEDS_GROUPS
There would be no point in preserving a sched_domain with a single group
just because it has this flag set. Add it to SD_DEGENERATE_GROUPS_MASK.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-17-valentin.schneider@arm.com
2020-08-19 10:49:50 +02:00
Valentin Schneider 3551e954f5 sched/topology: Mark SD_OVERLAP as SDF_NEEDS_GROUPS
A sched_domain can only have overlapping sched_groups if it has more than
one group.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-16-valentin.schneider@arm.com
2020-08-19 10:49:50 +02:00
Valentin Schneider 33199b0143 sched/topology: Mark SD_ASYM_PACKING as SDF_NEEDS_GROUPS
Being a load-balancing flag, it requires 2+ groups to have any effect.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-15-valentin.schneider@arm.com
2020-08-19 10:49:49 +02:00
Valentin Schneider bdb7c802cc sched/topology: Mark SD_SERIALIZE as SDF_NEEDS_GROUPS
There would be no point in preserving a sched_domain with a single group
just because it has this flag set. Add it to SD_DEGENERATE_GROUPS_MASK.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-14-valentin.schneider@arm.com
2020-08-19 10:49:49 +02:00
Valentin Schneider 94b858fea1 sched/topology: Mark SD_BALANCE_WAKE as SDF_NEEDS_GROUPS
Even if no mainline topology uses this flag, it is a load balancing flag
just like SD_BALANCE_FORK and requires 2+ groups to have any effect.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-13-valentin.schneider@arm.com
2020-08-19 10:49:49 +02:00
Valentin Schneider 3a6712c768 sched/topology: Mark SD_PREFER_SIBLING as SDF_NEEDS_GROUPS
SD_PREFER_SIBLING is currently considered in sd_parent_degenerate() but not
in sd_degenerate(). It too hinges on load balancing, and thus won't have
any effect when set on a domain with a single group. Add it to
SD_DEGENERATE_GROUPS_MASK.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-12-valentin.schneider@arm.com
2020-08-19 10:49:49 +02:00
Valentin Schneider c200191d4c sched/topology: Propagate SD_ASYM_CPUCAPACITY upwards
We currently set this flag *only* on domains whose topology level exactly
matches the level where we detect asymmetry (as returned by
asym_cpu_capacity_level()). This is rather problematic.

Say there are two clusters in the system, one with a lone big CPU and the
other with a mix of big and LITTLE CPUs (as is allowed by DynamIQ):

  DIE [                ]
  MC  [             ][ ]
       0   1   2   3  4
       L   L   B   B  B

asym_cpu_capacity_level() will figure out that the MC level is the one
where all CPUs can see a CPU of max capacity, and we will thus set
SD_ASYM_CPUCAPACITY at MC level for all CPUs.

That lone big CPU will degenerate its MC domain, since it would be alone in
there, and will end up with just a DIE domain. Since the flag was only set
at MC, this CPU ends up not seeing any SD with the flag set, which is
broken.

Rather than clearing dflags at every topology level, clear it before
entering the topology level loop. This will properly propagate upwards
flags that are set starting from a certain level.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-11-valentin.schneider@arm.com
2020-08-19 10:49:49 +02:00
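
A sketch of the fix's shape in build_sched_domains() (illustrative; the
variable names tl, tl_asym and dflags are assumed from context):

  for_each_cpu(i, cpu_map) {
          /*
           * Cleared once per CPU rather than once per topology level,
           * so a flag set at some level stays set at all higher ones.
           */
          int dflags = 0;

          for (tl = sched_domain_topology; tl->mask; tl++) {
                  if (tl == tl_asym)
                          dflags |= SD_ASYM_CPUCAPACITY;
                  /* ... build the sched_domain for this level ... */
          }
  }
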
Valentin Schneider ab65afb094 sched/topology: Remove SD_SERIALIZE degeneration special case
If there is only a single NUMA node in the system, the only NUMA topology
level that will be generated will be NODE (identity distance), which
doesn't have SD_SERIALIZE.

This means we don't need this special case in sd_parent_degenerate(), as
having the NODE level "naturally" covers it. Thus, remove it.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-10-valentin.schneider@arm.com
2020-08-19 10:49:48 +02:00
Valentin Schneider 6f34981862 sched/topology: Use prebuilt SD flag degeneration mask
Leverage SD_DEGENERATE_GROUPS_MASK in sd_degenerate() and
sd_parent_degenerate().

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-9-valentin.schneider@arm.com
2020-08-19 10:49:48 +02:00
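
With the prebuilt mask, sd_degenerate() plausibly collapses to
something like this sketch (assumed shape, not the verbatim result):

  static int sd_degenerate(struct sched_domain *sd)
  {
          if (cpumask_weight(sched_domain_span(sd)) == 1)
                  return 1;

          /* The following flags only make sense with 2+ groups. */
          if ((sd->flags & SD_DEGENERATE_GROUPS_MASK) &&
              (sd->groups != sd->groups->next))
                  return 0;

          /* Flags that don't need groups at all. */
          if (sd->flags & SD_WAKE_AFFINE)
                  return 0;

          return 1;
  }
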
Valentin Schneider 4ee4ea443a sched/topology: Introduce SD metaflag for flags needing > 1 groups
In preparation of cleaning up the sd_degenerate*() functions, mark flags
used in sd_degenerate() with the new SDF_NEEDS_GROUPS flag. With this,
build a compile-time mask of those SD flags.

Note that sd_parent_degenerate() uses an extra flag in its mask,
SD_PREFER_SIBLING, which remains singled out for now.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-8-valentin.schneider@arm.com
2020-08-19 10:49:48 +02:00
Valentin Schneider 5b9f8ff7b3 sched/debug: Output SD flag names rather than their values
Decoding the output of /proc/sys/kernel/sched_domain/cpu*/domain*/flags has
always been somewhat annoying, as one needs to go fetch the bit -> name
mapping from the source code itself. This encoding can be saved in a script
somewhere, but that isn't safe from flags being added, removed or even
shuffled around.

What matters for debugging purposes is to get *which* flags are set in a
given domain, their associated value is pretty much meaningless.

Make the sd flags debug file output flag names.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-7-valentin.schneider@arm.com
2020-08-19 10:49:48 +02:00
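
A sketch (assumed shape) of the show routine: walk the set bits of the
domain's flags word and print the names recorded in the sd_flag_debug
metadata table instead of the raw value:

  static int sd_flags_show(struct seq_file *m, void *v)
  {
          unsigned long flags = *(unsigned int *)m->private;
          int idx;

          for_each_set_bit(idx, &flags, __SD_FLAG_CNT) {
                  seq_puts(m, sd_flag_debug[idx].name);
                  seq_puts(m, " ");
          }
          seq_puts(m, "\n");

          return 0;
  }
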
Valentin Schneider 65c5e25316 sched/topology: Verify SD_* flags setup when sched_debug is on
Now that we have some description of what we expect the flags layout to
be, we can use that to assert at runtime that the actual layout is sane.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-6-valentin.schneider@arm.com
2020-08-19 10:49:48 +02:00
Valentin Schneider b6e862f386 sched/topology: Define and assign sched_domain flag metadata
There are some expectations regarding how sched domain flags should be laid
out, but none of them are checked or asserted in
sched_domain_debug_one(). After staring at said flags for a while, I've
come to realize there are two repeating patterns:

- Shared with children: those flags are set from the base CPU domain
  upwards. Any domain that has it set will have it set in its children. It
  hints at "some property holds true / some behaviour is enabled until this
  level".

- Shared with parents: those flags are set from the topmost domain
  downwards. Any domain that has it set will have it set in its parents. It
  hints at "some property isn't visible / some behaviour is disabled until
  this level".

There are two outliers that (currently) do not map to either of these:

o SD_PREFER_SIBLING, which is cleared below levels with
  SD_ASYM_CPUCAPACITY. The change was introduced by commit:

    9c63e84db2 ("sched/core: Disable SD_PREFER_SIBLING on asymmetric CPU capacity domains")

  as it could break misfit migration on some systems. In light of this, we
  might want to change it back to make it fit one of the two categories and
  fix the issue another way.

o SD_ASYM_CPUCAPACITY, which gets set on a single level and isn't
  propagated up nor down. From a topology description point of view, it
  really wants to be SDF_SHARED_PARENT; this will be rectified in a later
  patch.

Tweak the sched_domain flag declaration to assign each flag an expected
layout, and include the rationale for each flag "meta type" assignment as a
comment. Consolidate the flag metadata into an array; the index of a flag's
metadata can easily be found with log2(flag), IOW __ffs(flag).

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-5-valentin.schneider@arm.com
2020-08-19 10:49:48 +02:00
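
A sketch of the declaration style and the metadata table (the flag and
meta-flag names are real; the abridged comments and pairings shown here
are illustrative):

  /* include/linux/sched/sd_flags.h */
  /* Balance when about to become idle -- behaviour enabled up to here. */
  SD_FLAG(SD_BALANCE_NEWIDLE, SDF_SHARED_CHILD)
  /* Serialize balancing -- property only visible from here upwards. */
  SD_FLAG(SD_SERIALIZE, SDF_SHARED_PARENT)

  /* Metadata array, indexed by __ffs(flag): */
  struct sd_flag_debug {
          unsigned int meta_flags;
          char *name;
  };
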
Valentin Schneider d54a9658a7 sched/topology: Split out SD_* flags declaration to its own file
To associate the SD flags with some metadata, we need some more structure
in the way they are declared.

Rather than shove that in a free-standing macro list, move the declaration
in a separate file that can be re-imported with different SD_FLAG
definitions. This is inspired by what is done with the syscall
table (see uapi/asm/unistd.h and sys_call_table).

The value assigned to a given SD flag now depends on the order it appears
in sd_flags.h. No change in functionality.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-4-valentin.schneider@arm.com
2020-08-19 10:49:47 +02:00
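
A sketch of the include-twice trick (assumed shape): sd_flags.h carries
one SD_FLAG(name) line per flag, and topology.h derives both an index
enum and the power-of-two flag values from the order of appearance:

  /* Generate SD flag indices */
  #define SD_FLAG(name) __##name,
  enum {
          #include <linux/sched/sd_flags.h>
          __SD_FLAG_CNT,
  };
  #undef SD_FLAG

  /* Generate SD flag bits from the indices */
  #define SD_FLAG(name) name = 1 << __##name,
  enum {
          #include <linux/sched/sd_flags.h>
  };
  #undef SD_FLAG
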
Valentin Schneider d23b3bf8e4 ARM, sched/topology: Revert back to default scheduler topology
The ARM-specific GMC level is meant to be built using the thread sibling
mask, but no devicetree in arch/arm/boot/dts uses the 'thread' cpu-map
binding. With SD_SHARE_POWERDOMAIN gone, this topology level can be
removed, at which point ARM no longer benefits from having a custom defined
topology table.

Delete the GMC topology level by making ARM use the default scheduler
topology table. This essentially reverts commit:

  fb2aa85564 ("sched, ARM: Create a dedicated scheduler topology table")

No change in functionality is expected.

Suggested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-3-valentin.schneider@arm.com
2020-08-19 10:49:47 +02:00
Valentin Schneider cfe7ddcbd7 ARM, sched/topology: Remove SD_SHARE_POWERDOMAIN
This flag was introduced in 2014 by commit:

  d77b3ed5c9 ("sched: Add a new SD_SHARE_POWERDOMAIN for sched_domain")

but AFAIA it was never leveraged by the scheduler. The closest thing I can
think of is EAS caring about frequency domains, and it does that by
leveraging performance domains.

Remove the flag. No change in functionality is expected.

Suggested-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-2-valentin.schneider@arm.com
2020-08-19 10:49:47 +02:00
Linus Torvalds 18445bf405 spi: Fixes for v5.9
A bunch of fixes that came in for SPI during the merge window, a bunch
 from ST and others for their controller, one from Lukas for a race
 between device addition and controller unregistration and one fix
 from Geert for the DT bindings which unbreaks validation.

Merge tag 'spi-fix-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi

Pull spi fixes from Mark Brown:
 "A bunch of fixes that came in for SPI during the merge window.

  Some from ST and others for their controller, one from Lukas for a
  race between device addition and controller unregistration and one
  fix from Geert for the DT bindings which unbreaks validation"

* tag 'spi-fix-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
  dt-bindings: lpspi: Add missing boolean type for fsl,spi-only-use-cs1-sel
  spi: stm32: always perform registers configuration prior to transfer
  spi: stm32: fixes suspend/resume management
  spi: stm32: fix stm32_spi_prepare_mbr in case of odd clk_rate
  spi: stm32: fix fifo threshold level in case of short transfer
  spi: stm32h7: fix race condition at end of transfer
  spi: stm32: clear only asserted irq flags on interrupt
  spi: Prevent adding devices below an unregistering controller
2020-08-18 14:27:12 -07:00
Linus Torvalds 9899b58758 Fix regression in IA-64 caused by page table allocation refactoring
The refactoring and consolidation of <asm/pgalloc.h> caused a regression
 on parisc and ia64. The fix for parisc made it into v5.9-rc1, while the
 fix for ia64 got delayed a bit; here it is.

Merge tag 'fixes-2020-08-18' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock

Pull ia64 page table fix from Mike Rapoport:
 "Fix regression in IA-64 caused by page table allocation refactoring

  The refactoring and consolidation of <asm/pgalloc.h> caused a
  regression on parisc and ia64. The fix for parisc made it into
  v5.9-rc1, while the fix for ia64 got delayed a bit; here it is"

* tag 'fixes-2020-08-18' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
  arch/ia64: Restore arch-specific pgd_offset_k implementation
2020-08-18 12:05:46 -07:00
Yang Shi b7333b58f3 mm/memory.c: skip spurious TLB flush for retried page fault
Recently we found a regression when running the will_it_scale/page_fault3
test on ARM64: over 70% down for the multi-process cases and over 20%
down for the multi-thread cases.  It turns out the regression is caused
by commit 89b15332af ("mm: drop mmap_sem before calling
balance_dirty_pages() in write fault").

The test mmaps a memory-sized file and then writes to the mapping; this
makes all memory dirty and triggers the dirty-page throttle, at which
point that upstream commit releases mmap_sem and retries the page fault.
The retried page fault sees the correct PTEs already installed and just
falls through to the spurious TLB flush.  The regression is caused by
this excessive spurious TLB flushing.  It is fine on x86, since x86's
spurious TLB flush is a no-op.

We could just skip the spurious TLB flush to mitigate the regression.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Xu Yu <xuyu@linux.alibaba.com>
Debugged-by: Xu Yu <xuyu@linux.alibaba.com>
Tested-by: Xu Yu <xuyu@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-18 12:02:27 -07:00
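
A sketch of the mitigation's shape in handle_pte_fault() (assumed
placement; FAULT_FLAG_TRIED marks a fault that is being retried):

  } else {
          /*
           * On a retried fault the PTE is already correct, so the
           * spurious flush is pure overhead (a no-op on x86, but
           * expensive on arm64) -- skip it.
           */
          if (vmf->flags & FAULT_FLAG_TRIED)
                  goto unlock;

          if (vmf->flags & FAULT_FLAG_WRITE)
                  flush_tlb_fix_spurious_fault(vmf->vma, vmf->address);
  }
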
Linus Torvalds 06a4ec1d9d mailmap alphabetizing and addition

Merge tag 'pstore-v5.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull mailmap update from Kees Cook:
 "This was originally part of my pstore tree, but when I realized that
  mailmap needed re-alphabetizing, I decided to wait until -rc1 to send
  this, as I saw a lot of mailmap additions pending in -next for the
  merge window.

  It's a programmatic reordering and the addition of a pstore
  contributor's preferred email address"

* tag 'pstore-v5.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  mailmap: Add WeiXiong Liao
  mailmap: Restore dictionary sorting
2020-08-17 17:15:23 -07:00
Linus Torvalds 4cf7562190 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:
 "Another batch of fixes:

  1) Remove nft_compat counter flush optimization, it generates warnings
     from the refcount infrastructure. From Florian Westphal.

  2) Fix BPF to search for build id more robustly, from Jiri Olsa.

  3) Handle bogus getopt lengths in ebtables, from Florian Westphal.

  4) Infoleak and other fixes to j1939 CAN driver, from Eric Dumazet and
     Oleksij Rempel.

  5) Reset iter properly on mptcp sendmsg() error, from Florian
     Westphal.

  6) Show a saner speed in bonding broadcast mode, from Jarod Wilson.

  7) Various kerneldoc fixes in bonding and elsewhere, from Lee Jones.

  8) Fix double unregister in bonding during namespace tear down, from
     Cong Wang.

  9) Disable RP filter during icmp_redirect selftest, from David Ahern"

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (75 commits)
  otx2_common: Use devm_kcalloc() in otx2_config_npa()
  net: qrtr: fix usage of idr in port assignment to socket
  selftests: disable rp_filter for icmp_redirect.sh
  Revert "net: xdp: pull ethernet header off packet after computing skb->protocol"
  phylink: <linux/phylink.h>: fix function prototype kernel-doc warning
  mptcp: sendmsg: reset iter on error redux
  net: devlink: Remove overzealous WARN_ON with snapshots
  tipc: not enable tipc when ipv6 works as a module
  tipc: fix uninit skb->data in tipc_nl_compat_dumpit()
  net: Fix potential wrong skb->protocol in skb_vlan_untag()
  net: xdp: pull ethernet header off packet after computing skb->protocol
  ipvlan: fix device features
  bonding: fix a potential double-unregister
  can: j1939: add rxtimer for multipacket broadcast session
  can: j1939: abort multipacket broadcast session when timeout occurs
  can: j1939: cancel rxtimer on multipacket broadcast session complete
  can: j1939: fix support for multipacket broadcast message
  net: fddi: skfp: cfm: Remove seemingly unused variable 'ID_sccs'
  net: fddi: skfp: cfm: Remove set but unused variable 'oldstate'
  net: fddi: skfp: smt: Remove seemingly unused variable 'ID_sccs'
  ...
2020-08-17 17:09:50 -07:00
Xu Wang bf2bcd6f1a otx2_common: Use devm_kcalloc() in otx2_config_npa()
The size of a memory allocation here is determined by multiplying an
element count by an element size, which indicates that an array data
structure is being allocated.
Thus use the corresponding array-allocation function, devm_kcalloc().

Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-17 15:08:39 -07:00
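
A hypothetical before/after of such a call site (variable names are
illustrative, not taken from otx2_config_npa()):

  /* Before: the open-coded multiply can overflow silently. */
  buf = devm_kzalloc(dev, count * sizeof(*buf), GFP_KERNEL);

  /* After: devm_kcalloc() checks the count * size product for overflow
   * and returns NULL rather than allocating a truncated buffer. */
  buf = devm_kcalloc(dev, count, sizeof(*buf), GFP_KERNEL);
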
Necip Fazil Yildiran 8dfddfb796 net: qrtr: fix usage of idr in port assignment to socket
Passing large uint32 sockaddr_qrtr.port numbers for port allocation
triggers a warning within idr_alloc() since the port number is cast
to int, and thus interpreted as a negative number. This leads to
the rejection of such valid port numbers in qrtr_port_assign() as
idr_alloc() fails.

To avoid the problem, switch to idr_alloc_u32() instead.

Fixes: bdabad3e36 ("net: Add Qualcomm IPC router")
Reported-by: syzbot+f31428628ef672716ea8@syzkaller.appspotmail.com
Signed-off-by: Necip Fazil Yildiran <necip@google.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-17 15:00:41 -07:00
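
A sketch of the problem and fix (assumed shape of the call site in
qrtr_port_assign()):

  /* Before: idr_alloc() takes the start ID as an int, so a large u32
   * port from sockaddr_qrtr.port becomes negative and is rejected. */
  rc = idr_alloc(&qrtr_ports, ipc, port, port + 1, GFP_ATOMIC);

  /* After: idr_alloc_u32() keeps the full u32 range. */
  rc = idr_alloc_u32(&qrtr_ports, ipc, &port, port, GFP_ATOMIC);
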
David Ahern bcf7ddb018 selftests: disable rp_filter for icmp_redirect.sh
h1 is initially configured to reach h2 via r1 rather than the
more direct path through r2. If rp_filter is set and inherited
for r2, forwarding fails, since r2 can reach h1's source address
via eth0 whereas the packet arrives via r1 on eth1.
Since the rp_filter setting affects the test, explicitly reset it.

Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-17 14:59:24 -07:00
Kees Cook 5a4fe06246 mailmap: Add WeiXiong Liao
WeiXiong Liao noted to me off-list that his preferred email address
had changed and that he'd like it updated in the mailmap so people
discussing pstore/blk would be able to reach him.

Cc: WeiXiong Liao <gmpy.liaowx@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-08-17 14:32:44 -07:00
Kees Cook d6bd5201f7 mailmap: Restore dictionary sorting
Several names had been recently appended (instead of inserted). While
git-shortlog doesn't need this file to be sorted, it helps humans to
keep it organized this way. Sort the entire file (which includes some
minor shuffling for dictionary order).

Done with the following commands:

	grep -E '^(#|$)' .mailmap > .mailmap.head
	grep -Ev '^(#|$)' .mailmap > .mailmap.body
	sort -f .mailmap.body > .mailmap.body.sort
	cat .mailmap.head .mailmap.body.sort > .mailmap
	rm .mailmap.head .mailmap.body.sort

Signed-off-by: Kees Cook <keescook@chromium.org>
2020-08-17 14:32:44 -07:00
Jessica Clarke bd05220c7b arch/ia64: Restore arch-specific pgd_offset_k implementation
IA-64 is special and treats pgd_offset_k() differently to pgd_offset(),
using different formulae to calculate the indices into the kernel and user
PGDs.  The index into the user PGDs takes into account the region number,
but the index into the kernel (init_mm) PGD always assumes a predefined
kernel region number. Commit 974b9b2c68 ("mm: consolidate pte_index() and
pte_offset_*() definitions") made IA-64 use a generic pgd_offset_k() which
incorrectly used pgd_index() for kernel page tables.  As a result, the
index into the kernel PGD was going out of bounds and the kernel hung
during early boot.

Allow overrides of pgd_offset_k() and override it on IA-64 with the old
implementation that will correctly index the kernel PGD.

Fixes: 974b9b2c68 ("mm: consolidate pte_index() and pte_offset_*() definitions")
Reported-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Signed-off-by: Jessica Clarke <jrtc27@jrtc27.com>
Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
2020-08-17 21:50:54 +03:00
David S. Miller 7f9bf6e824 Revert "net: xdp: pull ethernet header off packet after computing skb->protocol"
This reverts commit f8414a8d88.

eth_type_trans() does the necessary pull on the skb.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-17 11:48:05 -07:00
Randy Dunlap 0b76e642f9 phylink: <linux/phylink.h>: fix function prototype kernel-doc warning
Fix a kernel-doc warning for the pcs_config() function prototype:

../include/linux/phylink.h:406: warning: Excess function parameter 'permit_pause_to_mac' description in 'pcs_config'

Fixes: 7137e18f6f ("net: phylink: add struct phylink_pcs")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-17 11:46:22 -07:00
David Howells 29e44f4535 watch_queue: Limit the number of watches a user can hold
Impose a limit on the number of watches that a user can hold so that
they can't use this mechanism to fill up all the available memory.

This is done by putting a counter in user_struct that's incremented when
a watch is allocated and decremented when it is released.  If the number
exceeds the RLIMIT_NOFILE limit, the watch is rejected with EAGAIN.

This can be tested by the following means:

 (1) Create a watch queue and attach it to fd 5 in the program given - in
     this case, bash:

	keyctl watch_session /tmp/nlog /tmp/gclog 5 bash

 (2) In the shell, set the maximum number of files to, say, 99:

	ulimit -n 99

 (3) Add 200 keyrings:

	for ((i=0; i<200; i++)); do keyctl newring a$i @s || break; done

 (4) Try to watch all of the keyrings:

	for ((i=0; i<200; i++)); do echo $i; keyctl watch_add 5 %:a$i || break; done

     This should fail when the number of watches belonging to the user hits
     99.

 (5) Remove all the keyrings and all of those watches should go away:

	for ((i=0; i<200; i++)); do keyctl unlink %:a$i; done

 (6) Kill off the watch queue by exiting the shell spawned by
     watch_session.

Fixes: c73be61ced ("pipe: Add general notification queue support")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-17 09:39:18 -07:00
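
A sketch of the accounting described above (assumed shape; nr_watches
is the new per-user counter):

  if (atomic_inc_return(&watch->cred->user->nr_watches) >
      task_rlimit(current, RLIMIT_NOFILE)) {
          atomic_dec(&watch->cred->user->nr_watches);
          return -EAGAIN;         /* over the limit: reject the watch */
  }
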
Florian Westphal b3b2854dcf mptcp: sendmsg: reset iter on error redux
This fix wasn't correct: When this function is invoked from the
retransmission worker, the iterator contains garbage and resetting
it causes a crash.

As the work queue should not be performance critical, also zero the
msghdr struct.

Fixes: 3575938313 ("mptcp: sendmsg: reset iter on error")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-16 21:11:37 -07:00
Andrew Lunn bd71ea6067 net: devlink: Remove overzealous WARN_ON with snapshots
It is possible to trigger this WARN_ON from user space by triggering a
devlink snapshot with an ID which already exists. We don't need both
the -EEXIST error being reported and the kernel log being spammed.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Chris Healy <cphealy@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-16 21:07:06 -07:00
Xin Long c530189905 tipc: not enable tipc when ipv6 works as a module
Using ipv6_dev_find() in a module requires IPv6 itself not to be
built as a module. Otherwise, this error occurs at build time:

  undefined reference to `ipv6_dev_find'.

So fix it by adding "depends on IPV6 || IPV6=n" to tipc/Kconfig,
as it does in sctp/Kconfig.

Fixes: 5a6f6f5791 ("tipc: set ub->ifindex for local ipv6 address")
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-16 21:04:55 -07:00
Cong Wang 47733f9daf tipc: fix uninit skb->data in tipc_nl_compat_dumpit()
__tipc_nl_compat_dumpit() has two callers, and it expects them to
pass a valid nlmsghdr via arg->data. This header is artificial and
crafted just for __tipc_nl_compat_dumpit().

tipc_nl_compat_publ_dump() does so by putting a genlmsghdr as well
as some nested attribute, TIPC_NLA_SOCK. But the other caller
tipc_nl_compat_dumpit() does not, this leaves arg->data uninitialized
on this call path.

Fix this by just adding a similar nlmsghdr without any payload in
tipc_nl_compat_dumpit().

This bug has existed since day 1, but the recent commit 6ea67769ff
("net: tipc: prepare attrs in __tipc_nl_compat_dumpit()") makes it
easier to trigger.

Reported-and-tested-by: syzbot+0e7181deafa7e0b79923@syzkaller.appspotmail.com
Fixes: d0796d1ef6 ("tipc: convert legacy nl bearer dump to nl compat")
Cc: Jon Maloy <jmaloy@redhat.com>
Cc: Ying Xue <ying.xue@windriver.com>
Cc: Richard Alpe <richard.alpe@ericsson.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-16 21:03:19 -07:00