License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier                              # files
----------------------------------------------------|-------
GPL-2.0                                                11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that were:
SPDX license identifier                              # files
----------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                          930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier                              # files
----------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                          270
GPL-2.0+ WITH Linux-syscall-note                         169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)       21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)       17
LGPL-2.1+ WITH Linux-syscall-note                         15
GPL-1.0+ WITH Linux-syscall-note                          14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)       5
LGPL-2.0+ WITH Linux-syscall-note                          4
LGPL-2.1 WITH Linux-syscall-note                           3
((GPL-2.0 WITH Linux-syscall-note) OR MIT)                 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT)                1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with the SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
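The distinction between header and source files mentioned above can be sketched as follows. This is a minimal illustration, not the script used for the patches: the helper names `has_suffix` and `spdx_tag_line` are hypothetical, and the sketch assumes only the two comment styles the kernel documents for C files (a `//` line comment in `.c` files, a `/* */` block comment in `.h` files).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if 'name' ends with 'suffix'. */
static int has_suffix(const char *name, const char *suffix)
{
	size_t n = strlen(name), s = strlen(suffix);

	return n >= s && strcmp(name + n - s, suffix) == 0;
}

/*
 * Hypothetical helper: format the SPDX line for a file into 'buf'.
 * Headers get a block comment, .c files get a line comment.
 * Returns 0 on success, -1 for file types this sketch does not handle.
 */
static int spdx_tag_line(const char *filename, const char *license,
			 char *buf, size_t len)
{
	if (has_suffix(filename, ".h"))
		snprintf(buf, len, "/* SPDX-License-Identifier: %s */", license);
	else if (has_suffix(filename, ".c"))
		snprintf(buf, len, "// SPDX-License-Identifier: %s", license);
	else
		return -1;
	return 0;
}
```

Calling `spdx_tag_line("kernel/irq/affinity.c", "GPL-2.0", buf, sizeof(buf))` fills `buf` with the same line that opens the file shown below. Other file types (Makefiles, assembly, scripts) need other comment styles and are deliberately left out of the sketch.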
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 22:07:57 +08:00
// SPDX-License-Identifier: GPL-2.0
2017-06-20 07:37:55 +08:00
/*
 * Copyright (C) 2016 Thomas Gleixner.
 * Copyright (C) 2016-2017 Christoph Hellwig.
 */
2016-07-04 16:39:27 +08:00
#include <linux/interrupt.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/cpu.h>

2016-09-14 22:18:48 +08:00
static void irq_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
2019-02-17 01:13:07 +08:00
				unsigned int cpus_per_vec)
2016-09-14 22:18:48 +08:00
{
	const struct cpumask *siblmsk;
	int cpu, sibl;

	for ( ; cpus_per_vec > 0; ) {
		cpu = cpumask_first(nmsk);

		/* Should not happen, but I'm too lazy to think about it */
		if (cpu >= nr_cpu_ids)
			return;

		cpumask_clear_cpu(cpu, nmsk);
		cpumask_set_cpu(cpu, irqmsk);
		cpus_per_vec--;

		/* If the cpu has siblings, use them first */
		siblmsk = topology_sibling_cpumask(cpu);
		for (sibl = -1; cpus_per_vec > 0; ) {
			sibl = cpumask_next(sibl, siblmsk);
			if (sibl >= nr_cpu_ids)
				break;
			if (!cpumask_test_and_clear_cpu(sibl, nmsk))
				continue;
			cpumask_set_cpu(sibl, irqmsk);
			cpus_per_vec--;
		}
	}
}

2018-03-08 18:53:55 +08:00
static cpumask_var_t *alloc_node_to_cpumask(void)
2017-06-20 07:37:55 +08:00
{
	cpumask_var_t *masks;
	int node;

	masks = kcalloc(nr_node_ids, sizeof(cpumask_var_t), GFP_KERNEL);
	if (!masks)
		return NULL;

	for (node = 0; node < nr_node_ids; node++) {
		if (!zalloc_cpumask_var(&masks[node], GFP_KERNEL))
			goto out_unwind;
	}

	return masks;

out_unwind:
	while (--node >= 0)
		free_cpumask_var(masks[node]);
	kfree(masks);
	return NULL;
}

2018-03-08 18:53:55 +08:00
static void free_node_to_cpumask(cpumask_var_t *masks)
2017-06-20 07:37:55 +08:00
{
	int node;

	for (node = 0; node < nr_node_ids; node++)
		free_cpumask_var(masks[node]);
	kfree(masks);
}

2018-03-08 18:53:55 +08:00
static void build_node_to_cpumask(cpumask_var_t *masks)
2017-06-20 07:37:55 +08:00
{
	int cpu;

2018-01-12 10:53:05 +08:00
	for_each_possible_cpu(cpu)
2017-06-20 07:37:55 +08:00
		cpumask_set_cpu(cpu, masks[cpu_to_node(cpu)]);
}

2018-03-08 18:53:55 +08:00
static int get_nodes_in_cpumask(cpumask_var_t *node_to_cpumask,
2017-06-20 07:37:55 +08:00
				const struct cpumask *mask, nodemask_t *nodemsk)
2016-09-14 22:18:48 +08:00
{
2016-12-15 02:01:12 +08:00
	int n, nodes = 0;
2016-09-14 22:18:48 +08:00

	/* Calculate the number of nodes in the supplied affinity mask */
2017-06-20 07:37:55 +08:00
	for_each_node(n) {
2018-03-08 18:53:55 +08:00
		if (cpumask_intersects(mask, node_to_cpumask[n])) {
2016-09-14 22:18:48 +08:00
			node_set(n, *nodemsk);
			nodes++;
		}
	}
	return nodes;
}

2018-11-02 22:59:49 +08:00
static int __irq_build_affinity_masks(const struct irq_affinity *affd,
2019-02-17 01:13:07 +08:00
				      unsigned int startvec,
				      unsigned int numvecs,
				      unsigned int firstvec,
2018-12-18 23:06:53 +08:00
				      cpumask_var_t *node_to_cpumask,
				      const struct cpumask *cpu_mask,
				      struct cpumask *nmsk,
2018-12-04 23:51:20 +08:00
				      struct irq_affinity_desc *masks)
2016-09-14 22:18:48 +08:00
{
2019-02-17 01:13:07 +08:00
	unsigned int n, nodes, cpus_per_vec, extra_vecs, done = 0;
	unsigned int last_affv = firstvec + numvecs;
	unsigned int curvec = startvec;
2016-09-14 22:18:48 +08:00
	nodemask_t nodemsk = NODE_MASK_NONE;

genirq/affinity: Spread irq vectors among present CPUs as far as possible
Commit 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
tried to spread the interrupts across all possible CPUs to make sure that
in case of physical hotplug (e.g. virtualization) the CPUs which get
plugged in after the device was initialized are targeted by a hardware
queue and the corresponding interrupt.
This has a downside in cases where the ACPI tables claim that there are
more possible CPUs than present CPUs and the number of interrupts to spread
out is smaller than the number of possible CPUs. These bogus ACPI tables
are unfortunately not uncommon.
In such a case the vector spreading algorithm assigns interrupts to CPUs
which can never be utilized and as a consequence these interrupts are
unused instead of being mapped to present CPUs. As a result the performance
of the device is suboptimal.
To fix this, spread the interrupt vectors in two stages:
1) Spread as many interrupts as possible among the present CPUs
2) Spread the remaining vectors among non present CPUs
On an 8 core system, where CPU 0-3 are present and CPU 4-7 are not present,
for a device with 4 queues the resulting interrupt affinity is:
1) Before 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
irq 39, cpu list 0
irq 40, cpu list 1
irq 41, cpu list 2
irq 42, cpu list 3
2) With 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
irq 39, cpu list 0-2
irq 40, cpu list 3-4,6
irq 41, cpu list 5
irq 42, cpu list 7
3) With the refined vector spread applied:
irq 39, cpu list 0,4
irq 40, cpu list 1,6
irq 41, cpu list 2,5
irq 42, cpu list 3,7
On an 8 core system, where all CPUs are present the resulting interrupt
affinity for the 4 queues is:
irq 39, cpu list 0,1
irq 40, cpu list 2,3
irq 41, cpu list 4,5
irq 42, cpu list 6,7
This is independent of the number of CPUs which are online at the point of
initialization because in such a system the offline CPUs can be easily
onlined afterwards, while non-present CPUs need to be plugged physically
or virtually which requires external interaction.
The downside of this approach is that in case of physical hotplug the
interrupt vector spreading might be suboptimal when CPUs 4-7 are physically
plugged. Suboptimal from a NUMA point of view and due to the single target
nature of interrupt affinities the later plugged CPUs might not be targeted
by interrupts at all.
Though, physical hotplug systems are not the common case while the broken
ACPI table disease is widespread. So it's preferred to have as many
interrupts as possible utilized at the point where the device is
initialized.
Block multi-queue devices like NVME create a hardware queue per possible
CPU, so the goal of commit 84676c1f21 to assign one interrupt vector per
possible CPU is still achieved even with physical/virtual hotplug.
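The two-stage spread described above can be sketched in plain userspace C. This is a simplified model, not the kernel implementation: CPUs are bits in an `unsigned int`, `NR_CPUS`, `weight`, `spread` and `two_stage_spread` are names made up for the illustration, and NUMA nodes and sibling threads are ignored. Because siblings are ignored, the order in which non-present CPUs land on vectors can differ from the listing in the changelog.

```c
#define NR_CPUS 8	/* assumption: one bit per CPU in an unsigned int */

/* Count the CPUs set in a mask. */
static unsigned int weight(unsigned int mask)
{
	unsigned int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/*
 * Spread the CPUs in 'cpus' over the vector masks, starting at vector
 * 'curvec' (must be < nvecs) and wrapping around at 'nvecs'.
 * Returns the number of vectors that received CPUs.
 */
static unsigned int spread(unsigned int cpus, unsigned int *vec_mask,
			   unsigned int nvecs, unsigned int curvec)
{
	unsigned int ncpus, vecs, extra, v, cpu = 0;

	cpus &= (1u << NR_CPUS) - 1;
	ncpus = weight(cpus);
	if (!ncpus || !nvecs)
		return 0;

	vecs = ncpus < nvecs ? ncpus : nvecs;
	extra = ncpus % vecs;	/* CPUs left over after even division */

	for (v = 0; v < vecs; v++) {
		unsigned int cpus_per_vec = ncpus / vecs;

		if (extra) {
			cpus_per_vec++;
			extra--;
		}
		/* Take the next 'cpus_per_vec' CPUs that are set in 'cpus' */
		while (cpus_per_vec && cpu < NR_CPUS) {
			if (cpus & (1u << cpu)) {
				vec_mask[curvec] |= 1u << cpu;
				cpus_per_vec--;
			}
			cpu++;
		}
		if (++curvec == nvecs)
			curvec = 0;
	}
	return vecs;
}

/* Stage 1: present CPUs. Stage 2: possible but not present CPUs. */
static void two_stage_spread(unsigned int present, unsigned int possible,
			     unsigned int *vec_mask, unsigned int nvecs)
{
	unsigned int nr_present = spread(present, vec_mask, nvecs, 0);
	unsigned int curvec = nr_present >= nvecs ? 0 : nr_present;

	spread(possible & ~present, vec_mask, nvecs, curvec);
}
```

With four zeroed vector masks, `present = 0x0f` and `possible = 0xff`, every vector ends up with one present and one non-present CPU. With all eight CPUs present the masks become CPU pairs 0-1, 2-3, 4-5 and 6-7, which matches the second listing above.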
[ tglx: Changed from online to present CPUs for the first spreading stage,
renamed variables for readability sake, added comments and massaged
changelog ]
Reported-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Link: https://lkml.kernel.org/r/20180308105358.1506-5-ming.lei@redhat.com
2018-03-08 18:53:58 +08:00
	if (!cpumask_weight(cpu_mask))
		return 0;

2018-03-08 18:53:56 +08:00
	nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_mask, &nodemsk);
2016-09-14 22:18:48 +08:00

	/*
2016-12-15 02:01:12 +08:00
	 * If the number of nodes in the mask is greater than or equal the
2016-09-14 22:18:48 +08:00
	 * number of vectors we just spread the vectors across the nodes.
	 */
2018-03-08 18:53:57 +08:00
	if (numvecs <= nodes) {
2016-09-14 22:18:48 +08:00
		for_each_node_mask(n, nodemsk) {
2019-02-17 01:13:07 +08:00
			cpumask_or(&masks[curvec].mask, &masks[curvec].mask,
				   node_to_cpumask[n]);
2018-03-08 18:53:57 +08:00
			if (++curvec == last_affv)
2018-11-02 22:59:50 +08:00
				curvec = firstvec;
2016-09-14 22:18:48 +08:00
		}
2019-02-17 01:13:07 +08:00
		return numvecs;
2016-09-14 22:18:48 +08:00
	}

	for_each_node_mask(n, nodemsk) {
2019-02-17 01:13:07 +08:00
		unsigned int ncpus, v, vecs_to_assign, vecs_per_node;
2017-04-04 03:25:53 +08:00

		/* Spread the vectors per node */
2018-11-02 22:59:50 +08:00
		vecs_per_node = (numvecs - (curvec - firstvec)) / nodes;
2016-09-14 22:18:48 +08:00

		/* Get the cpus on this node which are in the mask */
2018-03-08 18:53:56 +08:00
		cpumask_and(nmsk, cpu_mask, node_to_cpumask[n]);
2016-09-14 22:18:48 +08:00

		/* Calculate the number of cpus per vector */
		ncpus = cpumask_weight(nmsk);
2017-04-04 03:25:53 +08:00
		vecs_to_assign = min(vecs_per_node, ncpus);

		/* Account for rounding errors */
2017-04-14 01:28:12 +08:00
		extra_vecs = ncpus - vecs_to_assign * (ncpus / vecs_to_assign);
2016-09-14 22:18:48 +08:00

2016-11-15 17:12:58 +08:00
		for (v = 0; curvec < last_affv && v < vecs_to_assign;
		     curvec++, v++) {
2016-09-14 22:18:48 +08:00
			cpus_per_vec = ncpus / vecs_to_assign;

			/* Account for extra vectors to compensate rounding errors */
			if (extra_vecs) {
				cpus_per_vec++;
2017-04-04 03:25:53 +08:00
				--extra_vecs;
2016-09-14 22:18:48 +08:00
			}
2018-12-04 23:51:20 +08:00
			irq_spread_init_one(&masks[curvec].mask, nmsk,
						cpus_per_vec);
2016-09-14 22:18:48 +08:00
		}

2018-03-08 18:53:57 +08:00
		done += v;
		if (done >= numvecs)
2016-09-14 22:18:48 +08:00
			break;
2018-03-08 18:53:57 +08:00
		if (curvec >= last_affv)
2018-11-02 22:59:50 +08:00
			curvec = firstvec;
2017-04-04 03:25:53 +08:00
		--nodes;
2016-09-14 22:18:48 +08:00
	}
2018-03-08 18:53:57 +08:00
	return done;
2018-03-08 18:53:56 +08:00
}
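The per-node loop above divides the vectors that are still unassigned by the nodes that are still unvisited, so a node with few CPUs leaves more vectors for the nodes after it. The following userspace sketch shows only that arithmetic. `plan_vectors` and its parameters are hypothetical names, and it tracks progress with a plain counter rather than `curvec - firstvec`, which is equivalent only as long as the vector index has not wrapped.

```c
/*
 * Hypothetical helper: given the number of CPUs on each node, compute how
 * many vectors each node would get. Returns the total number assigned.
 */
static unsigned int plan_vectors(unsigned int numvecs,
				 const unsigned int *node_cpus,
				 unsigned int nr_nodes,
				 unsigned int *node_vecs)
{
	unsigned int n, done = 0, nodes = nr_nodes;

	for (n = 0; n < nr_nodes; n++, nodes--) {
		unsigned int vecs_per_node, assign;

		if (done >= numvecs) {
			node_vecs[n] = 0;
			continue;
		}

		/* Remaining vectors divided by remaining nodes */
		vecs_per_node = (numvecs - done) / nodes;

		/* A node cannot use more vectors than it has CPUs */
		assign = vecs_per_node < node_cpus[n] ?
			 vecs_per_node : node_cpus[n];

		node_vecs[n] = assign;
		done += assign;
	}
	return done;
}
```

For 6 vectors and two nodes with 1 and 8 CPUs, the first node gets min(6 / 2, 1) = 1 vector and the second gets min((6 - 1) / 1, 8) = 5, so all 6 vectors are still used.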

2018-11-02 22:59:49 +08:00
/*
 * build affinity in two stages:
 *	1) spread present CPU on these vectors
 *	2) spread other possible CPUs on these vectors
 */
static int irq_build_affinity_masks(const struct irq_affinity *affd,
2019-02-17 01:13:07 +08:00
				    unsigned int startvec, unsigned int numvecs,
				    unsigned int firstvec,
2018-12-04 23:51:20 +08:00
				    struct irq_affinity_desc *masks)
2018-11-02 22:59:49 +08:00
{
2019-02-17 01:13:07 +08:00
	unsigned int curvec = startvec, nr_present, nr_others;
2019-01-25 17:53:43 +08:00
	cpumask_var_t *node_to_cpumask;
2019-02-17 01:13:07 +08:00
	cpumask_var_t nmsk, npresmsk;
	int ret = -ENOMEM;
2018-11-02 22:59:49 +08:00

	if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
2018-12-18 23:06:53 +08:00
		return ret;
2018-11-02 22:59:49 +08:00

	if (!zalloc_cpumask_var(&npresmsk, GFP_KERNEL))
2019-01-25 17:53:43 +08:00
		goto fail_nmsk;

	node_to_cpumask = alloc_node_to_cpumask();
	if (!node_to_cpumask)
		goto fail_npresmsk;
2018-11-02 22:59:49 +08:00

2018-11-02 22:59:51 +08:00
	ret = 0;
2018-11-02 22:59:49 +08:00
	/* Stabilize the cpumasks */
	get_online_cpus();
	build_node_to_cpumask(node_to_cpumask);

	/* Spread on present CPUs starting from affd->pre_vectors */
2018-11-02 22:59:51 +08:00
	nr_present = __irq_build_affinity_masks(affd, curvec, numvecs,
						firstvec, node_to_cpumask,
						cpu_present_mask, nmsk, masks);
2018-11-02 22:59:49 +08:00

	/*
	 * Spread on non present CPUs starting from the next vector to be
	 * handled. If the spreading of present CPUs already exhausted the
	 * vector space, assign the non present CPUs to the already spread
	 * out vectors.
	 */
2018-11-02 22:59:51 +08:00
	if (nr_present >= numvecs)
		curvec = firstvec;
2018-11-02 22:59:49 +08:00
	else
2018-11-02 22:59:51 +08:00
		curvec = firstvec + nr_present;
2018-11-02 22:59:49 +08:00
	cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask);
2018-11-02 22:59:51 +08:00
	nr_others = __irq_build_affinity_masks(affd, curvec, numvecs,
					       firstvec, node_to_cpumask,
					       npresmsk, nmsk, masks);
2018-11-02 22:59:49 +08:00
	put_online_cpus();

2018-11-02 22:59:51 +08:00
	if (nr_present < numvecs)
2018-12-18 23:06:53 +08:00
		WARN_ON(nr_present + nr_others < numvecs);
2018-11-02 22:59:51 +08:00

2019-01-25 17:53:43 +08:00
	free_node_to_cpumask(node_to_cpumask);

fail_npresmsk:
2018-11-02 22:59:49 +08:00
	free_cpumask_var(npresmsk);

2019-01-25 17:53:43 +08:00
fail_nmsk:
2018-11-02 22:59:49 +08:00
	free_cpumask_var(nmsk);
2018-11-02 22:59:51 +08:00
	return ret;
2018-11-02 22:59:49 +08:00
}

2018-03-08 18:53:56 +08:00
/**
 * irq_create_affinity_masks - Create affinity masks for multiqueue spreading
 * @nvecs:	The total number of vectors
 * @affd:	Description of the affinity requirements
 *
2018-12-04 23:51:20 +08:00
 * Returns the irq_affinity_desc pointer or NULL if allocation failed.
2018-03-08 18:53:56 +08:00
 */
2018-12-04 23:51:20 +08:00
struct irq_affinity_desc *
genirq/affinity: Store interrupt sets size in struct irq_affinity
The interrupt affinity spreading mechanism supports spreading out
affinities for one or more interrupt sets. An interrupt set contains one
or more interrupts. Each set is mapped to a specific functionality of a
device, e.g. general I/O queues and read I/O queues of multiqueue block
devices.
The number of interrupts per set is defined by the driver. It depends on
the total number of available interrupts for the device, which is
determined by the PCI capabilities and the availability of underlying CPU
resources, and the number of queues which the device provides and the
driver wants to instantiate.
The driver passes initial configuration for the interrupt allocation via
a pointer to struct irq_affinity.
Right now the allocation mechanism is complex as it requires having a
loop in the driver to determine the maximum number of interrupts which
are provided by the PCI capabilities and the underlying CPU resources.
This loop would have to be replicated in every driver which wants to
utilize this mechanism. That's unwanted code duplication and error
prone.
In order to move this into generic facilities it is required to have a
mechanism, which allows the recalculation of the interrupt sets and
their size, in the core code. As the core code does not have any
knowledge about the underlying device, a driver specific callback will
be added to struct affinity_desc, which will be invoked by the core
code. The callback will get the number of available interrupts as an
argument, so the driver can calculate the corresponding number and size
of interrupt sets.
To support this, two modifications for the handling of struct irq_affinity
are required:
1) The (optional) interrupt sets size information is contained in a
separate array of integers and struct irq_affinity contains a
pointer to it.
This is cumbersome and as the maximum number of interrupt sets is small,
there is no reason to have separate storage. Moving the size array into
struct affinity_desc avoids indirections and makes the code simpler.
2) At the moment the struct irq_affinity pointer which is handed in from
the driver and passed through to several core functions is marked
'const'.
With the upcoming callback to recalculate the number and size of
interrupt sets, it's necessary to remove the 'const'
qualifier. Otherwise the callback would not be able to update the data.
Implement #1 and store the interrupt sets size in 'struct irq_affinity'.
No functional change.
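The difference between a separately stored size array and an embedded one can be sketched in userspace C. The struct and field names below (`affinity_ptr`, `affinity_embedded`, `sets`, `MAX_SETS`) are stand-ins chosen for the illustration, not the kernel definitions, and the value 4 for `MAX_SETS` is an assumption of the sketch.

```c
#include <string.h>

#define MAX_SETS 4	/* assumption: stand-in for IRQ_AFFINITY_MAX_SETS */

/* Before: the sizes live in separate storage referenced by a pointer. */
struct affinity_ptr {
	unsigned int nr_sets;
	const unsigned int *sets;
};

/* After: the sizes are embedded, so no extra storage or indirection. */
struct affinity_embedded {
	unsigned int nr_sets;
	unsigned int set_size[MAX_SETS];
};

/*
 * Convert the pointer form into the embedded form. Only 'nr_sets'
 * entries are copied, so the copy never reads beyond the source array.
 * Returns 0 on success, -1 if there are more sets than fit.
 */
static int embed_sets(struct affinity_embedded *dst,
		      const struct affinity_ptr *src)
{
	if (src->nr_sets > MAX_SETS)
		return -1;

	dst->nr_sets = src->nr_sets;
	memset(dst->set_size, 0, sizeof(dst->set_size));
	if (src->nr_sets)
		memcpy(dst->set_size, src->sets,
		       src->nr_sets * sizeof(src->sets[0]));
	return 0;
}
```

Sizing the copy by `nr_sets` rather than by the destination array is the same concern the memcpy() note in the changelog below refers to. The bound check mirrors the `WARN_ON_ONCE(affd->nr_sets > IRQ_AFFINITY_MAX_SETS)` test in the code further down.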
[ tglx: Fixed the memcpy() size so it won't copy beyond the size of the
source. Fixed the kernel doc comments for struct irq_affinity and
de-'This patch'-ed the changelog ]
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bjorn Helgaas <helgaas@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: linux-pci@vger.kernel.org
Cc: Keith Busch <keith.busch@intel.com>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com>
Link: https://lkml.kernel.org/r/20190216172228.423723127@linutronix.de
2019-02-17 01:13:08 +08:00
irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd)
2018-03-08 18:53:56 +08:00
{
2019-02-17 01:13:07 +08:00
	unsigned int affvecs, curvec, usedvecs, nr_sets, i;
2019-02-17 01:13:08 +08:00
	unsigned int set_size[IRQ_AFFINITY_MAX_SETS];
2018-12-04 23:51:20 +08:00
	struct irq_affinity_desc *masks = NULL;
2018-03-08 18:53:56 +08:00

	/*
	 * If there aren't any vectors left after applying the pre/post
	 * vectors don't bother with assigning affinity.
	 */
	if (nvecs == affd->pre_vectors + affd->post_vectors)
		return NULL;

2019-02-17 01:13:08 +08:00
	if (WARN_ON_ONCE(affd->nr_sets > IRQ_AFFINITY_MAX_SETS))
		return NULL;

2018-03-08 18:53:56 +08:00
	masks = kcalloc(nvecs, sizeof(*masks), GFP_KERNEL);
	if (!masks)
2019-01-25 17:53:43 +08:00
		return NULL;
2018-03-08 18:53:56 +08:00

	/* Fill out vectors at the beginning that don't need affinity */
	for (curvec = 0; curvec < affd->pre_vectors; curvec++)
2018-12-04 23:51:20 +08:00
		cpumask_copy(&masks[curvec].mask, irq_default_affinity);
2018-11-02 22:59:51 +08:00
	/*
	 * Spread on present CPUs starting from affd->pre_vectors. If we
	 * have multiple sets, build each sets affinity mask separately.
	 */
2019-02-17 01:13:07 +08:00
	affvecs = nvecs - affd->pre_vectors - affd->post_vectors;
2018-11-02 22:59:51 +08:00
	nr_sets = affd->nr_sets;
genirq/affinity: Store interrupt sets size in struct irq_affinity
The interrupt affinity spreading mechanism supports spreading out
affinities for one or more interrupt sets. An interrupt set contains one
or more interrupts. Each set is mapped to a specific functionality of a
device, e.g. general I/O queues and read I/O queues of multiqueue block
devices.
The number of interrupts per set is defined by the driver. It depends on
the total number of available interrupts for the device, which is
determined by the PCI capabilities and the availability of underlying CPU
resources, and the number of queues which the device provides and the
driver wants to instantiate.
The driver passes initial configuration for the interrupt allocation via
a pointer to struct irq_affinity.
Right now the allocation mechanism is complex as it requires a
loop in the driver to determine the maximum number of interrupts which
are provided by the PCI capabilities and the underlying CPU resources.
This loop would have to be replicated in every driver which wants to
utilize this mechanism. That's unwanted code duplication and error
prone.
In order to move this into generic facilities it is required to have a
mechanism, which allows the recalculation of the interrupt sets and
their size, in the core code. As the core code does not have any
knowledge about the underlying device, a driver specific callback will
be added to struct irq_affinity, which will be invoked by the core
code. The callback will get the number of available interrupts as an
argument, so the driver can calculate the corresponding number and size
of interrupt sets.
To support this, two modifications for the handling of struct irq_affinity
are required:
1) The (optional) interrupt sets size information is contained in a
separate array of integers and struct irq_affinity contains a
pointer to it.
This is cumbersome and as the maximum number of interrupt sets is small,
there is no reason to have separate storage. Moving the size array into
struct irq_affinity avoids indirections and makes the code simpler.
2) At the moment the struct irq_affinity pointer which is handed in from
the driver and passed through to several core functions is marked
'const'.
With the upcoming callback to recalculate the number and size of
interrupt sets, it's necessary to remove the 'const'
qualifier. Otherwise the callback would not be able to update the data.
Implement #1 and store the interrupt sets size in 'struct irq_affinity'.
No functional change.
[ tglx: Fixed the memcpy() size so it won't copy beyond the size of the
source. Fixed the kernel doc comments for struct irq_affinity and
de-'This patch'-ed the changelog ]
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bjorn Helgaas <helgaas@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: linux-pci@vger.kernel.org
Cc: Keith Busch <keith.busch@intel.com>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com>
Link: https://lkml.kernel.org/r/20190216172228.423723127@linutronix.de
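A minimal sketch of modification 1) from the changelog above, not the kernel's actual definition: the set sizes live in a fixed array inside the descriptor, so the core code can take a bounded local copy without following a pointer. The names affinity_sketch, MAX_SETS and snapshot_set_sizes are hypothetical, and the limit of 4 sets is an assumed value chosen for illustration.

```c
#include <assert.h>
#include <string.h>

#define MAX_SETS 4	/* assumed small upper bound, illustration only */

/* Hypothetical stand-in for the descriptor a driver hands to the core. */
struct affinity_sketch {
	unsigned int pre_vectors;
	unsigned int post_vectors;
	unsigned int nr_sets;
	unsigned int set_size[MAX_SETS];	/* embedded, no separate storage */
};

/*
 * Copy the driver-provided set sizes into a local array. With no sets
 * configured, all spreadable vectors are treated as one set. Returns the
 * number of sets written to 'out', which must hold MAX_SETS entries.
 */
static unsigned int snapshot_set_sizes(const struct affinity_sketch *affd,
				       unsigned int affvecs,
				       unsigned int out[MAX_SETS])
{
	unsigned int nr_sets = affd->nr_sets;

	assert(nr_sets <= MAX_SETS);

	if (!nr_sets) {
		out[0] = affvecs;
		return 1;
	}

	/* Copy only the entries in use, never beyond the source array. */
	memcpy(out, affd->set_size, nr_sets * sizeof(out[0]));
	return nr_sets;
}
```

Because the array is part of the structure, a caller can initialise it inline, e.g. with .nr_sets = 2 and .set_size = { 3, 5 }, and the copy length is bounded by the number of sets in use.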
2019-02-17 01:13:08 +08:00
	if (!nr_sets) {
2018-11-02 22:59:51 +08:00
		nr_sets = 1;
2019-02-17 01:13:08 +08:00
		set_size[0] = affvecs;
	} else {
		memcpy(set_size, affd->set_size, nr_sets * sizeof(unsigned int));
	}
2018-11-02 22:59:51 +08:00

	for (i = 0, usedvecs = 0; i < nr_sets; i++) {
2019-02-17 01:13:08 +08:00
		unsigned int this_vecs = set_size[i];
2018-11-02 22:59:51 +08:00
		int ret;

		ret = irq_build_affinity_masks(affd, curvec, this_vecs,
2019-02-17 01:13:07 +08:00
					       curvec, masks);
2018-11-02 22:59:51 +08:00
		if (ret) {
2018-12-18 23:06:53 +08:00
			kfree(masks);
2019-01-25 17:53:43 +08:00
			return NULL;
2018-11-02 22:59:51 +08:00
		}
		curvec += this_vecs;
		usedvecs += this_vecs;
	}
2016-11-09 09:15:03 +08:00

	/* Fill out vectors at the end that don't need affinity */
genirq/affinity: Spread irq vectors among present CPUs as far as possible
Commit 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
tried to spread the interrupts across all possible CPUs to make sure that
in case of physical hotplug (e.g. virtualization) the CPUs which get
plugged in after the device was initialized are targeted by a hardware
queue and the corresponding interrupt.
This has a downside in cases where the ACPI tables claim that there are
more possible CPUs than present CPUs and the number of interrupts to spread
out is smaller than the number of possible CPUs. These bogus ACPI tables
are unfortunately not uncommon.
In such a case the vector spreading algorithm assigns interrupts to CPUs
which can never be utilized and as a consequence these interrupts are
unused instead of being mapped to present CPUs. As a result the performance
of the device is suboptimal.
To fix this, spread the interrupt vectors in two stages:
1) Spread as many interrupts as possible among the present CPUs
2) Spread the remaining vectors among non-present CPUs
On an 8-core system, where CPUs 0-3 are present and CPUs 4-7 are not present,
for a device with 4 queues the resulting interrupt affinity is:
1) Before 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
irq 39, cpu list 0
irq 40, cpu list 1
irq 41, cpu list 2
irq 42, cpu list 3
2) With 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
irq 39, cpu list 0-2
irq 40, cpu list 3-4,6
irq 41, cpu list 5
irq 42, cpu list 7
3) With the refined vector spread applied:
irq 39, cpu list 0,4
irq 40, cpu list 1,6
irq 41, cpu list 2,5
irq 42, cpu list 3,7
On an 8-core system, where all CPUs are present, the resulting interrupt
affinity for the 4 queues is:
irq 39, cpu list 0,1
irq 40, cpu list 2,3
irq 41, cpu list 4,5
irq 42, cpu list 6,7
This is independent of the number of CPUs which are online at the point of
initialization because in such a system the offline CPUs can be easily
onlined afterwards, while non-present CPUs need to be plugged physically
or virtually, which requires external interaction.
The downside of this approach is that in case of physical hotplug the
interrupt vector spreading might be suboptimal when CPUs 4-7 are physically
plugged. It is suboptimal from a NUMA point of view, and due to the single
target nature of interrupt affinities the later plugged CPUs might not be
targeted by interrupts at all.
However, physical hotplug systems are not the common case while the broken
ACPI table disease is widespread. So it's preferred to have as many
interrupts as possible utilized at the point where the device is
initialized.
Block multi-queue devices like NVME create a hardware queue per possible
CPU, so the goal of commit 84676c1f21 to assign one interrupt vector per
possible CPU is still achieved even with physical/virtual hotplug.
[ tglx: Changed from online to present CPUs for the first spreading stage,
renamed variables for readability sake, added comments and massaged
changelog ]
Reported-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Link: https://lkml.kernel.org/r/20180308105358.1506-5-ming.lei@redhat.com
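The two-stage idea from the changelog above can be sketched roughly as follows. This is a simplified illustration, not the kernel's algorithm: it ignores NUMA nodes, represents CPU masks as 32-bit words (bit n stands for CPU n), and hands out CPUs in contiguous chunks. The helper names spread_stage and spread_two_stage are hypothetical. Because of these simplifications it does not necessarily reproduce the exact pairings listed for the "CPUs 0-3 present" case.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Distribute the CPUs set in 'cpus' over the vector masks in contiguous
 * chunks, starting at *curvec and wrapping around at nvecs. *curvec is
 * updated so that a following stage continues where this one stopped.
 */
static void spread_stage(uint32_t cpus, unsigned int nvecs,
			 unsigned int *curvec, uint32_t *masks)
{
	unsigned int ncpus = 0, per_vec, cpu, taken = 0;

	for (cpu = 0; cpu < 32; cpu++)
		ncpus += (cpus >> cpu) & 1u;
	if (!ncpus)
		return;

	/* CPUs per vector, rounded up so that every CPU gets a vector. */
	per_vec = (ncpus + nvecs - 1) / nvecs;

	for (cpu = 0; cpu < 32; cpu++) {
		if (!((cpus >> cpu) & 1u))
			continue;
		masks[*curvec] |= (uint32_t)1 << cpu;
		if (++taken == per_vec) {
			taken = 0;
			*curvec = (*curvec + 1) % nvecs;
		}
	}
}

/* Stage 1: present CPUs. Stage 2: possible but not present CPUs. */
static void spread_two_stage(uint32_t possible, uint32_t present,
			     unsigned int nvecs, uint32_t *masks)
{
	unsigned int curvec = 0, i;

	assert(nvecs > 0);

	for (i = 0; i < nvecs; i++)
		masks[i] = 0;

	spread_stage(possible & present, nvecs, &curvec, masks);
	spread_stage(possible & ~present, nvecs, &curvec, masks);
}
```

With 8 possible CPUs, 4 vectors and all CPUs present, this sketch assigns CPUs 0-1, 2-3, 4-5 and 6-7 to the four vectors. With only CPUs 0-3 present, each vector gets one present CPU in the first stage and one non-present CPU in the second.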
2018-03-08 18:53:58 +08:00
	if (usedvecs >= affvecs)
		curvec = affd->pre_vectors + affvecs;
	else
		curvec = affd->pre_vectors + usedvecs;
2016-11-09 09:15:03 +08:00
	for (; curvec < nvecs; curvec++)
2018-12-04 23:51:20 +08:00
		cpumask_copy(&masks[curvec].mask, irq_default_affinity);
2018-03-08 18:53:58 +08:00

2018-12-04 23:51:21 +08:00
	/* Mark the managed interrupts */
	for (i = affd->pre_vectors; i < nvecs - affd->post_vectors; i++)
		masks[i].is_managed = 1;

2016-09-14 22:18:48 +08:00
	return masks;
}
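
The function that ends here divides the vector index space into three ranges: pre_vectors at the start, the spread sets in the middle, and post_vectors at the end. The following stand-alone sketch mirrors that layout under simplifying assumptions. The names vec_info, VEC_DEFAULT and layout_vectors are hypothetical, the actual CPU mask building is replaced by recording a set index, and the clamp against affvecs in the inner loop is an addition of this sketch.

```c
#include <assert.h>

#define MAX_SETS 4	/* assumed upper bound, illustration only */

enum { VEC_DEFAULT = -1 };

/* Hypothetical per-vector record: owning set (or default) and managed flag. */
struct vec_info {
	int set;
	int is_managed;
};

static void layout_vectors(unsigned int nvecs, unsigned int pre,
			   unsigned int post, const unsigned int *set_size,
			   unsigned int nr_sets, struct vec_info *out)
{
	unsigned int affvecs, curvec, usedvecs = 0, i, v;

	assert(pre + post <= nvecs && nr_sets <= MAX_SETS);
	affvecs = nvecs - pre - post;

	/* Start with every vector on the default affinity and unmanaged. */
	for (curvec = 0; curvec < nvecs; curvec++) {
		out[curvec].set = VEC_DEFAULT;
		out[curvec].is_managed = 0;
	}

	/* Give each set a consecutive range, starting right after 'pre'. */
	curvec = pre;
	for (i = 0; i < nr_sets; i++) {
		/* Clamp to affvecs so a bad set size cannot overrun 'out'. */
		for (v = 0; v < set_size[i] && usedvecs < affvecs; v++, usedvecs++)
			out[curvec++].set = (int)i;
	}

	/* Everything between the pre and post ranges is marked managed. */
	for (curvec = pre; curvec < nvecs - post; curvec++)
		out[curvec].is_managed = 1;
}
```

For 8 vectors with pre = 1, post = 1 and set sizes { 2, 3 }, vectors 1-2 belong to set 0, vectors 3-5 to set 1, vector 6 keeps the default affinity but is still marked managed, and vectors 0 and 7 are unmanaged.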

/**
2016-11-09 09:15:02 +08:00
 * irq_calc_affinity_vectors - Calculate the optimal number of vectors
2017-05-19 01:47:47 +08:00
 * @minvec:	The minimum number of vectors available
2016-11-09 09:15:02 +08:00
 * @maxvec:	The maximum number of vectors available
 * @affd:	Description of the affinity requirements
2016-09-14 22:18:48 +08:00
 */
2019-02-17 01:13:07 +08:00
unsigned int irq_calc_affinity_vectors(unsigned int minvec, unsigned int maxvec,
				       const struct irq_affinity *affd)
2016-09-14 22:18:48 +08:00
{
2019-02-17 01:13:07 +08:00
	unsigned int resv = affd->pre_vectors + affd->post_vectors;
	unsigned int set_vecs;
2016-09-14 22:18:48 +08:00

2017-05-19 01:47:47 +08:00
	if (resv > minvec)
		return 0;

2018-11-02 22:59:51 +08:00
	if (affd->nr_sets) {
2019-02-17 01:13:07 +08:00
		unsigned int i;
2018-11-02 22:59:51 +08:00

		for (i = 0, set_vecs = 0; i < affd->nr_sets; i++)
2019-02-17 01:13:08 +08:00
			set_vecs += affd->set_size[i];
2018-11-02 22:59:51 +08:00
	} else {
		get_online_cpus();
		set_vecs = cpumask_weight(cpu_possible_mask);
		put_online_cpus();
	}

2019-02-17 01:13:07 +08:00
	return resv + min(set_vecs, maxvec - resv);
2016-09-14 22:18:48 +08:00
}
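
The arithmetic of the function above can be tried out in isolation. The sketch below is a user-space restatement under stated assumptions: calc_vectors and min_uint are hypothetical names, resv stands for pre_vectors + post_vectors, and set_vecs stands for either the sum of the set sizes or the number of possible CPUs, which the caller is assumed to have computed already.

```c
#include <assert.h>

static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Return the number of vectors to request: the reserved vectors plus as
 * many spreadable vectors as wanted, capped by what remains of maxvec.
 * Returns 0 when even minvec cannot hold the reserved vectors.
 */
static unsigned int calc_vectors(unsigned int minvec, unsigned int maxvec,
				 unsigned int resv, unsigned int set_vecs)
{
	assert(minvec <= maxvec);

	if (resv > minvec)
		return 0;

	return resv + min_uint(set_vecs, maxvec - resv);
}
```

With resv = 2 and set_vecs = 8, a device offering up to 32 vectors yields 2 + min(8, 30) = 10, while a device limited to 6 vectors yields 2 + min(8, 4) = 6. If minvec is 1, the two reserved vectors do not fit and the result is 0.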