commit aff649361671b432570e94c9056932f50dd6f101 openeuler.
----------------------------------------------------------------------
In the past configuration, CONFIG_SCHED_CLUSTER was not set. Now, we need
to open the configuration.
Signed-off-by: Xue Sinian <tangyuan911@yeah.net>
commit 6afb257d6dd71085344e1472ea6e820b5dc0a8e3 openeuler.
----------------------------------------------------------------------
Disable cluster scheduling by default since it's not a universal win.
User can choose to enable it through sysctl or at boot time according to
their scenario.
Signed-off-by: Xue Sinian <tangyuan911@yeah.net>
commit 8ce3e706b31409147f035c037055caa68e450ce5 openeuler.
Reference: https://lore.kernel.org/lkml/cover.1638563225.git.tim.c.chen@linux.intel.com/
----------------------------------------------------------------------
Allow run time configuration of the scheduler to use cluster
scheduling. Configuration can be changed via the sysctl variable
/proc/sys/kernel/sched_cluster. Setting it to 1 enable cluster
scheduling and setting it to 0 turns it off.
Cluster scheduling should benefit independent tasks by load balancing
them between clusters. It reaps the most benefit when the system's CPUs
are not fully busy, so we can spread the tasks out between the clusters to
reduce contention on cluster resource (e.g. L2 cache).
However, if the system is expected to operate close to full utilization,
the system admin could turn this feature off so as not to incur
extra load balancing overhead between the cluster domains.
Signed-off-by: Xue Sinian <tangyuan911@yeah.net>
commit 211b6fb7d5a8558a453475a08a697e651ca2d0cb openeuler.
Reference: https://lore.kernel.org/lkml/cover.1638563225.git.tim.c.chen@linux.intel.com/
----------------------------------------------------------------------
A system admin may not want to use cluster scheduling. Make changes to
allow cluster topology level to be skipped when building sched domains.
Create SDTL_SKIP bit on the sched_domain_topology_level flag so we can
check if the cluster topology level should be skipped when building
sched domains.
Signed-off-by: Xue Sinian <tangyuan911@yeah.net>
Reference: https://lore.kernel.org/lkml/20220915073423.25535-1-yangyicong@huawei.com/
commit 0c3a4f986962ed94da6e26ba3ec0bdf700945894 openeuler.
----------------------------------------------------------------------
For platforms having clusters like Kunpeng920, CPUs within the same cluster
have lower latency when synchronizing and accessing shared resources like
cache. Thus, this patch tries to find an idle cpu within the cluster of the
target CPU before scanning the whole LLC to gain lower latency.
Testing has been done on Kunpeng920 by pinning tasks to one numa and two
numa. On Kunpeng920, Each numa has 8 clusters and each cluster has 4 CPUs.
With this patch, We noticed enhancement on tbench within one numa or cross
two numa.
On numa 0:
6.0-rc1 patched
Hmean 1 351.20 ( 0.00%) 396.45 * 12.88%*
Hmean 2 700.43 ( 0.00%) 793.76 * 13.32%*
Hmean 4 1404.42 ( 0.00%) 1583.62 * 12.76%*
Hmean 8 2833.31 ( 0.00%) 3147.85 * 11.10%*
Hmean 16 5501.90 ( 0.00%) 6089.89 * 10.69%*
Hmean 32 10428.59 ( 0.00%) 10619.63 * 1.83%*
Hmean 64 8223.39 ( 0.00%) 8306.93 * 1.02%*
Hmean 128 7042.88 ( 0.00%) 7068.03 * 0.36%*
On numa 0-1:
6.0-rc1 patched
Hmean 1 363.06 ( 0.00%) 397.13 * 9.38%*
Hmean 2 721.68 ( 0.00%) 789.84 * 9.44%*
Hmean 4 1435.15 ( 0.00%) 1566.01 * 9.12%*
Hmean 8 2776.17 ( 0.00%) 3007.05 * 8.32%*
Hmean 16 5471.71 ( 0.00%) 6103.91 * 11.55%*
Hmean 32 10164.98 ( 0.00%) 11531.81 * 13.45%*
Hmean 64 17143.28 ( 0.00%) 20078.68 * 17.12%*
Hmean 128 14552.70 ( 0.00%) 15156.41 * 4.15%*
Hmean 256 12827.37 ( 0.00%) 13326.86 * 3.89%*
Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so the SMT branch
in the code has not been tested but it supposed to work.
Signed-off-by: Xue Sinian <tangyuan911@yeah.net>
Reference: https://lore.kernel.org/lkml/20220915073423.25535-1-yangyicong@huawei.com/
commit 53ad6bf76d9c646e3c8494ed82d90f304c50de1f openeuler.
----------------------------------------------------------------------
Add per-cpu cluster domain info and cpus_share_lowest_cache() API.
This is the preparation for the optimization of select_idle_cpu()
on platforms with cluster scheduler level.
Signed-off-by: Xue Sinian <tangyuan911@yeah.net>
mainline inclusion
from mainline-v6.0-rc5
commit 5ac251c8a0 upstream.
----------------------------------------------------------------------
Currently cpu_clustergroup_mask() will return CPU mask if cluster span more
or the same CPUs as cpu_coregroup_mask(). This will result topology borken
on non-Cluster SMT machines when building with CONFIG_SCHED_CLUSTER=y.
Test with:
qemu-system-aarch64 -enable-kvm -machine virt \
-net none \
-cpu host \
-bios ./QEMU_EFI.fd \
-m 2G \
-smp 48,sockets=2,cores=12,threads=2 \
-kernel $Image \
-initrd $Rootfs \
-nographic
-append "rdinit=init console=ttyAMA0 sched_verbose loglevel=8"
We'll get below error:
[ 3.084568] BUG: arch topology borken
[ 3.084570] the SMT domain not a subset of the CLS domain
Since cluster is a level higher than SMT, fix this by making cluster
spans at least SMT CPUs.
Fixes: bfcc439743 ("arch_topology: Limit span of cpu_clustergroup_mask()")
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Link: https://lore.kernel.org/r/20220905122615.12946-1-yangyicong@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Xue Sinian <tangyuan911@yeah.net>
mainline inclusion
from mainline-v6.0-rc1
commit bfcc439743 upstream.
----------------------------------------------------------------------
Currently the cluster identifier is not set on DT based platforms.
The reset or default value is -1 for all the CPUs. Once we assign the
cluster identifier values correctly, the cluster_sibling mask will be
populated and returned by cpu_clustergroup_mask() to contribute in the
creation of the CLS scheduling domain level, if SCHED_CLUSTER is
enabled.
To avoid topologies that will result in questionable or incorrect
scheduling domains, impose restrictions regarding the span of clusters,
as presented to scheduling domains building code: cluster_sibling should
not span more or the same CPUs as cpu_coregroup_mask().
This is needed in order to obtain a strict separation between the MC and
CLS levels, and maintain the same domains for existing platforms in
the presence of CONFIG_SCHED_CLUSTER, where the new cluster information
is redundant and irrelevant for the scheduler.
While previously the scheduling domain builder code would have removed MC
as redundant and kept CLS if SCHED_CLUSTER was enabled and the
cpu_coregroup_mask() and cpu_clustergroup_mask() spanned the same CPUs,
now CLS will be removed and MC kept.
Link: https://lore.kernel.org/r/20220704101605.1318280-18-sudeep.holla@arm.com
Cc: Darren Hart <darren@os.amperecomputing.com>
Tested-by: Conor Dooley <conor.dooley@microchip.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Xue Sinian <tangyuan911@yeah.net>
mainline inclusion
from mainline-v5.19-rc1
commit 15f214f9bd upstream.
------------------------------------------------------------------------
default_topology[] uses cpu_clustergroup_mask() for the CLS level
(guarded by CONFIG_SCHED_CLUSTER) which is currently provided by x86
(arch/x86/kernel/smpboot.c) and arm64 (drivers/base/arch_topology.c).
Fixes: 778c558f49 ("sched: Add cluster scheduler level in core and related Kconfig for ARM64")
Acked-by: Barry Song <baohua@kernel.org>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20220513093433.425163-1-dietmar.eggemann@arm.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Xue Sinian <tangyuan911@yeah.net>
mainline inclusion
from mainline-v5.17-rc1
commit e795707703 upstream.
----------------------------------------------------------------------
The cluster_id and cluster_cpus topology sysfs attributes have been
added with commit c5e22feffd ("topology: Represent clusters of CPUs
within a die").
They are currently only used for x86, arm64, and riscv (via generic
arch topology), however they are still present with bogus default
values for all other architectures. Instead of enforcing such new
sysfs attributes to all architectures, make them only optional visible
if an architecture opts in by defining both the topology_cluster_id
and topology_cluster_cpumask attributes.
This is similar to what was done when the book and drawer topology
levels were introduced: avoid useless and therefore confusing sysfs
attributes for architectures which cannot make use of them.
This should not break any existing applications, since this is a
new interface introduced with the v5.16 merge window.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/r/20211129130309.3256168-3-hca@linux.ibm.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Xue Sinian <tangyuan911@yeah.net>
mainline inclusion
from mainline-v5.16-rc1
commit c5e22feffd upstream.
------------------------------------------------------------------------
Both ACPI and DT provide the ability to describe additional layers of
topology between that of individual cores and higher level constructs
such as the level at which the last level cache is shared.
In ACPI this can be represented in PPTT as a Processor Hierarchy
Node Structure [1] that is the parent of the CPU cores and in turn
has a parent Processor Hierarchy Nodes Structure representing
a higher level of topology.
For example Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each
cluster has 4 cpus. All clusters share L3 cache data, but each cluster
has local L3 tag. On the other hand, each clusters will share some
internal system bus.
+-----------------------------------+ +---------+
| +------+ +------+ +--------------------------+ |
| | CPU0 | | cpu1 | | +-----------+ | |
| +------+ +------+ | | | | |
| +----+ L3 | | |
| +------+ +------+ cluster | | tag | | |
| | CPU2 | | CPU3 | | | | | |
| +------+ +------+ | +-----------+ | |
| | | |
+-----------------------------------+ | |
+-----------------------------------+ | |
| +------+ +------+ +--------------------------+ |
| | | | | | +-----------+ | |
| +------+ +------+ | | | | |
| | | L3 | | |
| +------+ +------+ +----+ tag | | |
| | | | | | | | | |
| +------+ +------+ | +-----------+ | |
| | | |
+-----------------------------------+ | L3 |
| data |
+-----------------------------------+ | |
| +------+ +------+ | +-----------+ | |
| | | | | | | | | |
| +------+ +------+ +----+ L3 | | |
| | | tag | | |
| +------+ +------+ | | | | |
| | | | | | +-----------+ | |
| +------+ +------+ +--------------------------+ |
+-----------------------------------| | |
+-----------------------------------| | |
| +------+ +------+ +--------------------------+ |
| | | | | | +-----------+ | |
| +------+ +------+ | | | | |
| +----+ L3 | | |
| +------+ +------+ | | tag | | |
| | | | | | | | | |
| +------+ +------+ | +-----------+ | |
| | | |
+-----------------------------------+ | |
+-----------------------------------+ | |
| +------+ +------+ +--------------------------+ |
| | | | | | +-----------+ | |
| +------+ +------+ | | | | |
| | | L3 | | |
| +------+ +------+ +---+ tag | | |
| | | | | | | | | |
| +------+ +------+ | +-----------+ | |
| | | |
+-----------------------------------+ | |
+-----------------------------------+ | |
| +------+ +------+ +--------------------------+ |
| | | | | | +-----------+ | |
| +------+ +------+ | | | | |
| | | L3 | | |
| +------+ +------+ +--+ tag | | |
| | | | | | | | | |
| +------+ +------+ | +-----------+ | |
| | +---------+
+-----------------------------------+
That means spreading tasks among clusters will bring more bandwidth
while packing tasks within one cluster will lead to smaller cache
synchronization latency. So both kernel and userspace will have
a chance to leverage this topology to deploy tasks accordingly to
achieve either smaller cache latency within one cluster or an even
distribution of load among clusters for higher throughput.
This patch exposes cluster topology to both kernel and userspace.
Libraried like hwloc will know cluster by cluster_cpus and related
sysfs attributes. PoC of HWLOC support at [2].
Note this patch only handle the ACPI case.
Special consideration is needed for SMT processors, where it is
necessary to move 2 levels up the hierarchy from the leaf nodes
(thus skipping the processor core level).
Note that arm64 / ACPI does not provide any means of identifying
a die level in the topology but that may be unrelate to the cluster
level.
[1] ACPI Specification 6.3 - section 5.2.29.1 processor hierarchy node
structure (Type 0)
[2] https://github.com/hisilicon/hwloc/tree/linux-cluster
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210924085104.44806-2-21cnbao@gmail.com
Signed-off-by: Xue Sinian <tangyuan911@yeah.net>
commit 4ee4ea443a upstream.
------------------------------------------------------------------------
In preparation of cleaning up the sd_degenerate*() functions, mark flags
used in sd_degenerate() with the new SDF_NEEDS_GROUPS flag. With this,
build a compile-time mask of those SD flags.
Note that sd_parent_degenerate() uses an extra flag in its mask,
SD_PREFER_SIBLING, which remains singled out for now.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-8-valentin.schneider@arm.com
Signed-off-by: Xue Sinian <tangyuan911@yeah.net>
commit b6e862f386 upstream.
------------------------------------------------------------------------
There are some expectations regarding how sched domain flags should be laid
out, but none of them are checked or asserted in
sched_domain_debug_one(). After staring at said flags for a while, I've
come to realize there's two repeating patterns:
- Shared with children: those flags are set from the base CPU domain
upwards. Any domain that has it set will have it set in its children. It
hints at "some property holds true / some behaviour is enabled until this
level".
- Shared with parents: those flags are set from the topmost domain
downwards. Any domain that has it set will have it set in its parents. It
hints at "some property isn't visible / some behaviour is disabled until
this level".
There are two outliers that (currently) do not map to either of these:
o SD_PREFER_SIBLING, which is cleared below levels with
SD_ASYM_CPUCAPACITY. The change was introduced by commit:
9c63e84db2 ("sched/core: Disable SD_PREFER_SIBLING on asymmetric CPU capacity domains")
as it could break misfit migration on some systems. In light of this, we
might want to change it back to make it fit one of the two categories and
fix the issue another way.
o SD_ASYM_CPUCAPACITY, which gets set on a single level and isn't
propagated up nor down. From a topology description point of view, it
really wants to be SDF_SHARED_PARENT; this will be rectified in a later
patch.
Tweak the sched_domain flag declaration to assign each flag an expected
layout, and include the rationale for each flag "meta type" assignment as a
comment. Consolidate the flag metadata into an array; the index of a flag's
metadata can easily be found with log2(flag), IOW __ffs(flag).
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-5-valentin.schneider@arm.com
Signed-off-by: Xue Sinian <tangyuan911@yeah.net>
commit d54a9658a7 upstream.
------------------------------------------------------------------------
To associate the SD flags with some metadata, we need some more structure
in the way they are declared.
Rather than shove that in a free-standing macro list, move the declaration
in a separate file that can be re-imported with different SD_FLAG
definitions. This is inspired by what is done with the syscall
table (see uapi/asm/unistd.h and sys_call_table).
The value assigned to a given SD flag now depends on the order it appears
in sd_flags.h. No change in functionality.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-4-valentin.schneider@arm.com
Signed-off-by: Xue Sinian <tangyuan911@yeah.net>
commit cfe7ddcbd7 upstream.
------------------------------------------------------------------------
This flag was introduced in 2014 by commit:
d77b3ed5c9 ("sched: Add a new SD_SHARE_POWERDOMAIN for sched_domain")
but AFAIA it was never leveraged by the scheduler. The closest thing I can
think of is EAS caring about frequency domains, and it does that by
leveraging performance domains.
Remove the flag. No change in functionality is expected.
Suggested-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-2-valentin.schneider@arm.com
Signed-off-by: Xue Sinian <tangyuan911@yeah.net>
commit 36c5bdc438 upstream.
------------------------------------------------------------------------
That flag is set unconditionally in sd_init(), and no one checks for it
anymore. Remove it.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200415210512.805-5-valentin.schneider@arm.com
Signed-off-by: Xue Sinian <tangyuan911@yeah.net>
commit e669ac8ab9 upstream.
------------------------------------------------------------------------
The SD_LOAD_BALANCE flag is set unconditionally for all domains in
sd_init(). By making the sched_domain->flags syctl interface read-only, we
have removed the last piece of code that could clear that flag - as such,
it will now be always present. Rather than to keep carrying it along, we
can work towards getting rid of it entirely.
cpusets don't need it because they can make CPUs be attached to the NULL
domain (e.g. cpuset with sched_load_balance=0), or to a partitioned
root_domain, i.e. a sched_domain hierarchy that doesn't span the entire
system (e.g. root cpuset with sched_load_balance=0 and sibling cpusets with
sched_load_balance=1).
isolcpus apply the same "trick": isolated CPUs are explicitly taken out of
the sched_domain rebuild (using housekeeping_cpumask()), so they get the
NULL domain treatment as well.
Remove the checks against SD_LOAD_BALANCE.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200415210512.805-4-valentin.schneider@arm.com
Signed-off-by: Xue Sinian <tangyuan911@yeah.net>
commit 0252aa08aafb4a40ea2d821f58e88e99a644b097 openeuler.
When I compile the kernel with CONFIG_PPC_WATCHDOG is disabled on
PowerPC, I got the following compile error:
In file included from kernel/hung_task.c:11:0:
./include/linux/nmi.h: In function ‘touch_nmi_watchdog’:
./include/linux/nmi.h:143:2: error: implicit declaration of function ‘arch_touch_nmi_watchdog’; did you mean ‘touch_nmi_watchdog’? [-Werror=implicit-function-declaration]
arch_touch_nmi_watchdog();
^~~~~~~~~~~~~~~~~~~~~~~
touch_nmi_watchdog
It is because CONFIG_HARDLOCKUP_DETECTOR_PERF is still enabled in my
situation. Fix it by excluding arch_touch_nmi_watchdog() only when
CONFIG_PPC_WATCHDOG is disabled.
Signed-off-by: huwentao <huwentao19@h-partners.com>
commit 0fa83fd0f8f7267be1e31c824cedb9d112504785 openeuler.
Firmware may not trigger SDEI event as required frequency. SDEI event
may be triggered too soon, which cause false hardlockup in kernel. Check
the time stamp in sdei_watchdog_callbak and skip the hardlockup check if
it is invoked too soon.
Signed-off-by: huwentao <huwentao19@h-partners.com>
commit bdda54cc39843589ee91a0176ca9a94adf307763 openeuler.
Functions called in sdei_handler are not allowed to be kprobed, so
marked them as NOKPROBE_SYMBOL. There are so many functions in
'watchdog_check_timestamp()'. Luckily, we don't need
'CONFIG_HARDLOCKUP_CHECK_TIMESTAMP' now. So just make
CONFIG_SDEI_WATCHDOG depends on !CONFIG_HARDLOCKUP_CHECK_TIMESTAMP
in case someone add 'CONFIG_HARDLOCKUP_CHECK_TIMESTAMP' in the future.
Signed-off-by: huwentao <huwentao19@h-partners.com>
commit 13ddc12768ca98d36ec03bfa21a30b3ebc91673d openeuler.
The period of the secure timer is set to 3s by BIOS. That means the
secure timer interrupt will trigger every 3 seconds. To further decrease
the NMI watchdog's effect on performance, this patch set the period of
the secure timer base on 'watchdog_thresh'. This variable is initiallized
to 10s. We can also set the period at runtime by modifying
'/proc/sys/kernel/watchdog_thresh'
Signed-off-by: huwentao <huwentao19@h-partners.com>
commit 75ac7be96da43f12bad247de69137500e02fd37f openeuler.
When we panic in hardlockup, the secure timer interrupt remains activate
because firmware clear eoi after dispatch is completed. This will cause
arm_arch_timer interrupt failed to trigger in the second kernel.
This patch add a new SMC helper to clear eoi of a certain interrupt and
clear eoi of the secure timer before booting the second kernel.
Signed-off-by: huwentao <huwentao19@h-partners.com>
commit 5bc048a102ef9c3748464cacce443a0f1d9bed5b openeuler.
The trigger period of secure time is set by firmware. We need to check
the time_stamp every time the secure time fires to make sure the
hardlockup detection is not executed too soon. We need to refresh
'last_timestamp' to the current time when we enable the nmi_watchdog.
Otherwise, false hardlockup may be detected when the secure timer fires
the first time.
Signed-off-by: huwentao <huwentao19@h-partners.com>
commit cc19c0b385e3bd423e20465b06eb232678ce5c16 openeuler.
Add nmi_watchdog support for arm64 based on SDEI.
Signed-off-by: huwentao <huwentao19@h-partners.com>
commit bef7d8e1432400f3d78339ac269167e09c15dabd openeuler.
We call 'sdei_init' as 'subsys_initcall_sync'. lockup detector need to
be initialised after sdei_init. The influence of this patch is that we
can not detect the hard lockup in init_calls.
Signed-off-by: huwentao <huwentao19@h-partners.com>
commit cfaccce945988392d70ad42924e76f330c25ab9a openeuler.
NMI Watchdog need to enable the event for each core individually. But the
existing public api 'sdei_event_enable' enable events for all cores when
the event type is private.
Signed-off-by: huwentao <huwentao19@h-partners.com>
commit 860744b94a10a159562fc491fd7f3ea1388965c1 openeuler.
This patch add a interrupt binding api function which returns the binded
event number.
Signed-off-by: huwentao <huwentao19@h-partners.com>
commit 4ffed7d5435d12be6762e6fdef92fd2c67fc27df openeuler.
In current code, the hardlockup detect code is contained by
CONFIG_HARDLOCKUP_DETECTOR_PERF. This patch makes this code public so
that other arch hardlockup detector can use it.
Signed-off-by: huwentao <huwentao19@h-partners.com>
commit d8f6267f7c upstream.
Add required PMU interrupt operations for NMIs. Request interrupt lines as
NMIs when possible, otherwise fall back to normal interrupts.
NMIs are only supported on the arm64 architecture with a GICv3 irqchip.
[Alexandru E.: Added that NMIs only work on arm64 + GICv3, print message
when PMU is using NMIs]
Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Tested-by: Sumit Garg <sumit.garg@linaro.org> (Developerbox)
Cc: Julien Thierry <julien.thierry.kdev@gmail.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20200924110706.254996-8-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: huwentao <huwentao19@h-partners.com>
commit f76b130bdb upstream.
Currently the PMU interrupt can either be a normal irq or a percpu irq.
Supporting NMI will introduce two cases for each existing one. It becomes
a mess of 'if's when managing the interrupt.
Define sets of callbacks for operations commonly done on the interrupt. The
appropriate set of callbacks is selected at interrupt request time and
simplifies interrupt enabling/disabling and freeing.
Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Tested-by: Sumit Garg <sumit.garg@linaro.org> (Developerbox)
Cc: Julien Thierry <julien.thierry.kdev@gmail.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20200924110706.254996-7-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: huwentao <huwentao19@h-partners.com>
commit 95e92e45a4 upstream.
kvm_vcpu_kick() is not NMI safe. When the overflow handler is called from
NMI context, defer waking the vcpu to an irq_work queue.
A vcpu can be freed while it's not running by kvm_destroy_vm(). Prevent
running the irq_work for a non-existent vcpu by calling irq_work_sync() on
the PMU destroy path.
[Alexandru E.: Added irq_work_sync()]
Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Tested-by: Sumit Garg <sumit.garg@linaro.org> (Developerbox)
Cc: Julien Thierry <julien.thierry.kdev@gmail.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Suzuki K Pouloze <suzuki.poulose@arm.com>
Cc: kvm@vger.kernel.org
Cc: kvmarm@lists.cs.columbia.edu
Link: https://lore.kernel.org/r/20200924110706.254996-6-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: huwentao <huwentao19@h-partners.com>
commit 05ab728133 upstream.
When handling events, armv8pmu_handle_irq() calls perf_event_overflow(),
and subsequently calls irq_work_run() to handle any work queued by
perf_event_overflow(). As perf_event_overflow() raises IPI_IRQ_WORK when
queuing the work, this isn't strictly necessary and the work could be
handled as part of the IPI_IRQ_WORK handler.
In the common case the IPI handler will run immediately after the PMU IRQ
handler, and where the PE is heavily loaded with interrupts other handlers
may run first, widening the window where some counters are disabled.
In practice this window is unlikely to be a significant issue, and removing
the call to irq_work_run() would make the PMU IRQ handler NMI safe in
addition to making it simpler, so let's do that.
[Alexandru E.: Reworded commit message]
Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Cc: Julien Thierry <julien.thierry.kdev@gmail.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20200924110706.254996-5-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: huwentao <huwentao19@h-partners.com>
commit 2a0e2a02e4 upstream.
The PMU is disabled and enabled, and the counters are programmed from
contexts where interrupts or preemption is disabled.
The functions to toggle the PMU and to program the PMU counters access the
registers directly and don't access data modified by the interrupt handler.
That, and the fact that they're always called from non-preemptible
contexts, means that we don't need to disable interrupts or use a spinlock.
[Alexandru E.: Explained why locking is not needed, removed WARN_ONs]
Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Tested-by: Sumit Garg <sumit.garg@linaro.org> (Developerbox)
Cc: Will Deacon <will.deacon@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20200924110706.254996-4-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: huwentao <huwentao19@h-partners.com>
commit 0fdf1bb759 upstream.
Currently we access the counter registers and their respective type
registers indirectly. This requires us to write to PMSELR, issue an ISB,
then access the relevant PMXEV* registers.
This is unfortunate, because:
* Under virtualization, accessing one register requires two traps to
the hypervisor, even though we could access the register directly with
a single trap.
* We have to issue an ISB which we could otherwise avoid the cost of.
* When we use NMIs, the NMI handler will have to save/restore the select
register in case the code it preempted was attempting to access a
counter or its type register.
We can avoid these issues by directly accessing the relevant registers.
This patch adds helpers to do so.
In armv8pmu_enable_event() we still need the ISB to prevent the PE from
reordering the write to PMINTENSET_EL1 register. If the interrupt is
enabled before we disable the counter and the new event is configured,
we might get an interrupt triggered by the previously programmed event
overflowing, but which we wrongly attribute to the event that we are
enabling. Execute an ISB after we disable the counter.
In the process, remove the comment that refers to the ARMv7 PMU.
[Julien T.: Don't inline read/write functions to avoid big code-size
increase, remove unused read_pmevtypern function,
fix counter index issue.]
[Alexandru E.: Removed comment, removed trailing semicolons in macros,
added ISB]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Tested-by: Sumit Garg <sumit.garg@linaro.org> (Developerbox)
Cc: Julien Thierry <julien.thierry.kdev@gmail.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20200924110706.254996-3-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: huwentao <huwentao19@h-partners.com>
commit 490d7b7c08 upstream.
Writes to the PMXEVTYPER_EL0 register are not self-synchronising. In
armv8pmu_enable_event(), the PE can reorder configuring the event type
after we have enabled the counter and the interrupt. This can lead to an
interrupt being asserted because of the previous event type that we were
counting using the same counter, not the one that we've just configured.
The same rationale applies to writes to the PMINTENSET_EL1 register. The PE
can reorder enabling the interrupt at any point in the future after we have
enabled the event.
Prevent both situations from happening by adding an ISB just before we
enable the event counter.
Fixes: 030896885a ("arm64: Performance counters support")
Reported-by: Julien Thierry <julien.thierry@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Tested-by: Sumit Garg <sumit.garg@linaro.org> (Developerbox)
Cc: Julien Thierry <julien.thierry.kdev@gmail.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20200924110706.254996-2-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: huwentao <huwentao19@h-partners.com>
commit 7c73ddb7589fb8ddb1136b6306dfb72089c81511 upstream.
jbd2_transaction_committed() is used to check whether a transaction with
the given tid has already committed, it holds j_state_lock in read mode
and check the tid of current running transaction and committing
transaction, but holding the j_state_lock is expensive.
We have already stored the sequence number of the most recently
committed transaction in journal t->j_commit_sequence, we could do this
check by comparing it with the given tid instead. If the given tid isn't
smaller than j_commit_sequence, we can ensure that the given transaction
has been committed. That way we could drop the expensive lock and
achieve about 10% ~ 20% performance gains in concurrent DIOs on may
virtual machine with 100G ramdisk.
fio -filename=/mnt/foo -direct=1 -iodepth=10 -rw=$rw -ioengine=libaio \
-bs=4k -size=10G -numjobs=10 -runtime=60 -overwrite=1 -name=test \
-group_reporting
Before:
overwrite IOPS=88.2k, BW=344MiB/s
read IOPS=95.7k, BW=374MiB/s
rand overwrite IOPS=98.7k, BW=386MiB/s
randread IOPS=102k, BW=397MiB/s
After:
overwrite IOPS=105k, BW=410MiB/s
read IOPS=112k, BW=436MiB/s
rand overwrite IOPS=104k, BW=404MiB/s
randread IOPS=111k, BW=432MiB/s
CC: Dave Chinner <david@fromorbit.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Link: https://lore.kernel.org/linux-ext4/ZjILCPNZRHeazSqV@dread.disaster.area/
Signed-off-by: huwentao <huwentao19@h-partners.com>
\#MC is a NMI-like exception. But do not do any setup that NMIs need.
This will lead to console_owner_lock issue and HPET dead loop issue.
For example, The HPET dead loop issue:
CPU x CPU x
---- ----
read_hpet()
arch_spin_trylock(&hpet.lock)
[CPU x got the hpet.lock] #MCE happened
do_machine_check()
mce_panic()
panic()
kmsg_dump()
pstore_dump()
pstore_record_init()
ktime_get_real_fast_ns()
read_hpet()
[dead loops]
This may lead to read_hpet dead loops.
The console_owner_lock issue is similar.
To avoid these issues, add NMIs setup When Handling #MC Exceptions.
Signed-off-by: LeoLiu-oc <leoliu-oc@zhaoxin.com>
OverCurrent condition is not standardized in the UHCI spec.
Zhaoxin UHCI controllers report OverCurrent bit active off.
In order to handle OverCurrent condition correctly, the uhci-hcd
driver needs to be told to expect the active-off behavior.
Suggested-by: Alan Stern <stern@rowland.harvard.edu>
Cc: stable@vger.kernel.org
Signed-off-by: Weitao Wang <WeitaoWang-oc@zhaoxin.com>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Link: https://lore.kernel.org/r/20230423105952.4526-1-WeitaoWang-oc@zhaoxin.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: LeoLiu-oc <leoliu-oc@zhaoxin.com>
zhaoxin inclusion
category: feature
-------------------
Add the new PCI ID 0x1d17 0x9141/0x9142/0x9144 Zhaoxin NB HDAC
support. And add some special initialization for Zhaoxin NB HDAC.
Signed-off-by: LeoLiu-oc <leoliu-oc@zhaoxin.com>