A fix to the handling of misaligned paste instructions (P9 only), where a change
to a #define has caused the check for the instruction to always fail.
The preempt handling was unbalanced in the radix THP flush (P9 only). Though we
don't generally use preempt we want to keep it working as much as possible.
Two fixes for IMC (P9 only), one when booting with restricted number of CPUs and
one in the error handling when initialisation fails due to firmware etc.
A revert to fix function_graph on big endian machines, and then a rework of the
reverted patch to fix kprobes blacklist handling on big endian machines.
Thanks to:
Anju T Sudhakar, Guilherme G. Piccoli, Madhavan Srinivasan, Naveen N. Rao,
Nicholas Piggin, Paul Mackerras.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJZ/ASDAAoJEFHr6jzI4aWAtWYP/34uCIJ83roZHfBdASGcczeO
AEOnxZdVdAB6ynU0eM2JSlJMhx39mqT5qJQP4ucUlqNa+SHhFxTW4EmoQm/qAr9N
4xhwix6OAsMleiPJx+RGGAnKB8rFhUEYH/+cql67FAq3u8zRbAdurR/A1lnIAY9x
2bQ85GsjFlRSFdmSohXegBJ7Xey3QE3jpj+co3dAEm3JQVRHNpGqgvRy8yeowvod
c/hHH7vmdKyDCZDtS5uF/egZOJLK1mueVuV69O0KKxT7xtEKk1cjFB08aHhcuJDm
QIk5Efr8iCbn4EyRBzthij7YJ0VPBTNCmsGcf6RorrYv3hT0uN72qQPjg13WH7HY
wkdMtkQvaFLbnZDqOxm9saczE27Ce7QNe1PNh6eapNlunz2nsb8Z/SvSyqskK3Yh
5cBxHrfi+NcJHvdDAJNobNoJ3vlTJGI0qQ/dSexJsXxZw2zR3tPojRKCtK0pZGql
xKDSqU7AbuUhLMxVBDDSQ+KONwyaBS84baLc98dapflw7htGWU339c5f9JTmnGvZ
7fWT3hNihAmZNZZ957BC5J0h5kgkKgZ7y67SA/+RI0o2ZoJlxoeDYAyqc7dwfvs+
ncxG4S6okSQP/EHf9bRSdYeJsZup/zP7+wD+02cbJd5zFJ+6cfu0AhPDRcBNMjHg
EPkZvfZzRxTlGvmHCoWv
=vTi1
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.14-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"Some more powerpc fixes for 4.14.
This is bigger than I like to send at rc7, but that's at least partly
because I didn't send any fixes last week. If it wasn't for the IMC
driver, which is new and getting heavy testing, the diffstat would
look a bit better. I've also added ftrace on big endian to my test
suite, so we shouldn't break that again in future.
- A fix to the handling of misaligned paste instructions (P9 only),
where a change to a #define has caused the check for the
instruction to always fail.
- The preempt handling was unbalanced in the radix THP flush (P9
only). Though we don't generally use preempt we want to keep it
working as much as possible.
- Two fixes for IMC (P9 only), one when booting with restricted
number of CPUs and one in the error handling when initialisation
fails due to firmware etc.
- A revert to fix function_graph on big endian machines, and then a
rework of the reverted patch to fix kprobes blacklist handling on
big endian machines.
Thanks to: Anju T Sudhakar, Guilherme G. Piccoli, Madhavan Srinivasan,
Naveen N. Rao, Nicholas Piggin, Paul Mackerras"
* tag 'powerpc-4.14-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/perf: Fix core-imc hotplug callback failure during imc initialization
powerpc/kprobes: Dereference function pointers only if the address does not belong to kernel text
Revert "powerpc64/elfv1: Only dereference function descriptor for non-text symbols"
powerpc/64s/radix: Fix preempt imbalance in TLB flush
powerpc: Fix check for copy/paste instructions in alignment handler
powerpc/perf: Fix IMC allocation routine
In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJZ9kEFAAoJEHm+PkMAQRiGw6wH/0j197qyGd0hkVFMJO6LAgN3
KQWS4nZ5BkVDocwv0RVnUJTtXqU1eozFgdVEtSoaFXpzlHGuptR2Tau9efDCJ7w3
/utZxqvhGebZd2T+j+/o/LE8BRQxhADBNJq2D/o0WNt8ecxuG0GIkhkEYt/o3z1v
/sxlwVwzXB7Dc/h1WcgGJG7cS6L9KzzAzGAS/iNvdFrPOygHBv8c0MxVZIiBIeeK
1nZdyvbyM8uenSyG+prGt9ENrqXZxxfwUxIchi2V7A9m1WmD5zijNkf1JCWji/O+
UsA1auxna7MwoxjxqZuGm4MlKOwZ+8xutk4JGgc+aP/ulndJbJYu+4op/3vaFBM=
=Mhx+
-----END PGP SIGNATURE-----
Backmerge tag 'v4.14-rc7' into drm-next
Linux 4.14-rc7
Requested by Ben Skeggs for nouveau to avoid major conflicts,
and things were getting a bit conflicty already, esp around amdgpu
reverts.
This makes the changes introduced in commit 83e840c770
("powerpc64/elfv1: Only dereference function descriptor for non-text
symbols") to be specific to the kprobe subsystem.
We previously changed ppc_function_entry() to always check the provided
address to confirm if it needed to be dereferenced. This is actually
only an issue for kprobe blacklisted asm labels (through use of
_ASM_NOKPROBE_SYMBOL) and can cause other issues with ftrace. Also, the
additional checks are not really necessary for our other uses.
As such, move this check to the kprobes subsystem.
Fixes: 83e840c770 ("powerpc64/elfv1: Only dereference function descriptor for non-text symbols")
Cc: stable@vger.kernel.org # v4.13+
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch removes the restriction that a radix host can only run
radix guests, allowing us to run HPT (hashed page table) guests as
well. This is useful because it provides a way to run old guest
kernels that know about POWER8 but not POWER9.
Unfortunately, POWER9 currently has a restriction that all threads
in a given code must either all be in HPT mode, or all in radix mode.
This means that when entering a HPT guest, we have to obtain control
of all 4 threads in the core and get them to switch their LPIDR and
LPCR registers, even if they are not going to run a guest. On guest
exit we also have to get all threads to switch LPIDR and LPCR back
to host values.
To make this feasible, we require that KVM not be in the "independent
threads" mode, and that the CPU cores be in single-threaded mode from
the host kernel's perspective (only thread 0 online; threads 1, 2 and
3 offline). That allows us to use the same code as on POWER8 for
obtaining control of the secondary threads.
To manage the LPCR/LPIDR changes required, we extend the kvm_split_info
struct to contain the information needed by the secondary threads.
All threads perform a barrier synchronization (where all threads wait
for every other thread to reach the synchronization point) on guest
entry, both before and after loading LPCR and LPIDR. On guest exit,
they all once again perform a barrier synchronization both before
and after loading host values into LPCR and LPIDR.
Finally, it is also currently necessary to flush the entire TLB every
time we enter a HPT guest on a radix host. We do this on thread 0
with a loop of tlbiel instructions.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This merges in the ppc-kvm topic branch of the powerpc tree to get the
commit that reverts the patch "KVM: PPC: Book3S HV: POWER9 does not
require secondary thread management". This is needed for subsequent
patches which will be applied on this branch.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Commit 07d2a628bc ("powerpc/64s: Avoid cpabort in context switch
when possible", 2017-06-09) changed the definition of PPC_INST_COPY
and in so doing inadvertently broke the check for copy/paste
instructions in the alignment fault handler. The check currently
matches no instructions.
This fixes it by ANDing both sides of the comparison with the mask.
Fixes: 07d2a628bc ("powerpc/64s: Avoid cpabort in context switch when possible")
Cc: stable@vger.kernel.org # v4.13+
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Back in 2008 we added support for "fast little-endian switch" in the
syscall path. This added a special case syscall number 0x1ebe, which
is caught very early in the system call exception and switches endian
with as little overhead as possible. See commit 745a14cc26
("[POWERPC] Add fast little-endian switch system call") for full
details.
Although it is fast, it's also completely non standard. The "syscall
number" is out of the range of normal syscalls, it can't be traced or
audited, and it's a bit of a wart. To the best of our knowledge it was
only used by one program, now long since discontinued.
So in an effort to shake out any current users, put it behind a config
option, and make it default n. If anyone *is* using it they can
quickly reinstate it with a rebuild, and we can flip it to default y.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Unfortunately userspace can construct a sigcontext which enables
suspend. Thus userspace can force Linux into a path where trechkpt is
executed.
This patch blocks this from happening on POWER9 by sanity checking
sigcontexts passed in.
ptrace doesn't have this problem as only MSR SE and BE can be changed
via ptrace.
This patch also adds a number of WARN_ON()s in case we ever enter
suspend when we shouldn't. This should not happen, but if it does the
symptoms are soft lockup warnings which are not obviously TM related,
so the WARN_ON()s should make it obvious what's happening.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Some Power9 revisions can run in a mode where TM operates without
suspended state. If we find ourself on a CPU that might be in this
mode, we query OPAL to check, and if so we reenable TM in CPU
features, and enable a new user feature to signal to userspace that we
are in this mode.
We do not enable the "normal" user feature, PPC_FEATURE2_HTM, but we
do enable PPC_FEATURE2_HTM_NOSC because that indicates to userspace
that the kernel will abort transactions on syscall entry, which is
true regardless of the suspend mode.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently the kernel relies on firmware to inform it whether or not the
CPU supports HTM and as long as the kernel was built with
CONFIG_PPC_TRANSACTIONAL_MEM=y then it will allow userspace to make
use of the facility.
There may be situations where it would be advantageous for the kernel
to not allow userspace to use HTM, currently the only way to achieve
this is to recompile the kernel with CONFIG_PPC_TRANSACTIONAL_MEM=n.
This patch adds a simple commandline option so that HTM can be
disabled at boot time.
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
[mpe: Simplify to a bool, move to prom.c, put doco in the right place.
Always disable, regardless of initial state, to avoid user confusion.]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This reverts commit 94a04bc25a.
In order to run HPT guests on a radix POWER9 host, we will have to run
the host in single-threaded mode, because POWER9 processors do not
currently support running some threads of a core in HPT mode while
others are in radix mode ("mixed mode").
That means that we will need the same mechanisms that are used on
POWER8 to make the secondary threads available to KVM, which were
disabled on POWER9 by commit 94a04bc25a.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
arch_spin_lock_flags() is an internal part of the spinlock implementation
and is no longer available when SMP=n and DEBUG_SPINLOCK=y, so the PPC
RTAS code fails to compile in this configuration:
arch/powerpc/kernel/rtas.c: In function 'lock_rtas':
>> arch/powerpc/kernel/rtas.c:81:2: error: implicit declaration of function 'arch_spin_lock_flags' [-Werror=implicit-function-declaration]
arch_spin_lock_flags(&rtas.lock, flags);
^~~~~~~~~~~~~~~~~~~~
Since there's no good reason to use arch_spin_lock_flags() here (the code
in question already calls local_irq_save(flags)), switch it over to
arch_spin_lock and get things building again.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1508327469-20231-1-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Daniel Axtens reported that on the HiSilicon D05 board, the VGA device is
behind a bridge that doesn't support PCI_BRIDGE_CTL_VGA, so the VGA arbiter
never selects it as the default, which means Xorg auto-detection doesn't
work.
VGA is a legacy PCI feature: a VGA device can respond to addresses, e.g.,
[mem 0xa0000-0xbffff], [io 0x3b0-0x3bb], [io 0x3c0-0x3df], etc., that are
not configurable by BARs. Consequently, multiple VGA devices can conflict
with each other. The VGA arbiter avoids conflicts by ensuring that those
legacy resources are only routed to one VGA device at a time.
The arbiter identifies the "default VGA" device, i.e., a legacy VGA device
that was used by boot firmware. It selects the first device that:
- is of PCI_CLASS_DISPLAY_VGA,
- has both PCI_COMMAND_IO and PCI_COMMAND_MEMORY enabled, and
- has PCI_BRIDGE_CTL_VGA set in all upstream bridges.
Some systems don't have such a device. For example, if a host bridge
doesn't support I/O space, PCI_COMMAND_IO probably won't be enabled for any
devices below it. Or, as on the HiSilicon D05, the VGA device may be
behind a bridge that doesn't support PCI_BRIDGE_CTL_VGA, so accesses to the
legacy VGA resources will never reach the device.
This patch extends the arbiter so that if it doesn't find a device that
meets all the above criteria, it selects the first device that:
- is of PCI_CLASS_DISPLAY_VGA and
- has PCI_COMMAND_IO or PCI_COMMAND_MEMORY enabled
If it doesn't find even that, it selects the first device that:
- is of class PCI_CLASS_DISPLAY_VGA.
Such a device may not be able to use the legacy VGA resources, but most
drivers can operate the device without those. Setting it as the default
device means its "boot_vga" sysfs file will contain "1", which Xorg (via
libpciaccess) uses to help select its default output device.
This fixes Xorg auto-detection on some arm64 systems (HiSilicon D05 in
particular; see the link below).
It also replaces the powerpc fixup_vga() quirk, albeit with slightly
different semantics: the quirk selected the first VGA device we found, and
overrode that selection with any enabled VGA device we found. If there
were several enabled VGA devices, the *last* one we found would become the
default.
The code here instead selects the *first* enabled VGA device we find, and
if none are enabled, the first VGA device we find.
Link: http://lkml.kernel.org/r/20170901072744.2409-1-dja@axtens.net
Tested-by: Daniel Axtens <dja@axtens.net> # arm64, ppc64-qemu-tcg
Tested-by: Zhou Wang <wangzhou1@hisilicon.com> # D05 Hisi Hip07, Hip08
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20171013034721.14630.65913.stgit@bhelgaas-glaptop.roam.corp.google.com
If we are in user space and hit a UE error, we now have the
basic infrastructure to walk the page tables and find out
the effective address that was accessed, since the DAR
is not valid.
We use a work_queue content to hookup the bad pfn, any
other context causes problems, since memory_failure itself
can call into schedule() via lru_drain_ bits.
We could probably poison the struct page to avoid a race
between detection and taking corrective action.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Hookup instruction errors (UE) for memory offling via memory_failure()
in a manner similar to load/store errors (derror). Since we have access
to the NIP, the conversion is a one step process in this case.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Extract physical_address for UE errors by walking the page
tables for the mm and address at the NIP, to extract the
instruction. Then use the instruction to find the effective
address via analyse_instr().
We might have page table walking races, but we expect them to
be rare, the physical address extraction is best effort. The idea
is to then hook up this infrastructure to memory failure eventually.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Use the same alignment as Effective address.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
There are no users of get_mce_fault_addr() since commit
1363875bdb ("powerpc/64s: fix handling of non-synchronous machine
checks") removed the last usage.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
A fix for a bad bug (written by me) in our livepatch handler. Removal of an
over-zealous lockdep_assert_cpus_held() in our topology code. A fix to the
recently added emulation of cntlz[wd]. And three small fixes to the recently
added IMC PMU driver.
Thanks to:
Anju T Sudhakar, Balbir Singh, Kamalesh Babulal, Naveen N. Rao, Sandipan Das,
Santosh Sivaraj, Thiago Jung Bauermann.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJZ4JXVAAoJEFHr6jzI4aWAi9MQAMBJLPG62uDmuhRY6v1MLp44
8iVvNqTSGXuzz2c293Zo3Y8LHQ28S7f5wdwZs75fZ6GVup033pkuhL8XipFTB87m
RzotMwJ1V1H0J7RRsDZ+QcyybwfE4GdqRHLjSU5VIUwJu7cQSIjjNWdkPUqswzXI
g3BbNoxEOlEx31NbyNB+as7AnSvS0lLMvwB7TEy9rf3mmadN3UiAl3OJ93M/7Nm2
qCWjCib/gUQ5U4N5STKWl62yGyvJ330OHoWdsTlHzHgYsEatQFHN8mjnW+UPN3rR
Mz+xCIkt6PnYMdiJNXND42iwdx/7C7BR9JhQ5t6150Swfnbe+CT5nBxk2+IDKpQu
V1rSX4S18TLQ+YCQm0wKuaQs/0EZ4kiHqUDZYVP/YQLSSYM5Ftf4W94Dwu+AZdtr
wWX3szZxqvCZjvdT8HWQQW8vAIqpdxOkr019fjoUzcDoUIYKs3cLDJVx6CAY6n3t
GGN3oGhffKbg1NyldDrZRDBJ+ie7gGincxlcMe1YrXDsXNum6edAhu01cUz7o9vO
b9/fIPjWAWotS7kqWXgGfZL6vlRx3cdSh58vKaThmNAbZLaIAnPTYUynXqXZ++Nn
bBDQUE9zwJ42ZgzYDr3bT6+9pwqPQ3w8zmOjZGB/J1ygS8k9/0gYVe1FO2N8vElM
w5VBKl69v2RwItwHuSRJ
=wWnt
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.14-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"A fix for a bad bug (written by me) in our livepatch handler. Removal
of an over-zealous lockdep_assert_cpus_held() in our topology code. A
fix to the recently added emulation of cntlz[wd]. And three small
fixes to the recently added IMC PMU driver.
Thanks to: Anju T Sudhakar, Balbir Singh, Kamalesh Babulal, Naveen N.
Rao, Sandipan Das, Santosh Sivaraj, Thiago Jung Bauermann"
* tag 'powerpc-4.14-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/perf: Fix IMC initialization crash
powerpc/perf: Add ___GFP_NOWARN flag to alloc_pages_node()
powerpc/perf: Fix for core/nest imc call trace on cpuhotplug
powerpc: Don't call lockdep_assert_cpus_held() from arch_update_cpu_topology()
powerpc/lib/sstep: Fix count leading zeros instructions
powerpc/livepatch: Fix livepatch stack access
Use WARN_ON(), while running out of stubs in stub_for_addr()
and abort loading of the module instead of BUG_ON().
Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
While running stress test with livepatch module loaded, kernel bug was
triggered.
cpu 0x5: Vector: 400 (Instruction Access) at [c0000000eb9d3b60]
5:mon> t
[c0000000eb9d3de0] c0000000eb9d3e30 (unreliable)
[c0000000eb9d3e30] c000000000008ab4 hardware_interrupt_common+0x114/0x120
--- Exception: 501 (Hardware Interrupt) at c000000000053040 livepatch_handler+0x4c/0x74
[c0000000eb9d4120] 0000000057ac6e9d (unreliable)
[d0000000089d9f78] 2e0965747962382e
SP (965747962342e09) is in userspace
When an interrupt occurs during the livepatch_handler execution, it's
possible for the livepatch_stack and/or thread_info to be corrupted.
eg:
Task A Interrupt Handler
========= =================
livepatch_handler:
mr r0, r1
ld r1, TI_livepatch_sp(r12)
hardware_interrupt_common:
do_IRQ+0x8:
mflr r0 <- saved stack pointer is overwritten
bl _mcount
...
std r27,-40(r1) <- overwrite of thread_info()
lis r2, STACK_END_MAGIC@h
ori r2, r2, STACK_END_MAGIC@l
ld r12, -8(r1)
Fix the corruption by using r11 register for livepatch stack
manipulation, instead of shuffling task stack and livepatch stack into
r1 register. Using r11 register also avoids disabling/enabling irq's
while setting up the livepatch stack.
Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Reviewed-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Merge powerpc transactional memory fixes from Michael Ellerman:
"I figured I'd still send you the commits using a bundle to make sure
it works in case I need to do it again in future"
This fixes transactional memory state restore for powerpc.
* bundle'd patches from Michael Ellerman:
powerpc/tm: Fix illegal TM state in signal handler
powerpc/64s: Use emergency stack for kernel TM Bad Thing program checks
Nine small fixes, really nothing that stands out.
A work around for a spurious MCE on Power9. A CXL fault handling fix, some fixes
to the new XIVE code, and a fix to the new 32-bit STRICT_KERNEL_RWX code.
Fixes for old code/stable: an fix to an incorrect TLB flush on boot but not on
any current machines, a compile error on 4xx and a fix to memory hotplug when
using radix (Power9).
Thanks to:
Anton Blanchard, Cédric Le Goater, Christian Lamparter, Christophe Leroy,
Christophe Lombard, Guenter Roeck, Jeremy Kerr, Michael Neuling, Nicholas
Piggin.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJZ11/yAAoJEFHr6jzI4aWAooYQAKjOKamLyxRotFGSqBfrfOMN
UfWKAmfeoF/gi4z03x8JdhcMQ1CMh5c/cQmrwzkzDosQ2m83CEWzNbk0gFV0Tpld
2qDTrEnBfTC4TRhCwWuOlaQC8LFeqxwpsCS6iXGpYoTY2NzmsbliPIQaruCIqj3f
F+7etfTqGJ+gCCkGfT0s7m96QO+daaT32nXL1VM8i/rOOpwZ23oj6kR3/ZSTWxBc
eyzvUcTLLhLp8hJoF0kU4f1VQtYijnpqK6ULr3wSfxYqjB9S4dsGyiF537aw5bKX
UPUCoMgkttVy8Jh35dtvKLuMqlfjHGqoe6UcezSG9a6wxeUnQRzsVmXihNDSkyEt
EawmPS6R5YaXXu9NKsleKBZDAdU4+EkcX24Wwn/lZy3wTKc1Lys+wH9hXuzzGNiH
JYEcMacXvpM1yFYNT+ouyLtoT86VWVCdwihueBMV74lyV87Wr55zS65XP6Z+Bb8/
aPL906sOAvnJW1JhHbYq/UxLfvs0enHNHpyR0xM4/oXi+RoGfhbM/MC1uJxv4w9z
12n1hbrpaRxHSSnJr1gahCrDoNUXLjkO2+0Ur3iXP4tfMMhwDs8qCILeaKpNmzkC
sO1ehOmXSSYqJLNiPmSIxs/QEzjFrvQdNwZOIav+3Tsc5ofR6wsauDCmfK5br2Mg
BdhW7LXtqL77cmOzN95e
=etbe
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.14-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"Nine small fixes, really nothing that stands out.
A work-around for a spurious MCE on Power9. A CXL fault handling fix,
some fixes to the new XIVE code, and a fix to the new 32-bit
STRICT_KERNEL_RWX code.
Fixes for old code/stable: an fix to an incorrect TLB flush on boot
but not on any current machines, a compile error on 4xx and a fix to
memory hotplug when using radix (Power9).
Thanks to: Anton Blanchard, Cédric Le Goater, Christian Lamparter,
Christophe Leroy, Christophe Lombard, Guenter Roeck, Jeremy Kerr,
Michael Neuling, Nicholas Piggin"
* tag 'powerpc-4.14-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/powernv: Increase memory block size to 1GB on radix
powerpc/mm: Call flush_tlb_kernel_range with interrupts enabled
powerpc/xive: Clear XIVE internal structures when a CPU is removed
powerpc/xive: Fix IPI reset
powerpc/4xx: Fix compile error with 64K pages on 40x, 44x
powerpc: Fix action argument for cpufeatures-based TLB flush
cxl: Fix memory page not handled
powerpc: Fix workaround for spurious MCE on POWER9
powerpc: Handle MCE on POWER9 with only DSISR bit 30 set
Pull watchddog clean-up and fixes from Thomas Gleixner:
"The watchdog (hard/softlockup detector) code is pretty much broken in
its current state. The patch series addresses this by removing all
duct tape and refactoring it into a workable state.
The reasons why I ask for inclusion that late in the cycle are:
1) The code causes lockdep splats vs. hotplug locking which get
reported over and over. Unfortunately there is no easy fix.
2) The risk of breakage is minimal because it's already broken
3) As 4.14 is a long term stable kernel, I prefer to have working
watchdog code in that and the lockdep issues resolved. I wouldn't
ask you to pull if 4.14 wouldn't be a LTS kernel or if the
solution would be easy to backport.
4) The series was around before the merge window opened, but then got
delayed due to the UP failure caused by the for_each_cpu()
surprise which we discussed recently.
Changes vs. V1:
- Addressed your review points
- Addressed the warning in the powerpc code which was discovered late
- Changed two function names which made sense up to a certain point
in the series. Now they match what they do in the end.
- Fixed a 'unused variable' warning, which got not detected by the
intel robot. I triggered it when trying all possible related config
combinations manually. Randconfig testing seems not random enough.
The changes have been tested by and reviewed by Don Zickus and tested
and acked by Micheal Ellerman for powerpc"
* 'core-watchdog-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
watchdog/core: Put softlockup_threads_initialized under ifdef guard
watchdog/core: Rename some softlockup_* functions
powerpc/watchdog: Make use of watchdog_nmi_probe()
watchdog/core, powerpc: Lock cpus across reconfiguration
watchdog/core, powerpc: Replace watchdog_nmi_reconfigure()
watchdog/hardlockup/perf: Fix spelling mistake: "permanetely" -> "permanently"
watchdog/hardlockup/perf: Cure UP damage
watchdog/hardlockup: Clean up hotplug locking mess
watchdog/hardlockup/perf: Simplify deferred event destroy
watchdog/hardlockup/perf: Use new perf CPU enable mechanism
watchdog/hardlockup/perf: Implement CPU enable replacement
watchdog/hardlockup/perf: Implement init time detection of perf
watchdog/hardlockup/perf: Implement init time perf validation
watchdog/core: Get rid of the racy update loop
watchdog/core, powerpc: Make watchdog_nmi_reconfigure() two stage
watchdog/sysctl: Clean up sysctl variable name space
watchdog/sysctl: Get rid of the #ifdeffery
watchdog/core: Clean up header mess
watchdog/core: Further simplify sysctl handling
watchdog/core: Get rid of the thread teardown/setup dance
...
Currently it's possible that on returning from the signal handler
through the restore_tm_sigcontexts() code path (e.g. from a signal
caught due to a `trap` instruction executed in the middle of an HTM
block, or a deliberately constructed sigframe) an illegal TM state
(like TS=10 TM=0, i.e. "T0") is set in SRR1 and when `rfid` sets
implicitly the MSR register from SRR1 register on return to userspace
it causes a TM Bad Thing exception.
That illegal state can be set (a) by a malicious user that disables
the TM bit by tweaking the bits in uc_mcontext before returning from
the signal handler or (b) by a sufficient number of context switches
occurring such that the load_tm counter overflows and TM is disabled
whilst in the signal handler.
This commit fixes the illegal TM state by ensuring that TM bit is
always enabled before we return from restore_tm_sigcontexts(). A small
comment correction is made as well.
Fixes: 5d176f751e ("powerpc: tm: Enable transactional memory (TM) lazily for userspace")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Gustavo Romero <gromero@linux.vnet.ibm.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When using transactional memory (TM), the CPU can be in one of six
states as far as TM is concerned, encoded in the Machine State
Register (MSR). Certain state transitions are illegal and if attempted
trigger a "TM Bad Thing" type program check exception.
If we ever hit one of these exceptions it's treated as a bug, ie. we
oops, and kill the process and/or panic, depending on configuration.
One case where we can trigger a TM Bad Thing, is when returning to
userspace after a system call or interrupt, using RFID. When this
happens the CPU first restores the user register state, in particular
r1 (the stack pointer) and then attempts to update the MSR. However
the MSR update is not allowed and so we take the program check with
the user register state, but the kernel MSR.
This tricks the exception entry code into thinking we have a bad
kernel stack pointer, because the MSR says we're coming from the
kernel, but r1 is pointing to userspace.
To avoid this we instead always switch to the emergency stack if we
take a TM Bad Thing from the kernel. That way none of the user
register values are used, other than for printing in the oops message.
This is the fix for CVE-2017-1000255.
Fixes: 5d176f751e ("powerpc: tm: Enable transactional memory (TM) lazily for userspace")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
[mpe: Rewrite change log & comments, tweak asm slightly]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add a check for p->state == TASK_RUNNING so that any wake-ups on
task_struct p in the interim lead to 0 being returned by get_wchan().
Signed-off-by: Kautuk Consul <kautuk.consul.1980@gmail.com>
[mpe: Confirmed other architectures do similar]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The "reset" argument passed to pci_iov_add_virtfn() and
pci_iov_remove_virtfn() is always zero since 46cb7b1bd8 ("PCI: Remove
unused SR-IOV VF Migration support")
Remove the argument together with the associated code.
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Russell Currey <ruscur@russell.cc>
Fix a circa 2005 FIXME by implementing a check to ensure that we
actually got into the jprobe break handler() due to the trap in
jprobe_return().
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Kamalesh pointed out that we are getting the below call traces with
livepatched functions when we enable CONFIG_PREEMPT:
[ 495.470721] BUG: using __this_cpu_read() in preemptible [00000000] code: cat/8394
[ 495.471167] caller is is_current_kprobe_addr+0x30/0x90
[ 495.471171] CPU: 4 PID: 8394 Comm: cat Tainted: G K 4.13.0-rc7-nnr+ #95
[ 495.471173] Call Trace:
[ 495.471178] [c00000008fd9b960] [c0000000009f039c] dump_stack+0xec/0x160 (unreliable)
[ 495.471184] [c00000008fd9b9a0] [c00000000059169c] check_preemption_disabled+0x15c/0x170
[ 495.471187] [c00000008fd9ba30] [c000000000046460] is_current_kprobe_addr+0x30/0x90
[ 495.471191] [c00000008fd9ba60] [c00000000004e9a0] ftrace_call+0x1c/0xb8
[ 495.471195] [c00000008fd9bc30] [c000000000376fd8] seq_read+0x238/0x5c0
[ 495.471199] [c00000008fd9bcd0] [c0000000003cfd78] proc_reg_read+0x88/0xd0
[ 495.471203] [c00000008fd9bd00] [c00000000033e5d4] __vfs_read+0x44/0x1b0
[ 495.471206] [c00000008fd9bd90] [c0000000003402ec] vfs_read+0xbc/0x1b0
[ 495.471210] [c00000008fd9bde0] [c000000000342138] SyS_read+0x68/0x110
[ 495.471214] [c00000008fd9be30] [c00000000000bc6c] system_call+0x58/0x6c
Commit c05b8c4474 ("powerpc/kprobes: Skip livepatch_handler() for
jprobes") introduced a helper is_current_kprobe_addr() to help determine
if the current function has been livepatched or if it has a jprobe
installed, both of which modify the NIP. This was subsequently renamed
to __is_active_jprobe().
In the case of a jprobe, kprobe_ftrace_handler() disables pre-emption
before calling into setjmp_pre_handler() which returns without disabling
pre-emption. This is done to ensure that the jprobe handler won't
disappear beneath us if the jprobe is unregistered between the
setjmp_pre_handler() and the subsequent longjmp_break_handler() called
from the jprobe handler. Due to this, we can use __this_cpu_read() in
__is_active_jprobe() with the pre-emption check as we know that
pre-emption will be disabled.
However, if this function has been livepatched, we are still doing this
check and when we do so, pre-emption won't necessarily be disabled. This
results in the call trace shown above.
Fix this by only invoking __is_active_jprobe() when pre-emption is
disabled. And since we now guard this within a pre-emption check, we can
instead use raw_cpu_read() to get the current_kprobe value skipping the
check done by __this_cpu_read().
Fixes: c05b8c4474 ("powerpc/kprobes: Skip livepatch_handler() for jprobes")
Reported-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Tested-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In commit c05b8c4474 ("powerpc/kprobes: Skip livepatch_handler() for
jprobes"), we added a helper is_current_kprobe_addr() to help detect if
the modified regs->nip was due to a jprobe or livepatch. Masami felt
that the function name was not quite clear. To that end, this patch
renames is_current_kprobe_addr() to __is_active_jprobe() and adds a
comment to (hopefully) better clarify the purpose of this helper. The
helper has also now been moved to kprobes-ftrace.c so that it is only
available for KPROBES_ON_FTRACE.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently, we disable instruction emulation if emulate_step() fails for
any reason. However, such failures could be transient and specific to a
particular run. Instead, only disable instruction emulation if we have
never been able to emulate this. If we had emulated this instruction
successfully at least once, then we single step only this probe hit and
continue to try emulating the instruction in subsequent probe hits.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
1. This is only used in kprobes.c, so make it static.
2. Remove the un-necessary (ret == 0) comparison in the else clause.
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The rework of the core hotplug code triggers the WARN_ON in start_wd_cpu()
on powerpc because it is called multiple times for the boot CPU.
The first call is via:
start_wd_on_cpu+0x80/0x2f0
watchdog_nmi_reconfigure+0x124/0x170
softlockup_reconfigure_threads+0x110/0x130
lockup_detector_init+0xbc/0xe0
kernel_init_freeable+0x18c/0x37c
kernel_init+0x2c/0x160
ret_from_kernel_thread+0x5c/0xbc
And then again via the CPU hotplug registration:
start_wd_on_cpu+0x80/0x2f0
cpuhp_invoke_callback+0x194/0x620
cpuhp_thread_fun+0x7c/0x1b0
smpboot_thread_fn+0x290/0x2a0
kthread+0x168/0x1b0
ret_from_kernel_thread+0x5c/0xbc
This can be avoided by setting up the cpu hotplug state with nocalls and
move the initialization to the watchdog_nmi_probe() function. That
initializes the hotplug callbacks without invoking the callback and the
following core initialization function then configures the watchdog for the
online CPUs (in this case CPU0) via softlockup_reconfigure_threads().
Reported-and-tested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: linuxppc-dev@lists.ozlabs.org
Instead of dropping the cpu hotplug lock after stopping NMI watchdog and
threads and reaquiring for restart, the code and the protection rules
become more obvious when holding cpu hotplug lock across the full
reconfiguration.
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1710022105570.2114@nanos
The recent cleanup of the watchdog code split watchdog_nmi_reconfigure()
into two stages. One to stop the NMI and one to restart it after
reconfiguration. That was done by adding a boolean 'run' argument to the
code, which is functionally correct but not necessarily a piece of art.
Replace it by two explicit functions: watchdog_nmi_stop() and
watchdog_nmi_start().
Fixes: 6592ad2fcc ("watchdog/core, powerpc: Make watchdog_nmi_reconfigure() two stage")
Requested-by: Linus 'Nursing his pet-peeve' Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Thomas 'Mopping up garbage' Gleixner <tglx@linutronix.de>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1710021957480.2114@nanos
Use setup_timer function instead of initializing timer with the
function and data fields.
Signed-off-by: Allen Pais <allen.lkml@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This code is used at boot and machine checks, so it should be using
early_radix_enabled() (which is usable any time).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
It is possible to wake from idle due to a system reset exception, in
which case the CPU takes a system reset interrupt to wake from idle,
with system reset as the wakeup reason.
The regular (not idle wakeup) system reset interrupt handler must be
invoked in this case, otherwise the system reset interrupt is lost.
Handle the system reset interrupt immediately after CPU state has been
restored.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In xmon, touch_nmi_watchdog() is not expected to be checking that
other CPUs have not touched the watchdog, so the code will just call
touch_nmi_watchdog() once before re-enabling hard interrupts.
Just update our CPU's state, and ignore apparently stuck SMP threads.
Arguably touch_nmi_watchdog should check for SMP lockups, and callers
should be fixed, but that's not trivial for the input code of xmon.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If sysctl_hardlockup_all_cpu_backtrace is enabled, there is no need to
IPI stuck CPUs for backtrace before trigger_allbutself_cpu_backtrace(),
which does the same thing again.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The SMP watchdog will detect locked CPUs and IPI them to print a
backtrace and registers. If panic on hard lockup is enabled, do not
panic from this handler, because that can cause recursion into the IPI
layer during the panic.
The caller already panics in this case.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The mmu context on the 40x, 44x does not define pte_frag entry. This
causes gcc abort the compilation due to:
setup-common.c: In function ‘setup_arch’:
setup-common.c:908: error: ‘mm_context_t’ has no ‘pte_frag’
This patch fixes the issue by removing the pte_frag initialization in
setup-common.c.
This is possible, because the compiler will do the initialization,
since the mm_context is a sub struct of init_mm. init_mm is declared
in mm_types.h as external linkage.
According to C99 6.2.4.3:
An object whose identifier is declared with external linkage
[...] has static storage duration.
C99 defines in 6.7.8.10 that:
If an object that has static storage duration is not
initialized explicitly, then:
- if it has pointer type, it is initialized to a null pointer
Fixes: b1923caa6e ("powerpc: Merge 32-bit and 64-bit setup_arch()")
Signed-off-by: Christian Lamparter <chunkeey@gmail.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 41d0c2ecde ("powerpc/powernv: Fix local TLB flush for boot
and MCE on POWER9") introduced calls to __flush_tlb_power[89] from the
cpufeatures code, specifying the number of sets to flush.
However, these functions take an action argument, not a number of
sets. This means we hit the BUG() in __flush_tlb_{206,300} when using
cpufeatures-style configuration.
This change passes TLB_INVAL_SCOPE_GLOBAL instead.
Fixes: 41d0c2ecde ("powerpc/powernv: Fix local TLB flush for boot and MCE on POWER9")
Cc: stable@vger.kernel.org # v4.13+
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In the recent commit d8bd9f3f09 ("powerpc: Handle MCE on POWER9 with
only DSISR bit 30 set") I screwed up the bit number. It should be bit
25 (IBM bit 38).
Fixes: d8bd9f3f09 ("powerpc: Handle MCE on POWER9 with only DSISR bit 30 set")
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
POWER9 DD2.1 and earlier has an issue where some cache inhibited
vector load will return bad data. The workaround is two part, one
firmware/microcode part triggers HMI interrupts when hitting such
loads, the other part is this patch which then emulates the
instructions in Linux.
The affected instructions are limited to lxvd2x, lxvw4x, lxvb16x and
lxvh8x.
When an instruction triggers the HMI, all threads in the core will be
sent to the HMI handler, not just the one running the vector load.
In general, these spurious HMIs are detected by the emulation code and
we just return back to the running process. Unfortunately, if a
spurious interrupt occurs on a vector load that's to normal memory we
have no way to detect that it's spurious (unless we walk the page
tables, which is very expensive). In this case we emulate the load but
we need do so using a vector load itself to ensure 128bit atomicity is
preserved.
Some additional debugfs emulated instruction counters are added also.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Switch CONFIG_PPC_BOOK3S_64 to CONFIG_VSX to unbreak the build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On POWER9 DD2.1 and below, it's possible for a paste instruction to
cause a Machine Check Exception (MCE) where only DSISR bit 30 (IBM 33)
is set. This will result in the MCE handler seeing an unknown event,
which triggers linux to crash.
We change this by detecting unknown events caused by load/stores in
the MCE handler and marking them as handled so that we no longer
crash.
An MCE that occurs like this is spurious, so we don't need to do
anything in terms of servicing it. If there is something that needs to
be serviced, the CPU will raise the MCE again with the correct DSISR
so that it can be serviced properly.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com
Acked-by: Balbir Singh <bsingharora@gmail.com>
[mpe: Expand comment with details from change log, use normal bit #s]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Remove the post_init callback which is only used
by powernv, we can just call it explicitly from
the powernv code.
This partially kills the ability to "disable" eeh at
runtime via debugfs as this was calling that same
callback again, but this is both unused and broken
in several ways. If we want to revive it, we need
to create a dedicated enable/disable callback on the
backend that does the right thing.
Let the bulk of eeh initialize normally at
core_initcall() like it does on pseries by removing
the hack in eeh_init() that delays it.
Instead we make sure our eeh->probe cleanly bails
out of the PEs haven't been created yet and we force
a re-probe where we used to call eeh_init() again.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Otherwise we end up not yet having computed the right diag data size
on powernv where EEH initialization is delayed, thus causing memory
corruption later on when calling OPAL.
Fixes: 5cb1f8fddd ("powerpc/powernv/pci: Dynamically allocate PHB diag data")
Cc: stable@vger.kernel.org # v4.13+
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Optprobes depended on an updated regs->nip from analyse_instr() to
identify the location to branch back from the optprobes trampoline.
However, since commit 3cdfcbfd32 ("powerpc: Change analyse_instr so
it doesn't modify *regs"), analyse_instr() doesn't update the registers
anymore. Due to this, we end up branching back from the optprobes
trampoline to the same branch into the trampoline resulting in a loop.
Fix this by calling out to emulate_update_regs() before using the nip.
Additionally, explicitly compare the return value from analyse_instr()
to 1, rather than just checking for !0 so as to guard against any
future changes to analyse_instr() that may result in -1 being returned
in more scenarios.
Fixes: 3cdfcbfd32 ("powerpc: Change analyse_instr so it doesn't modify *regs")
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit cd63f3c ("powerpc/tm: Fix saving of TM SPRs in core dump")
added code to access TM SPRs in flush_tmregs_to_thread(). However
flush_tmregs_to_thread() does not check if TM feature is available on
CPU before trying to access TM SPRs in order to copy live state to
thread structures. flush_tmregs_to_thread() is indeed guarded by
CONFIG_PPC_TRANSACTIONAL_MEM but it might be the case that kernel
was compiled with CONFIG_PPC_TRANSACTIONAL_MEM enabled and ran on
a CPU without TM feature available, thus rendering the execution
of TM instructions that are treated by the CPU as illegal instructions.
The fix is just to add proper checking in flush_tmregs_to_thread()
if CPU has the TM feature before accessing any TM-specific resource,
returning immediately if TM is no available on the CPU. Adding
that checking in flush_tmregs_to_thread() instead of in places
where it is called, like in vsr_get() and vsr_set(), is better because
avoids the same problem cropping up elsewhere.
Cc: stable@vger.kernel.org # v4.13+
Fixes: cd63f3c ("powerpc/tm: Fix saving of TM SPRs in core dump")
Signed-off-by: Gustavo Romero <gromero@linux.vnet.ibm.com>
Reviewed-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Just one fix, for the handling of alignment interrupts on dcbz instructions.
Thanks to:
Paul Mackerras, Christian Zigotzky, Michal Sojka.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJZu5HyAAoJEFHr6jzI4aWA+yoP+gOKF4obMWnvSJ5YWZZWQee/
T++Uhps7AF0G2y99rBVUKzbXzweDPpBrNQW5cFlF51WinNCA4KJ4E08xoNSrnlKl
ebc6Tn2cHbpWgydSRxWDig/Dp90U95toKp14irO2YvOVWoTgYBX7gOHhkMcpxdLH
MeVe6IAatOSNNNpJK9gLrNDLZaDhTGqXf64VIaw44llf0R9avNFhoS+8F94bllm2
VSsXizP1HZxWFcxJ1W/F54zkOvOkSXxjmd9Lf0IOxERPAshZBO7MfDaXF9MaYXwL
4AU5GZc2UmWIJpwmSQqRQPpLOPFvsJkhNnXeUb/8kQAG2W+HfZy6BvHpVh1lajm7
j+tyIYe+E7Jt63tQxyxSKpGnu20po78iimciYhl77A2RksGl9Tw5sHWj8gok30TI
pQFIRLPAO38MnnD3OrsboAnBrruS0Og+lkSWan7nV5n8ybHflvimjG+MfkNx9vzL
as7yJThQ8D8JBRAyaIHG3aAUPgBDxpZwqhDU3rac0qZChOc9DH4X7YFHR1UzGH8n
J8h9e7IsKGfvm4O9j4t3/nq0Qwh0LvpAlfu1/tPdJ0KXc9BnQTVoCpkFzhA+IFV6
JY2y4F1MfW8ZWS20iPHLlYEXYcTtcTq5b47JYQx97TgSp82ShXSLwbUxXNwSraDv
umUvJle++yyRm927njki
=F9yZ
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fix from Michael Ellerman:
"Just one fix, for the handling of alignment interrupts on dcbz
instructions.
Thanks to Paul Mackerras, Christian Zigotzky, Michal Sojka"
* tag 'powerpc-4.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc: Fix handling of alignment interrupt on dcbz instruction
This fixes the emulation of the dcbz instruction in the alignment
interrupt handler. The error was that we were comparing just the
instruction type field of op.type rather than the whole thing,
and therefore the comparison "type != CACHEOP + DCBZ" was always
true.
Fixes: 31bfdb036f ("powerpc: Use instruction emulation infrastructure to handle alignment faults")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Tested-by: Michal Sojka <sojkam1@fel.cvut.cz>
Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
All watchdog thread related functions are delegated to the smpboot thread
infrastructure, which handles serialization against CPU hotplug correctly.
The sysctl interface is completely decoupled from anything which requires
CPU hotplug protection.
No need to protect the sysctl writes against cpu hotplug anymore. Remove it
and add the now required protection to the powerpc arch_nmi_watchdog
implementation.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Don Zickus <dzickus@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/20170912194148.418497420@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Both the perf reconfiguration and the powerpc watchdog_nmi_reconfigure()
need to be done in two steps.
1) Stop all NMIs
2) Read the new parameters and start NMIs
Right now watchdog_nmi_reconfigure() is a combination of both. To allow a
clean reconfiguration add a 'run' argument and split the functionality in
powerpc.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Don Zickus <dzickus@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/20170912194147.862865570@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This interface has several issues:
- It's causing recursive locking of the hotplug lock.
- It's complete overkill to teardown all threads and then recreate them
The same can be achieved with the simple hardlockup_detector_perf_stop /
restart() interfaces. The abuse from the busy looping poweroff() loop of
PARISC has been solved as well.
Remove the cruft.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Don Zickus <dzickus@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Link: http://lkml.kernel.org/r/20170912194146.487537732@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
GFP_TEMPORARY was introduced by commit e12ba74d8f ("Group short-lived
and reclaimable kernel allocations") along with __GFP_RECLAIMABLE. It's
primary motivation was to allow users to tell that an allocation is
short lived and so the allocator can try to place such allocations close
together and prevent long term fragmentation. As much as this sounds
like a reasonable semantic it becomes much less clear when to use the
highlevel GFP_TEMPORARY allocation flag. How long is temporary? Can the
context holding that memory sleep? Can it take locks? It seems there is
no good answer for those questions.
The current implementation of GFP_TEMPORARY is basically GFP_KERNEL |
__GFP_RECLAIMABLE which in itself is tricky because basically none of
the existing caller provide a way to reclaim the allocated memory. So
this is rather misleading and hard to evaluate for any benefits.
I have checked some random users and none of them has added the flag
with a specific justification. I suspect most of them just copied from
other existing users and others just thought it might be a good idea to
use without any measuring. This suggests that GFP_TEMPORARY just
motivates for cargo cult usage without any reasoning.
I believe that our gfp flags are quite complex already and especially
those with highlevel semantic should be clearly defined to prevent from
confusion and abuse. Therefore I propose dropping GFP_TEMPORARY and
replace all existing users to simply use GFP_KERNEL. Please note that
SLAB users with shrinkers will still get __GFP_RECLAIMABLE heuristic and
so they will be placed properly for memory fragmentation prevention.
I can see reasons we might want some gfp flag to reflect shorterm
allocations but I propose starting from a clear semantic definition and
only then add users with proper justification.
This was been brought up before LSF this year by Matthew [1] and it
turned out that GFP_TEMPORARY really doesn't have a clear semantic. It
seems to be a heuristic without any measured advantage for most (if not
all) its current users. The follow up discussion has revealed that
opinions on what might be temporary allocation differ a lot between
developers. So rather than trying to tweak existing users into a
semantic which they haven't expected I propose to simply remove the flag
and start from scratch if we really need a semantic for short term
allocations.
[1] http://lkml.kernel.org/r/20170118054945.GD18349@bombadil.infradead.org
[akpm@linux-foundation.org: fix typo]
[akpm@linux-foundation.org: coding-style fixes]
[sfr@canb.auug.org.au: drm/i915: fix up]
Link: http://lkml.kernel.org/r/20170816144703.378d4f4d@canb.auug.org.au
Link: http://lkml.kernel.org/r/20170728091904.14627-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull namespace updates from Eric Biederman:
"Life has been busy and I have not gotten half as much done this round
as I would have liked. I delayed it so that a minor conflict
resolution with the mips tree could spend a little time in linux-next
before I sent this pull request.
This includes two long delayed user namespace changes from Kirill
Tkhai. It also includes a very useful change from Serge Hallyn that
allows the security capability attribute to be used inside of user
namespaces. The practical effect of this is people can now untar
tarballs and install rpms in user namespaces. It had been suggested to
generalize this and encode some of the namespace information
information in the xattr name. Upon close inspection that makes the
things that should be hard easy and the things that should be easy
more expensive.
Then there is my bugfix/cleanup for signal injection that removes the
magic encoding of the siginfo union member from the kernel internal
si_code. The mips folks reported the case where I had used FPE_FIXME
me is impossible so I have remove FPE_FIXME from mips, while at the
same time including a return statement in that case to keep gcc from
complaining about unitialized variables.
I almost finished the work to get make copy_siginfo_to_user a trivial
copy to user. The code is available at:
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git neuter-copy_siginfo_to_user-v3
But I did not have time/energy to get the code posted and reviewed
before the merge window opened.
I was able to see that the security excuse for just copying fields
that we know are initialized doesn't work in practice there are buggy
initializations that don't initialize the proper fields in siginfo. So
we still sometimes copy unitialized data to userspace"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
Introduce v3 namespaced file capabilities
mips/signal: In force_fcr31_sig return in the impossible case
signal: Remove kernel interal si_code magic
fcntl: Don't use ambiguous SIG_POLL si_codes
prctl: Allow local CAP_SYS_ADMIN changing exe_file
security: Use user_namespace::level to avoid redundant iterations in cap_capable()
userns,pidns: Verify the userns for new pid namespaces
signal/testing: Don't look for __SI_FAULT in userspace
signal/mips: Document a conflict with SI_USER with SIGFPE
signal/sparc: Document a conflict with SI_USER with SIGFPE
signal/ia64: Document a conflict with SI_USER with SIGFPE
signal/alpha: Document a conflict with SI_USER for SIGTRAP
First, number of CPUs can't be negative number.
Second, different signnnedness leads to suboptimal code in the following
cases:
1)
kmalloc(nr_cpu_ids * sizeof(X));
"int" has to be sign extended to size_t.
2)
while (loff_t *pos < nr_cpu_ids)
MOVSXD is 1 byte longed than the same MOV.
Other cases exist as well. Basically compiler is told that nr_cpu_ids
can't be negative which can't be deduced if it is "int".
Code savings on allyesconfig kernel: -3KB
add/remove: 0/0 grow/shrink: 25/264 up/down: 261/-3631 (-3370)
function old new delta
coretemp_cpu_online 450 512 +62
rcu_init_one 1234 1272 +38
pci_device_probe 374 399 +25
...
pgdat_reclaimable_pages 628 556 -72
select_fallback_rq 446 369 -77
task_numa_find_cpu 1923 1807 -116
Link: http://lkml.kernel.org/r/20170819114959.GA30580@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On POWER9, the Client Architecture Support (CAS) negotiation process
determines whether the guest operates in XIVE Legacy compatibility or
in XIVE exploitation mode. Now that we have initial guest support for
the XIVE interrupt controller, let's inform the hypervisor what we can
do.
The platform advertises the XIVE Exploitation Mode support using the
property "ibm,arch-vec-5-platform-support-vec-5", byte 23 bits 0-1 :
- 0b00 XIVE legacy mode Only
- 0b01 XIVE exploitation mode Only
- 0b10 XIVE legacy or exploitation mode
The OS asks for XIVE Exploitation Mode support using the property
"ibm,architecture-vec-5", byte 23 bits 0-1:
- 0b00 XIVE legacy mode Only
- 0b01 XIVE exploitation mode Only
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Use DEVICE_ATTR_RW for read-write attributes. This simplifies the
source code, improves readbility, and reduces the chance of
inconsistencies.
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Omit an extra message for a memory allocation failure in
eeh_dev_init().
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
[mpe: Do not drop the message that can happen at runtime and lead to
an event not being handled]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
memset() is patched after initialisation to activate the
optimised part which uses cache instructions.
Today we have a 'b 2f' to skip the optimised patch, which then gets
replaced by a NOP, implying a useless cycle consumption.
As we have a 'bne 2f' just before, we could use that instruction
for the live patching, hence removing the need to have a
dedicated 'b 2f' to be replaced by a NOP.
This patch changes the 'bne 2f' by a 'b 2f'. During init, that
'b 2f' is then replaced by 'bne 2f'
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This replaces almost all of the instruction emulation code in
fix_alignment() with calls to analyse_instr(), emulate_loadstore()
and emulate_dcbz(). The only emulation code left is the SPE
emulation code; analyse_instr() etc. do not handle SPE instructions
at present.
One result of this is that we can now handle alignment faults on
all the new VSX load and store instructions that were added in POWER9.
VSX loads/stores will take alignment faults for unaligned accesses
to cache-inhibited memory.
Another effect is that we no longer rely on the DAR and DSISR values
set by the processor.
With this, we now need to include the instruction emulation code
unconditionally.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anton noticed that if we fault part way through emulating an unaligned
instruction, we don't update the DAR to reflect that.
The DAR value is eventually reported back to userspace as the address
in the SEGV signal, and if userspace is using that value to demand
fault then it can be confused by us not setting the value correctly.
This patch is ugly as hell, but is intended to be the minimal fix and
back ports easily.
Cc: stable@vger.kernel.org
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
In previous generations of Power processors each core had a private L2
cache. The Power 9 processor has a slightly different design where the
L2 cache is shared among pairs of cores rather than being completely
private.
Making the scheduler aware of this cache sharing allows the scheduler to
make better migration decisions. For example, if two CPU heavy tasks
share a core then one task can be migrated to the paired core to improve
throughput. Under the existing three level topology the task could be
migrated to any core on the same chip, while with the new topology it
would be preferentially migrated to the paired core so it remains
cache-hot.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We want to add an extra level to the CPU scheduler topology to account
for cores which share a cache. To do this we need to build a cpumask
for each CPU that indicates which CPUs share this cache to use as an
input to the scheduler.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The CPU scheduler topology is constructed from a number of per-cpu
cpumasks which describe which sets of logical CPUs are related in some
fashion. Current code that handles constructing these masks when CPUs
are hot(un)plugged can be simplified a bit by exploiting the fact that
the scheduler requires higher levels of the toplogy (e.g package level
groupings) to be supersets of the lower levels (e.g. threas in a core).
This patch reworks the cpumask construction to be simpler and easier to
extend with extra topology levels.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
[mpe: Fix CONFIG_HOTPLUG_CPU=n build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When building the CPU scheduler topology the kernel uses the ibm,chipid
property from the devicetree to group logical CPUs. Currently the DT
search for this property is open-coded in smp.c and this functionality
is a duplication of what's in cpu_to_chip_id() already. This patch
removes the existing search in favor of that.
It's worth mentioning that the semantics of the search are different
in cpu_to_chip_id(). When there is no ibm,chipid in the CPUs node it
will also search /cpus and / for the property, but this should not
effect the output topology.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
.llong is an undocumented PPC specific directive. The generic
equivalent is .quad, but even better (because it's self describing) is
.8byte.
Convert all .llong directives to .8byte.
Signed-off-by: Tobin C. Harding <me@tobin.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
For a PCI device it's pci_dn can be retrieved from
pdev->dev.archdata.firmware_data, PCI_DN(devnode), or parent's list.
Thus, we should just use the existing function pci_get_pdn_by_devfn
to get the pci_dn.
Signed-off-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
Reviewed-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The check_req() helper uses pci_get_pdn() to get an OF node pointer.
pci_get_pdn() returns a pci_dn pointer which either:
1) from the OF node returned by pci_device_to_OF_node();
2) from the parent child_list where entries don't have OF node pointers.
Since check_req() does not care about 2), it can call
pci_device_to_OF_node() directly, hence the change.
The find_pe_dn() helper uses embedded pci_dn to get an OF node which is
also stored in edev->pdev so let's take a shortcut and call
pci_device_to_OF_node() directly.
With these 2 changes, we can finally get rid of the OF node back pointer.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The pci_dn struct caches a OF device node pointer in order to access
the "ibm,loc-code" property when EEH is recovering.
However, when this happens in eeh_dev_check_failure(), we also have
a pci_dev pointer which should have a valid pointer to the device node
when pci_dn has one (both pointers are not NULL for physical functions
and are NULL for virtual functions).
This changes pci_remove_device_node_info() to look for a parent of
the node being removed, just like pci_add_device_node_info() does when it
references the parent node.
This is the first step to get rid of pci_dn::node.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The eeh_dev struct hold a config space address of an associated node
and the very same address is also stored in the pci_dn struct which
is always present during the eeh_dev lifetime.
This uses bus:devfn directly from pci_dn instead of cached and packed
config_addr.
Since config_addr is made from device's bus:dev.fn, there is no point
in keeping it in the debugfs either so remove that too.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The eeh_dev struct already holds a pointer to pci_dn which it does not
exist without and pci_dn itself holds the very same pointer so just
use it.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
arch/powerpc/kernel/eeh_dev.c:57 is the only legit place where edev
is allocated; other 2 places allocate it on stack and in the heap for
a very short period of time to use eeh_pe_get() as takes edev.
This changes eeh_pe_get() to receive required parameters explicitly.
This removes unnecessary temporary allocation of edev.
This uses the "pe_no" name instead of the "pe_config_addr" name as
it actually is a PE number and not a config space address as it seemed.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Acked-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
pdev is always NULL, remove it.
To make checkpatch.pl happy, this also removes the "out of memory"
message.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Acked-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Use nmi_enter similarly to system reset interrupts. This uses NMI
printk NMI buffers and turns off various debugging facilities that
helps avoid tripping on ourselves or other CPUs.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
There are quite a few machine check exceptions that can be caused by
kernel bugs. To make debugging easier, use the kernel crash path in
cases of synchronous machine checks that occur in kernel mode, if that
would not result in the machine going straight to panic or crash dump.
There is a downside here that die()ing the process in kernel mode can
still leave the system unstable. panic_on_oops will always force the
system to fail-stop, so systems where that behaviour is important will
still do the right thing.
As a test, when triggering an i-side 0111b error (ifetch from foreign
address) in kernel mode process context on POWER9, the kernel currently
dies quickly like this:
Severe Machine check interrupt [Not recovered]
NIP [ffff000000000000]: 0xffff000000000000
Initiator: CPU
Error type: Real address [Instruction fetch (foreign)]
[ 127.426651616,0] OPAL: Reboot requested due to Platform error.
Effective[ 127.426693712,3] OPAL: Reboot requested due to Platform error. address: ffff000000000000
opal: Reboot type 1 not supported
Kernel panic - not syncing: PowerNV Unrecovered Machine Check
CPU: 56 PID: 4425 Comm: syscall Tainted: G M 4.12.0-rc1-13857-ga4700a261072-dirty #35
Call Trace:
[ 128.017988928,4] IPMI: BUG: Dropping ESEL on the floor due to
buggy/mising code in OPAL for this BMC
Rebooting in 10 seconds..
Trying to free IRQ 496 from IRQ context!
After this patch, the process is killed and the kernel continues with
this message, which gives enough information to identify the offending
branch (i.e., with CFAR):
Severe Machine check interrupt [Not recovered]
NIP [ffff000000000000]: 0xffff000000000000
Initiator: CPU
Error type: Real address [Instruction fetch (foreign)]
Effective address: ffff000000000000
Oops: Machine check, sig: 7 [#1]
SMP NR_CPUS=2048
NUMA
PowerNV
Modules linked in: iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 ...
CPU: 22 PID: 4436 Comm: syscall Tainted: G M 4.12.0-rc1-13857-ga4700a261072-dirty #36
task: c000000932300000 task.stack: c000000932380000
NIP: ffff000000000000 LR: 00000000217706a4 CTR: ffff000000000000
REGS: c00000000fc8fd80 TRAP: 0200 Tainted: G M (4.12.0-rc1-13857-ga4700a261072-dirty)
MSR: 90000000001c1003 <SF,HV,ME,RI,LE>
CR: 24000484 XER: 20000000
CFAR: c000000000004c80 DAR: 0000000021770a90 DSISR: 0a000000 SOFTE: 1
GPR00: 0000000000001ebe 00007fffce4818b0 0000000021797f00 0000000000000000
GPR04: 00007fff8007ac24 0000000044000484 0000000000004000 00007fff801405e8
GPR08: 900000000280f033 0000000024000484 0000000000000000 0000000000000030
GPR12: 9000000000001003 00007fff801bc370 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR24: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR28: 00007fff801b0000 0000000000000000 00000000217707a0 00007fffce481918
NIP [ffff000000000000] 0xffff000000000000
LR [00000000217706a4] 0x217706a4
Call Trace:
Instruction dump:
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
A system reset is a request to crash / debug the system rather than
necessarily caused by encountering a BUG. So there is no need to
serialize all CPUs behind the die lock, adding taints to all
subsequent traces beyond the first, breaking console locks, etc.
The system reset is NMI context which has its own printk buffers to
prevent output being interleaved. Then it's better to have all
secondaries print out their debug as quickly as possible and the
primary will flush out all printk buffers during panic().
So remove the 0x100 path from die, and move it into system_reset. Name
the crash/dump reasons "System Reset".
This gives "not tained" traces when crashing an untainted kernel. It
also gives the panic reason as "System Reset" as opposed to "Fatal
exception in interrupt" (or "die oops" for fadump).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If fadump is not registered, and no other crash or debug handlers are
registered, the powerpc panic handler stops the guest before the
generic panic code can push out debug information to the console.
Currently, system reset injection causes the guest to silently stop.
Stop calling ppc_md.panic in the panic notifier. crash_fadump already
does rtas_os_term() to terminate the guest if fadump is registered.
Remove ppc_md.panic. Move fadump panic notifier into fadump code.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This fixes a couple more bits of fallout from the new hard lockup watchdog
patch.
It restores the required hw_nmi_get_sample_period() function for the
perf watchdog, and removes some function declarations on 64e that are only
defined for 64s. This fixes the 64e build when the hardlockup detector is
enabled.
It restores the default behaviour of disabling the perf watchdog, and also
fixes disabling the 64s watchdog when running as a guest.
Fixes: 2104180a53 ("powerpc/64s: implement arch-specific hardlockup watchdog")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Radix MMU does not take SLB or TLB interrupts when accessing kernel
linear address. Remove this restriction for radix mode.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The hardware can execute stop in any context, and KVM does not
require real mode because siblings do not share MMU state. This
saves a switch to real-mode when going idle.
Acked-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
There are no longer any callers of IDLE_STATE_ENTER_SEQ, all callers
use IDLE_STATE_ENTER_SEQ_NORET. So drop the former.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Split out of larger patch, write change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We don't need to use IDLE_STATE_ENTER_SEQ_NORET on Power9.
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Split out of larger patch]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This macro is only used in idle_book3s.S, move it in there and add a
more descriptive comment.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Split out of larger patch and write change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
POWER9 CPUs have independent MMU contexts per thread, so KVM does not
need to quiesce secondary threads, so the hwthread_req/hwthread_state
protocol does not have to be used. So patch it away on POWER9, and patch
away the branch from the Linux idle wakeup to kvm_start_guest that is
never used.
Add a warning and error out of kvmppc_grab_hwthread in case it is ever
called on POWER9.
This avoids a hwsync in the idle wakeup path on POWER9.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
[mpe: Use WARN(...) instead of WARN_ON()/pr_err(...)]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch updates the machine check handler of Linux kernel to
handle the e6500 architecture case. In e6500 core, L1 Data Cache Write
Shadow Mode (DCWS) register is not implemented but L1 data cache always
runs in write shadow mode. So, on L1 data cache parity errors, hardware
will automatically invalidate the data cache but will still log a
machine check interrupt.
Signed-off-by: Ronak Desai <ronak.desai@rockwellcollins.com>
Signed-off-by: Matthew Weber <matthew.weber@rockwellcollins.com>
Signed-off-by: Scott Wood <oss@buserror.net>