Architecturally we should apply a 0x400 offset for these. Not doing
it will break future HW implementations.
The offset of 0 is supposed to remain for "triggers" though not all
sources support both trigger and store EOI, and in P9 specifically,
some sources will treat 0 as a store EOI. But future chips will not.
So this makes us use the properly architected offset which should work
always.
Fixes: 243e25112d ("powerpc/xive: Native exploitation of the XIVE interrupt controller")
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The ISA v3.0B copy-paste facility only requires cpabort when switching
to a process that has foreign real addresses mapped (direct access to
accelerators), to clear a potential copy buffer filled by a previous
thread. There is no accelerator driver implemented yet, so cpabort can
be removed. It can be be re-added when a driver is implemented.
POWER9 DD1 requires the copy buffer to always be cleared on context
switch, but if accelerators are not in use, then an unpaired copy from
a dummy region is sufficient to clear data out of the copy buffer.
This increases context switch performance by about 5% on POWER9.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The sync (aka. hwsync, aka. heavyweight sync) in the context switch
code to prevent MMIO access being reordered from the point of view of
a single process if it gets migrated to a different CPU is not
required because there is an hwsync performed earlier in the context
switch path.
Comment this so it's clear enough if anything changes on the scheduler
or the powerpc sides. Remove the hwsync from _switch.
This improves context switch performance by 2-3% on POWER8.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The IBM vNIC protocol provides support for the user to initiate
a failover from the client LPAR in case the current backing infrastructure
is deemed inadequate or in an error state.
Support for two H_VIOCTL sub-commands for vNIC devices are required
to implement this function. These commands are H_GET_SESSION_TOKEN
and H_SESSION_ERR_DETECTED.
"[H_GET_SESSION_TOKEN] is used to obtain a session token from a VNIC client
adapter. This token is opaque to the caller and is intended to be used in
tandem with the SESSION_ERROR_DETECTED vioctl subfunction."
"[H_SESSION_ERR_DETECTED] is used to report that the currently active
backing device for a VNIC client adapter is behaving poorly, and that
the hypervisor should attempt to fail over to a different backing device,
if one is available."
To provide tools access to this functionality the vNIC driver creates a
sysfs file that, when written to, will send a request to pHyp to failover
to a different backing device.
Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When opening the slave end of a PTY, it is not possible for userspace to
safely ensure that /dev/pts/$num is actually a slave (in cases where the
mount namespace in which devpts was mounted is controlled by an
untrusted process). In addition, there are several unresolvable
race conditions if userspace were to attempt to detect attacks through
stat(2) and other similar methods [in addition it is not clear how
userspace could detect attacks involving FUSE].
Resolve this by providing an interface for userpace to safely open the
"peer" end of a PTY file descriptor by using the dentry cached by
devpts. Since it is not possible to have an open master PTY without
having its slave exposed in /dev/pts this interface is safe. This
interface currently does not provide a way to get the master pty (since
it is not clear whether such an interface is safe or even useful).
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Valentin Rothberg <vrothberg@suse.com>
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Supporting 512TB requires us to do a order 3 allocation for level 1 page
table (pgd). This results in page allocation failures with certain workloads.
For now limit 4k linux page size config to 64TB.
Fixes: f6eedbba7a ("powerpc/mm/hash: Increase VA range to 128TB")
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In commit 8c27226119 ("powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"), we
switched to the generic implementation of cpu_to_node(), which uses a percpu
variable to hold the NUMA node for each CPU.
Unfortunately we neglected to notice that we use cpu_to_node() in the allocation
of our percpu areas, leading to a chicken and egg problem. In practice what
happens is when we are setting up the percpu areas, cpu_to_node() reports that
all CPUs are on node 0, so we allocate all percpu areas on node 0.
This is visible in the dmesg output, as all pcpu allocs being in group 0:
pcpu-alloc: [0] 00 01 02 03 [0] 04 05 06 07
pcpu-alloc: [0] 08 09 10 11 [0] 12 13 14 15
pcpu-alloc: [0] 16 17 18 19 [0] 20 21 22 23
pcpu-alloc: [0] 24 25 26 27 [0] 28 29 30 31
pcpu-alloc: [0] 32 33 34 35 [0] 36 37 38 39
pcpu-alloc: [0] 40 41 42 43 [0] 44 45 46 47
To fix it we need an early_cpu_to_node() which can run prior to percpu being
setup. We already have the numa_cpu_lookup_table we can use, so just plumb it
in. With the patch dmesg output shows two groups, 0 and 1:
pcpu-alloc: [0] 00 01 02 03 [0] 04 05 06 07
pcpu-alloc: [0] 08 09 10 11 [0] 12 13 14 15
pcpu-alloc: [0] 16 17 18 19 [0] 20 21 22 23
pcpu-alloc: [1] 24 25 26 27 [1] 28 29 30 31
pcpu-alloc: [1] 32 33 34 35 [1] 36 37 38 39
pcpu-alloc: [1] 40 41 42 43 [1] 44 45 46 47
We can also check the data_offset in the paca of various CPUs, with the fix we
see:
CPU 0: data_offset = 0x0ffe8b0000
CPU 24: data_offset = 0x1ffe5b0000
And we can see from dmesg that CPU 24 has an allocation on node 1:
node 0: [mem 0x0000000000000000-0x0000000fffffffff]
node 1: [mem 0x0000001000000000-0x0000001fffffffff]
Cc: stable@vger.kernel.org # v3.16+
Fixes: 8c27226119 ("powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The i-side 0111b machine check, which is "Instruction Fetch to foreign
address space", was missed by 7b9f71f974 ("powerpc/64s: POWER9 machine
check handler").
The POWER9 processor core considers host real addresses with a
nonzero value in RA(8:12) as foreign address space, accessible only
by the copy and paste instructions. The copy and paste instruction
pair can be used to invoke the Nest accelerators via the Virtual
Accelerator Switchboard (VAS).
It is an error for any regular load/store or ifetch to go to a foreign
addresses. When relocation is on, this causes an MMU exception. When
relocation is off, a machine check exception. It is possible to trigger
this machine check by branching to a foreign address with MSR[IR]=0.
Fixes: 7b9f71f974 ("powerpc/64s: POWER9 machine check handler")
Reported-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
These two functions implement the same semantics, so unify their naming so we
can share code that calls them. The longer name is more descriptive so use it.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add support in pte_alloc_one() and pgd_alloc() by
passing __GFP_ACCOUNT in the flags
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Introduce a helper pgtable_gfp_flags() which
just returns the current gfp flags and adds
__GFP_ACCOUNT to account for page table allocation.
The generic helper is added to include/asm/pgalloc.h
and has two variants - WARNING ugly bits ahead
1. If the header is included from a module, no check
for mm == &init_mm is done, since init_mm is not
exported
2. For kernel includes, the check is done and required
see (3e79ec7 arch: x86: charge page tables to kmemcg)
The fundamental assumption is that no module should be
doing pgd/pud/pmd and pte alloc's on behalf of init_mm
directly.
NOTE: This adds an overhead to pmd/pud/pgd allocations
similar to x86. The other alternative was to implement
pmd_alloc_kernel/pud_alloc_kernel and pgd_alloc_kernel
with their offset variants.
For 4k page size, pte_alloc_one no longer calls
pte_alloc_one_kernel.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Marc Zyngier suggested that we define the arch specific VCPU request
base, rather than requiring each arch to remember to start from 8.
That suggestion, along with Radim Krcmar's recent VCPU request flag
addition, snowballed into defining something of an arch VCPU request
defining API.
No functional change.
(Looks like x86 is running out of arch VCPU request bits. Maybe
someday we'll need to extend to 64.)
Signed-off-by: Andrew Jones <drjones@redhat.com>
Acked-by: Christoffer Dall <cdall@linaro.org>
Signed-off-by: Christoffer Dall <cdall@linaro.org>
By default, 5% of system RAM is reserved for preserving boot memory.
Alternatively, a user can specify the amount of memory to reserve.
See Documentation/powerpc/firmware-assisted-dump.txt for details. In
addition to the memory reserved for preserving boot memory, some more
memory is reserved, to save HPTE region, CPU state data and ELF core
headers.
Memory Reservation during first kernel looks like below:
Low memory Top of memory
0 boot memory size |
| | |<--Reserved dump area -->|
V V | Permanent Reservation V
+-----------+----------/ /----------+---+----+-----------+----+
| | |CPU|HPTE| DUMP |ELF |
+-----------+----------/ /----------+---+----+-----------+----+
| ^
| |
\ /
-------------------------------------------
Boot memory content gets transferred to
reserved area by firmware at the time of
crash
This implicitly means that the sum of the sizes of boot memory, CPU
state data, HPTE region, DUMP preserving area and ELF core headers
can't be greater than the total memory size. But currently, a user is
allowed to specify any value as boot memory size. So, the above rule
is violated when a boot memory size around 50% of the total available
memory is specified. As the kernel is not handling this currently, it
may lead to undefined behavior. Fix it by setting an upper limit for
boot memory size to 25% of the total available memory. Also, instead
of using memblock_end_of_DRAM(), which doesn't take the holes, if any,
in the memory layout into account, use memblock_phys_mem_size() to
calculate the percentage of total available memory.
Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
With the ffz() function as defined in arch/powerpc/include/asm/bitops.h
GCC will not optimise the code in case of constant parameter.
This patch replaces ffz() by the generic function.
The generic ffz(x) expects to never be called with ~x == 0
as written in the comment in include/asm-generic/bitops/ffz.h
The only user of ffz() within arch/powerpc/ is
platforms/512x/mpc5121_ads_cpld.c, which checks if x is not 0xff
For non constant calls, the generated code is doing the same:
unsigned long testffz(unsigned long x)
{
return ffz(x);
}
On PPC32, before the patch:
00000018 <testffz>:
18: 7c 63 18 f9 not. r3,r3
1c: 40 82 00 0c bne 28 <testffz+0x10>
20: 38 60 00 20 li r3,32
24: 4e 80 00 20 blr
28: 7d 23 00 d0 neg r9,r3
2c: 7d 23 18 38 and r3,r9,r3
30: 7c 63 00 34 cntlzw r3,r3
34: 20 63 00 1f subfic r3,r3,31
38: 4e 80 00 20 blr
On PPC32, after the patch:
00000018 <testffz>:
18: 39 23 00 01 addi r9,r3,1
1c: 7d 23 18 78 andc r3,r9,r3
20: 7c 63 00 34 cntlzw r3,r3
24: 20 63 00 1f subfic r3,r3,31
28: 4e 80 00 20 blr
On PPC64, before the patch:
0000000000000030 <.testffz>:
30: 7c 60 18 f9 not. r0,r3
34: 38 60 00 40 li r3,64
38: 4d 82 00 20 beqlr
3c: 7c 60 00 d0 neg r3,r0
40: 7c 63 00 38 and r3,r3,r0
44: 7c 63 00 74 cntlzd r3,r3
48: 20 63 00 3f subfic r3,r3,63
4c: 7c 63 07 b4 extsw r3,r3
50: 4e 80 00 20 blr
On PPC64, after the patch:
0000000000000030 <.testffz>:
30: 38 03 00 01 addi r0,r3,1
34: 7c 03 18 78 andc r3,r0,r3
38: 7c 63 00 74 cntlzd r3,r3
3c: 20 63 00 3f subfic r3,r3,63
40: 4e 80 00 20 blr
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
asm-generic/socket.h already has an exception for the differences that
powerpc needs, so just include it after defining the differences.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
We are running low on CPU feature bits, so we only want to use them when
it's really necessary.
CPU_FTR_SUBCORE is only used in one place, and only in C, so we don't
need it in order to make asm patching work. It can only be set on
"Power8" CPUs, which in practice means POWER8, POWER8E and POWER8NVL.
There are no plans to implement it on future CPUs, but if there ever
were we could retrofit it then.
Although KVM uses subcores, it never looks at the CPU feature, it either
looks at the ISA level or the threads_per_subcore value.
So drop the CPU feature and do a PVR check instead. Drop the device tree
"subcore" feature as we no longer support doing anything with it, and we
will drop it from skiboot too.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Use a tool to check that the location of "fixed sections" are where
we expected them to be, which catches cases the linker script can't
(stubs being added to start of .text section), and which ends up
being neater.
Sample output:
ERROR: start_text address is c000000000008100, should be c000000000008000
ERROR: see comments in arch/powerpc/tools/head_check.sh
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fold in fix from Nick for 4.6 era toolchains]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Very large kernels may require linker stubs for branches from HEAD
text code. The linker may place these stubs before the HEAD text
sections, which breaks the assumption that HEAD text is located at 0
(or the .text section being located at 0x7000/0x8000 on Book3S
kernels).
Provide an option to create a small section just before the .text
section with an empty 256 - 4 bytes, and adjust the start of the .text
section to match. The linker will tend to put stubs in that section
and not break our relative-to-absolute offset assumptions.
This causes a small waste of space on common kernels, but allows large
kernels to build and boot. For now, it is an EXPERT config option,
defaulting to =n, but a reference is provided for it in the build-time
check for such breakage. This is good enough for allyesconfig and
custom users / hackers.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The arch version is identical except for comments and white space.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
These are completely obvious as all they do is include the asm-generic
versions.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On Power9 DD1 due to a hardware bug the Power-Saving Level Status
field (PLS) of the PSSCR for a thread waking up from a deep state can
under-report if some other thread in the core is in a shallow stop
state. The scenario in which this can manifest is as follows:
1) All the threads of the core are in deep stop.
2) One of the threads is woken up. The PLS for this thread will
correctly reflect that it is waking up from deep stop.
3) The thread that has woken up now executes a shallow stop.
4) When some other thread in the core is woken, its PLS will reflect
the shallow stop state.
Thus, the subsequent thread for which the PLS is under-reporting the
wakeup state will not restore the hypervisor resources.
Hence, on DD1 systems, use the Requested Level (RL) field as a
workaround to restore the contents of the hypervisor resources on the
wakeup from the stop state.
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
FIXUP_ENDIAN uses SRR[01] with MSR_RI=1, which gets corrupted if there
is an interleaving system reset or machine check interrupt.
Set MSR_RI=0 before setting SRRs. The rfid will restore MSR.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Providing "scv" support to userspace requires kernel support, so it
must be advertised as independently to the base ISA 3 instruction set.
The darn instruction relies on firmware enablement, so it has been
decided to split this out from the core ISA 3 feature as well.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
A definition was only provided for asm-generic/socket.h
using platforms, define it for the others as well
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
virt_addr_valid() is supposed to tell you if it's OK to call virt_to_page() on
an address. What this means in practice is that it should only return true for
addresses in the linear mapping which are backed by a valid PFN.
We are failing to properly check that the address is in the linear mapping,
because virt_to_pfn() will return a valid looking PFN for more or less any
address. That bug is actually caused by __pa(), used in virt_to_pfn().
eg: __pa(0xc000000000010000) = 0x10000 # Good
__pa(0xd000000000010000) = 0x10000 # Bad!
__pa(0x0000000000010000) = 0x10000 # Bad!
This started happening after commit bdbc29c19b ("powerpc: Work around gcc
miscompilation of __pa() on 64-bit") (Aug 2013), where we changed the definition
of __pa() to work around a GCC bug. Prior to that we subtracted PAGE_OFFSET from
the value passed to __pa(), meaning __pa() of a 0xd or 0x0 address would give
you something bogus back.
Until we can verify if that GCC bug is no longer an issue, or come up with
another solution, this commit does the minimal fix to make virt_addr_valid()
work, by explicitly checking that the address is in the linear mapping region.
Fixes: bdbc29c19b ("powerpc: Work around gcc miscompilation of __pa() on 64-bit")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Tested-by: Breno Leitao <breno.leitao@gmail.com>
On powerpc we can build the kernel with two different ABIs for mcount(), which
is used by ftrace. Kernels built with one ABI do not know how to load modules
built with the other ABI. The new style ABI is called "mprofile-kernel", for
want of a better name.
Currently if we build a module using the old style ABI, and the kernel with
mprofile-kernel, when we load the module we'll oops something like:
# insmod autofs4-no-mprofile-kernel.ko
ftrace-powerpc: Unexpected instruction f8810028 around bl _mcount
------------[ cut here ]------------
WARNING: CPU: 6 PID: 3759 at ../kernel/trace/ftrace.c:2024 ftrace_bug+0x2b8/0x3c0
CPU: 6 PID: 3759 Comm: insmod Not tainted 4.11.0-rc3-gcc-5.4.1-00017-g5a61ef74f269 #11
...
NIP [c0000000001eaa48] ftrace_bug+0x2b8/0x3c0
LR [c0000000001eaff8] ftrace_process_locs+0x4a8/0x590
Call Trace:
alloc_pages_current+0xc4/0x1d0 (unreliable)
ftrace_process_locs+0x4a8/0x590
load_module+0x1c8c/0x28f0
SyS_finit_module+0x110/0x140
system_call+0x38/0xfc
...
ftrace failed to modify
[<d000000002a31024>] 0xd000000002a31024
actual: 35:65:00:48
We can avoid this by including in the vermagic whether the kernel/module was
built with mprofile-kernel. Which results in:
# insmod autofs4-pg.ko
autofs4: version magic
'4.11.0-rc3-gcc-5.4.1-00017-g5a61ef74f269 SMP mod_unload modversions '
should be
'4.11.0-rc3-gcc-5.4.1-00017-g5a61ef74f269-dirty SMP mod_unload modversions mprofile-kernel'
insmod: ERROR: could not insert module autofs4-pg.ko: Invalid module format
Fixes: 8c50b72a3b ("powerpc/ftrace: Add Kconfig & Make glue for mprofile-kernel")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Acked-by: Jessica Yu <jeyu@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Highlights include:
- rework the Linux page table geometry to lower memory usage on 64-bit Book3S
(IBM chips) using the Hash MMU.
- support for a new device tree binding for discovering CPU features on future
firmwares.
- Freescale updates from Scott: "Includes a fix for a powerpc/next mm regression
on 64e, a fix for a kernel hang on 64e when using a debugger inside a
relocated kernel, a qman fix, and misc qe improvements."
Thanks to:
Christophe Leroy, Gavin Shan, Horia Geantă, LiuHailong, Nicholas Piggin, Roy
Pledge, Scott Wood, Valentin Longchamp.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJZFXjPAAoJEFHr6jzI4aWAgG4QAJoF7G5Txj0Du2I2/wQDkVq1
InJ+BNji0xnOrFpz2EcIIlbIwBeJbY9cSIbmKUEPQU4hxtQgI8Q5WNEl2btWq8xz
I0Ej3uc5obc9ltUdQoGxgXih/XDd8UN3fscSE2/SSuPY/A7JwAVZMsCEJ1tWdxpM
hx+R9wlaUT3I6jmQwj9gg6zuBdIOL5szvZXKh9ruPKNyZWbPmPSUwIqiyT0YHsiD
01OZsFYpdSH6Ka/eNHSNx5HC+kK8aDVaqd5E2fkHeH9+sxerpEzMo2PmK4T8vChh
mSD4nhfqRwC2WRpPF/MY+zGBeXrFkCkR+nYhaqVDXXACKzfHgU58NOfvrmtRj52X
vTW+cn92wqFTmi0TNUfhEFt8elcOO7/fKh1OVhsFx+bD+bgj8G1ZkLoBU/0QUzRf
R4hiKKuOMnDHriNPdlAOKjHpR+ewh8Q679INThEJzEQpn7VBY72hcQwapQ3MjMnd
E7LfsGwqGPkTc6gy1bFbWum5HMGOcmE0qkrnZo5VyFhNNwBs1Kx/B1GHjUOiucVu
km5GEVNTfCkZqeabdca7fwbGcMH7zchR1ootqH2m18PZJAzr85A+aTqfrdJ5fDBs
v/nznfcPVNEgvEW0im2jhpPoAlQE6/YvYa+kG4zjjxWA5FKVKdTzINexD82jlcqP
+fDtIDxNcFkzlt4gacjh
=YOQs
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull more powerpc updates from Michael Ellerman:
"The change to the Linux page table geometry was delayed for more
testing with 16G pages, and there's the new CPU features stuff which
just needed one more polish before going in. Plus a few changes from
Scott which came in a bit late. And then various fixes, mostly minor.
Summary highlights:
- rework the Linux page table geometry to lower memory usage on
64-bit Book3S (IBM chips) using the Hash MMU.
- support for a new device tree binding for discovering CPU features
on future firmwares.
- Freescale updates from Scott:
"Includes a fix for a powerpc/next mm regression on 64e, a fix for
a kernel hang on 64e when using a debugger inside a relocated
kernel, a qman fix, and misc qe improvements."
Thanks to: Christophe Leroy, Gavin Shan, Horia Geantă, LiuHailong,
Nicholas Piggin, Roy Pledge, Scott Wood, Valentin Longchamp"
* tag 'powerpc-4.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s: Support new device tree binding for discovering CPU features
powerpc: Don't print cpu_spec->cpu_name if it's NULL
of/fdt: introduce of_scan_flat_dt_subnodes and of_get_flat_dt_phandle
powerpc/64s: Fix unnecessary machine check handler relocation branch
powerpc/mm/book3s/64: Rework page table geometry for lower memory usage
powerpc: Fix distclean with Makefile.postlink
powerpc/64e: Don't place the stack beyond TASK_SIZE
powerpc/powernv: Block PCI config access on BCM5718 during EEH recovery
powerpc/8xx: Adding support of IRQ in MPC8xx GPIO
soc/fsl/qbman: Disable IRQs for deferred QBMan work
soc/fsl/qe: add EXPORT_SYMBOL for the 2 qe_tdm functions
soc/fsl/qe: only apply QE_General4 workaround on affected SoCs
soc/fsl/qe: round brg_freq to 1kHz granularity
soc/fsl/qe: get rid of immrbar_virt_to_phys()
net: ethernet: ucc_geth: fix MEM_PART_MURAM mode
powerpc/64e: Fix hang when debugging programs with relocated kernel
Improvement of headers_install by Nicolas Dichtel.
It has been long since the introduction of uapi directories,
but the de-coupling of exported headers has not been completed.
Headers listed in header-y are exported whether they exist in
uapi directories or not. His work fixes this inconsistency.
All (and only) headers under uapi directories are now exported.
The asm-generic wrappers are still exceptions, but this is a big
step forward.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJZE7MBAAoJED2LAQed4NsGroAP/iARejrIFmxuH96D5h2aiP1j
c8KHQ+5fuq4w2KBmfbfkNvWbazlVheT6RrYWBUh/GABGsSqQC07d8New6B8TaUkE
K0E48RsuYxouP18Ys6BOO4/zyRhEFD7Ta72PGQ/gDQY+6hAu4jYQnMdG0wipTblS
QWgnUxTqfCbTjnRpRKXpcwRff+OeTWtOv3s0V8UashJUxnFVQ7Br2uRsm/KKkU/k
jQC65KyHL4HlsFeeAiMmQ9IQPVwLsd6+d5crs0nydHaJ2XrFlNNQ7EEMyG8FxPdx
9b/VpS+XY6DO+jeqkcpFrdL9IgcmCn72Qc5/4vrHuQO2dpWW5mVaVPq9RAGP0Yq/
FB0vZRTp/tOIkD+0esirZW2gJtU3DWMY1A9rc5jjLRabdnRXVTdLfhEnksYJEfES
yPbDEuKyzo6a+zBSqNtMquJPmYVYEDS2mcmgxY5sB58qtXkUN2Yr+uUALxC8XhXW
SHHwIAV3a+UX5ZU9Ys8dp2hI4EXYXtdvsz2zvl4qPIn/Q9d1YoEJRe7/Y0p8gBXM
5pVJ1yohKoYrNZVGBe0LO/gHGVAVgMj0cKn0Xg51bbvjxY2U5djUbMY0uw1gFrrM
O9ld3C6O8zH5BsExCfwp9iPz2SW5W9N80kgnKfjCHBRUKuMTkm02DJf8Hx+pyfVQ
DCy9lYTi76IgZ1uflKq9
=Rqdo
-----END PGP SIGNATURE-----
Merge tag 'kbuild-uapi-v4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild UAPI updates from Masahiro Yamada:
"Improvement of headers_install by Nicolas Dichtel.
It has been long since the introduction of uapi directories, but the
de-coupling of exported headers has not been completed. Headers listed
in header-y are exported whether they exist in uapi directories or
not. His work fixes this inconsistency.
All (and only) headers under uapi directories are now exported. The
asm-generic wrappers are still exceptions, but this is a big step
forward"
* tag 'kbuild-uapi-v4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
arch/include: remove empty Kbuild files
uapi: export all arch specifics directories
uapi: export all headers under uapi directories
smc_diag.h: fix include from userland
btrfs_tree.h: fix include from userland
uapi: includes linux/types.h before exporting files
Makefile.headersinst: remove destination-y option
Makefile.headersinst: cleanup input files
x86: stop exporting msr-index.h to userland
nios2: put setup.h in uapi
h8300: put bitsperlong.h in uapi
Regularly, when a new header is created in include/uapi/, the developer
forgets to add it in the corresponding Kbuild file. This error is usually
detected after the release is out.
In fact, all headers under uapi directories should be exported, thus it's
useless to have an exhaustive list.
After this patch, the following files, which were not exported, are now
exported (with make headers_install_all):
asm-arc/kvm_para.h
asm-arc/ucontext.h
asm-blackfin/shmparam.h
asm-blackfin/ucontext.h
asm-c6x/shmparam.h
asm-c6x/ucontext.h
asm-cris/kvm_para.h
asm-h8300/shmparam.h
asm-h8300/ucontext.h
asm-hexagon/shmparam.h
asm-m32r/kvm_para.h
asm-m68k/kvm_para.h
asm-m68k/shmparam.h
asm-metag/kvm_para.h
asm-metag/shmparam.h
asm-metag/ucontext.h
asm-mips/hwcap.h
asm-mips/reg.h
asm-mips/ucontext.h
asm-nios2/kvm_para.h
asm-nios2/ucontext.h
asm-openrisc/shmparam.h
asm-parisc/kvm_para.h
asm-powerpc/perf_regs.h
asm-sh/kvm_para.h
asm-sh/ucontext.h
asm-tile/shmparam.h
asm-unicore32/shmparam.h
asm-unicore32/ucontext.h
asm-x86/hwcap2.h
asm-xtensa/kvm_para.h
drm/armada_drm.h
drm/etnaviv_drm.h
drm/vgem_drm.h
linux/aspeed-lpc-ctrl.h
linux/auto_dev-ioctl.h
linux/bcache.h
linux/btrfs_tree.h
linux/can/vxcan.h
linux/cifs/cifs_mount.h
linux/coresight-stm.h
linux/cryptouser.h
linux/fsmap.h
linux/genwqe/genwqe_card.h
linux/hash_info.h
linux/kcm.h
linux/kcov.h
linux/kfd_ioctl.h
linux/lightnvm.h
linux/module.h
linux/nbd-netlink.h
linux/nilfs2_api.h
linux/nilfs2_ondisk.h
linux/nsfs.h
linux/pr.h
linux/qrtr.h
linux/rpmsg.h
linux/sched/types.h
linux/sed-opal.h
linux/smc.h
linux/smc_diag.h
linux/stm.h
linux/switchtec_ioctl.h
linux/vfio_ccw.h
linux/wil6210_uapi.h
rdma/bnxt_re-abi.h
Note that I have removed from this list the files which are generated in every
exported directories (like .install or .install.cmd).
Thanks to Julien Floret <julien.floret@6wind.com> for the tip to get all
subdirs with a pure makefile command.
For the record, note that exported files for asm directories are a mix of
files listed by:
- include/uapi/asm-generic/Kbuild.asm;
- arch/<arch>/include/uapi/asm/Kbuild;
- arch/<arch>/include/asm/Kbuild.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
Acked-by: Mark Salter <msalter@redhat.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
The ibm,powerpc-cpu-features device tree binding describes CPU features with
ASCII names and extensible compatibility, privilege, and enablement metadata
that allows improved flexibility and compatibility with new hardware.
The interface is described in detail in ibm,powerpc-cpu-features.txt in this
patch.
Currently this code is not enabled by default, and there are no released
firmwares that provide the binding.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Freescale updates from Scott:
"Includes a fix for a powerpc/next mm regression on 64e, a fix for a
kernel hang on 64e when using a debugger inside a relocated kernel, a
qman fix, and misc qe improvements."
The main thing here is a new implementation of the in-kernel
XICS interrupt controller emulation for POWER9 machines, from Ben
Herrenschmidt.
POWER9 has a new interrupt controller called XIVE (eXternal Interrupt
Virtualization Engine) which is able to deliver interrupts directly
to guest virtual CPUs in hardware without hypervisor intervention.
With this new code, the guest still sees the old XICS interface but
performance is better because the XICS emulation in the host uses the
XIVE directly rather than going through a XICS emulation in firmware.
Conflicts:
arch/powerpc/kernel/cpu_setup_power.S [cherry-picked fix]
arch/powerpc/kvm/book3s_xive.c [include asm/debugfs.h]
Recently in commit f6eedbba7a ("powerpc/mm/hash: Increase VA range to 128TB")
we increased the virtual address space for user processes to 128TB by default,
and up to 512TB if user space opts in.
This obviously required expanding the range of the Linux page tables. For Book3s
64-bit using hash and with PAGE_SIZE=64K, we increased the PGD to 2^15 entries.
This meant we could cover the full address range, while still being able to
insert a 16G hugepage at the PGD level and a 16M hugepage in the PMD.
The downside of that geometry is that it uses a lot of memory for the PGD, and
in particular makes the PGD a 4-page allocation, which means it's much more
likely to fail under memory pressure.
Instead we can make the PMD larger, so that a single PUD entry maps 16G,
allowing the 16G hugepages to sit at that level in the tree. We're then able to
split the remaining bits between the PUG and PGD. We make the PGD slightly
larger as that results in lower memory usage for typical programs.
When THP is enabled the PMD actually doubles in size, to 2^11 entries, or 2^14
bytes, which is large but still < PAGE_SIZE.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJZEHmsAAoJEFmIoMA60/r88SgQAJbFddueb0+DfJ+USDud4b/Z
akfS+G1UAm+TgtMyh1wM49dHzFssp36uWJxtWI+bPqBzuy94PMCbz7JVUV28gX9G
tFhFuc5YH94I/3y85rbZnolb6uZN9MhLjzTFqDC9ilW6HFqmwK4t4wlHSCjQN1St
svLYvs2G6n6/VK3Fre7/wOvdZ1erG4Qod+kn5Tx3K5TQydmRlaSBfK+DRANuDBkM
KzGO7Bkc/Cx8hb9pHmaey/wxmNrrgmVjTtWrEnb2tEq833zP4h6GhUIJEKodMSi5
gXPNZgKlu3n5L592M0UCh4EoHejzkv9wrcsoDm+djmsc5Zg2Howq4kAdHP8k4hUG
0gt8n0ni9vhJN56jikrGi7cAdHCKSNnx2Ue/qTCbX0ncB3XUMuJxJwCsgW/6wa9f
oU7tRtTS03UltnKoFAcyYclS4TaSY4SA4ySaK6Hi+cRkdVFDdyHQYbHHNSU7MsA+
IS2tXvGoIdSYyrZMHSRcl2rRTfYQUkmPEvBF3LvqZr32M4mJMmUNAPLZaly373ZE
iwq0ZJlrLeM0cqdFIG3S60RtJyQk/HBN1NMqrYHArWOxvWIgNd5F8NCsTTxY3wU3
IxgBIuUFcbVwVkqEHGs8K5AvB3oghqdnA3eGOV79799eMtLn3LOvyIlpHMSw9WUq
ags00JtMLitfNPBH3eSl
=eE4D
-----END PGP SIGNATURE-----
Merge tag 'pci-v4.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI updates from Bjorn Helgaas:
- add framework for supporting PCIe devices in Endpoint mode (Kishon
Vijay Abraham I)
- use non-postable PCI config space mappings when possible (Lorenzo
Pieralisi)
- clean up and unify mmap of PCI BARs (David Woodhouse)
- export and unify Function Level Reset support (Christoph Hellwig)
- avoid FLR for Intel 82579 NICs (Sasha Neftin)
- add pci_request_irq() and pci_free_irq() helpers (Christoph Hellwig)
- short-circuit config access failures for disconnected devices (Keith
Busch)
- remove D3 sleep delay when possible (Adrian Hunter)
- freeze PME scan before suspending devices (Lukas Wunner)
- stop disabling MSI/MSI-X in pci_device_shutdown() (Prarit Bhargava)
- disable boot interrupt quirk for ASUS M2N-LR (Stefan Assmann)
- add arch-specific alignment control to improve device passthrough by
avoiding multiple BARs in a page (Yongji Xie)
- add sysfs sriov_drivers_autoprobe to control VF driver binding
(Bodong Wang)
- allow slots below PCI-to-PCIe "reverse bridges" (Bjorn Helgaas)
- fix crashes when unbinding host controllers that don't support
removal (Brian Norris)
- add driver for MicroSemi Switchtec management interface (Logan
Gunthorpe)
- add driver for Faraday Technology FTPCI100 host bridge (Linus
Walleij)
- add i.MX7D support (Andrey Smirnov)
- use generic MSI support for Aardvark (Thomas Petazzoni)
- make Rockchip driver modular (Brian Norris)
- advertise 128-byte Read Completion Boundary support for Rockchip
(Shawn Lin)
- advertise PCI_EXP_LNKSTA_SLC for Rockchip root port (Shawn Lin)
- convert atomic_t to refcount_t in HV driver (Elena Reshetova)
- add CPU IRQ affinity in HV driver (K. Y. Srinivasan)
- fix PCI bus removal in HV driver (Long Li)
- add support for ThunderX2 DMA alias topology (Jayachandran C)
- add ThunderX pass2.x 2nd node MCFG quirk (Tomasz Nowicki)
- add ITE 8893 bridge DMA alias quirk (Jarod Wilson)
- restrict Cavium ACS quirk only to CN81xx/CN83xx/CN88xx devices
(Manish Jaggi)
* tag 'pci-v4.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (146 commits)
PCI: Don't allow unbinding host controllers that aren't prepared
ARM: DRA7: clockdomain: Change the CLKTRCTRL of CM_PCIE_CLKSTCTRL to SW_WKUP
MAINTAINERS: Add PCI Endpoint maintainer
Documentation: PCI: Add userguide for PCI endpoint test function
tools: PCI: Add sample test script to invoke pcitest
tools: PCI: Add a userspace tool to test PCI endpoint
Documentation: misc-devices: Add Documentation for pci-endpoint-test driver
misc: Add host side PCI driver for PCI test function device
PCI: Add device IDs for DRA74x and DRA72x
dt-bindings: PCI: dra7xx: Add DT bindings to enable unaligned access
PCI: dwc: dra7xx: Workaround for errata id i870
dt-bindings: PCI: dra7xx: Add DT bindings for PCI dra7xx EP mode
PCI: dwc: dra7xx: Add EP mode support
PCI: dwc: dra7xx: Facilitate wrapper and MSI interrupts to be enabled independently
dt-bindings: PCI: Add DT bindings for PCI designware EP mode
PCI: dwc: designware: Add EP mode support
Documentation: PCI: Add binding documentation for pci-test endpoint function
ixgbe: Use pcie_flr() instead of duplicating it
IB/hfi1: Use pcie_flr() instead of duplicating it
PCI: imx6: Fix spelling mistake: "contol" -> "control"
...
Merge more updates from Andrew Morton:
- the rest of MM
- various misc things
- procfs updates
- lib/ updates
- checkpatch updates
- kdump/kexec updates
- add kvmalloc helpers, use them
- time helper updates for Y2038 issues. We're almost ready to remove
current_fs_time() but that awaits a btrfs merge.
- add tracepoints to DAX
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (114 commits)
drivers/staging/ccree/ssi_hash.c: fix build with gcc-4.4.4
selftests/vm: add a test for virtual address range mapping
dax: add tracepoint to dax_insert_mapping()
dax: add tracepoint to dax_writeback_one()
dax: add tracepoints to dax_writeback_mapping_range()
dax: add tracepoints to dax_load_hole()
dax: add tracepoints to dax_pfn_mkwrite()
dax: add tracepoints to dax_iomap_pte_fault()
mtd: nand: nandsim: convert to memalloc_noreclaim_*()
treewide: convert PF_MEMALLOC manipulations to new helpers
mm: introduce memalloc_noreclaim_{save,restore}
mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC
mm/huge_memory.c: deposit a pgtable for DAX PMD faults when required
mm/huge_memory.c: use zap_deposited_table() more
time: delete CURRENT_TIME_SEC and CURRENT_TIME
gfs2: replace CURRENT_TIME with current_time
apparmorfs: replace CURRENT_TIME with current_time()
lustre: replace CURRENT_TIME macro
fs: ubifs: replace CURRENT_TIME_SEC with current_time
fs: ufs: use ktime_get_real_ts64() for birthtime
...
Now that crashkernel parameter parsing and vmcoreinfo related code is
moved under CONFIG_CRASH_CORE instead of CONFIG_KEXEC_CORE, remove
dependency with CONFIG_KEXEC for CONFIG_FA_DUMP. While here, get rid of
definitions of fadump_append_elf_note() & fadump_final_note() functions
to reuse similar functions compiled under CONFIG_CRASH_CORE.
Link: http://lkml.kernel.org/r/149035343956.6881.1536459326017709354.stgit@hbathini.in.ibm.com
Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
support; virtual interrupt controller performance improvements; support
for userspace virtual interrupt controller (slower, but necessary for
KVM on the weird Broadcom SoCs used by the Raspberry Pi 3)
* MIPS: basic support for hardware virtualization (ImgTec
P5600/P6600/I6400 and Cavium Octeon III)
* PPC: in-kernel acceleration for VFIO
* s390: support for guests without storage keys; adapter interruption
suppression
* x86: usual range of nVMX improvements, notably nested EPT support for
accessed and dirty bits; emulation of CPL3 CPUID faulting
* generic: first part of VCPU thread request API; kvm_stat improvements
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJZEHUkAAoJEL/70l94x66DBeYH/09wrpJ2FjU4Rqv7FxmqgWfH
9WGi4wvn/Z+XzQSyfMJiu2SfZVzU69/Y67OMHudy7vBT6knB+ziM7Ntoiu/hUfbG
0g5KsDX79FW15HuvuuGh9kSjUsj7qsQdyPZwP4FW/6ZoDArV9mibSvdjSmiUSMV/
2wxaoLzjoShdOuCe9EABaPhKK0XCrOYkygT6Paz1pItDxaSn8iW3ulaCuWMprUfG
Niq+dFemK464E4yn6HVD88xg5j2eUM6bfuXB3qR3eTR76mHLgtwejBzZdDjLG9fk
32PNYKhJNomBxHVqtksJ9/7cSR6iNPs7neQ1XHemKWTuYqwYQMlPj1NDy0aslQU=
=IsiZ
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"ARM:
- HYP mode stub supports kexec/kdump on 32-bit
- improved PMU support
- virtual interrupt controller performance improvements
- support for userspace virtual interrupt controller (slower, but
necessary for KVM on the weird Broadcom SoCs used by the Raspberry
Pi 3)
MIPS:
- basic support for hardware virtualization (ImgTec P5600/P6600/I6400
and Cavium Octeon III)
PPC:
- in-kernel acceleration for VFIO
s390:
- support for guests without storage keys
- adapter interruption suppression
x86:
- usual range of nVMX improvements, notably nested EPT support for
accessed and dirty bits
- emulation of CPL3 CPUID faulting
generic:
- first part of VCPU thread request API
- kvm_stat improvements"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (227 commits)
kvm: nVMX: Don't validate disabled secondary controls
KVM: put back #ifndef CONFIG_S390 around kvm_vcpu_kick
Revert "KVM: Support vCPU-based gfn->hva cache"
tools/kvm: fix top level makefile
KVM: x86: don't hold kvm->lock in KVM_SET_GSI_ROUTING
KVM: Documentation: remove VM mmap documentation
kvm: nVMX: Remove superfluous VMX instruction fault checks
KVM: x86: fix emulation of RSM and IRET instructions
KVM: mark requests that need synchronization
KVM: return if kvm_vcpu_wake_up() did wake up the VCPU
KVM: add explicit barrier to kvm_vcpu_kick
KVM: perform a wake_up in kvm_make_all_cpus_request
KVM: mark requests that do not need a wakeup
KVM: remove #ifndef CONFIG_S390 around kvm_vcpu_wake_up
KVM: x86: always use kvm_make_request instead of set_bit
KVM: add kvm_{test,clear}_request to replace {test,clear}_bit
s390: kvm: Cpu model support for msa6, msa7 and msa8
KVM: x86: remove irq disablement around KVM_SET_CLOCK/KVM_GET_CLOCK
kvm: better MWAIT emulation for guests
KVM: x86: virtualize cpuid faulting
...
Highlights include:
- Larger virtual address space on 64-bit server CPUs. By default we use a 128TB
virtual address space, but a process can request access to the full 512TB by
passing a hint to mmap().
- Support for the new Power9 "XIVE" interrupt controller.
- TLB flushing optimisations for the radix MMU on Power9.
- Support for CAPI cards on Power9, using the "Coherent Accelerator Interface
Architecture 2.0".
- The ability to configure the mmap randomisation limits at build and runtime.
- Several small fixes and cleanups to the kprobes code, as well as support for
KPROBES_ON_FTRACE.
- Major improvements to handling of system reset interrupts, correctly treating
them as NMIs, giving them a dedicated stack and using a new hypervisor call
to trigger them, all of which should aid debugging and robustness.
Many fixes and other minor enhancements.
Thanks to:
Alastair D'Silva, Alexey Kardashevskiy, Alistair Popple, Andrew Donnellan,
Aneesh Kumar K.V, Anshuman Khandual, Anton Blanchard, Balbir Singh, Ben
Hutchings, Benjamin Herrenschmidt, Bhupesh Sharma, Chris Packham, Christian
Zigotzky, Christophe Leroy, Christophe Lombard, Daniel Axtens, David Gibson,
Gautham R. Shenoy, Gavin Shan, Geert Uytterhoeven, Guilherme G. Piccoli,
Hamish Martin, Hari Bathini, Kees Cook, Laurent Dufour, Madhavan Srinivasan,
Mahesh J Salgaonkar, Mahesh Salgaonkar, Masami Hiramatsu, Matt Brown, Matthew
R. Ochs, Michael Neuling, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran,
Pan Xinhui, Paul Mackerras, Rashmica Gupta, Russell Currey, Sukadev
Bhattiprolu, Thadeu Lima de Souza Cascardo, Tobin C. Harding, Tyrel Datwyler,
Uma Krishnan, Vaibhav Jain, Vipin K Parashar, Yang Shi.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJZDHUMAAoJEFHr6jzI4aWAT7oQALkE2Nj3gjcn1z0SkFhq/1iO
Py9Elmqm4E+L6NKYtBY5dS8xVAJ088ffzERyqJ1FY1LHkB8tn8bWRcMQmbjAFzTI
V4TAzDNI890BN/F4ptrYRwNFxRBHAvZ4NDunTzagwYnwmTzW9PYHmOi4pvWTo3Tw
KFUQ0joLSEgHzyfXxYB3fyj41u8N0FZvhfazdNSqia2Y5Vwwv/ION5jKplDM+09Y
EtVEXFvaKAS1sjbM/d/Jo5rblHfR0D9/lYV10+jjyIokjzslIpyTbnj3izeYoM5V
I4h99372zfsEjBGPPXyM3khL3zizGMSDYRmJHQSaKxjtecS9SPywPTZ8ufO/aSzV
Ngq6nlND+f1zep29VQ0cxd3Jh40skWOXzxJaFjfDT25xa6FbfsWP2NCtk8PGylZ7
EyqTuCWkMgIP02KlX3oHvEB2LRRPCDmRU2zECecRGNJrIQwYC2xjoiVi7Q8Qe8rY
gr7Ib5Jj/a+uiTcCIy37+5nXq2s14/JBOKqxuYZIxeuZFvKYuRUipbKWO05WDOAz
m/pSzeC3J8AAoYiqR0gcSOuJTOnJpGhs7zrQFqnEISbXIwLW+ICumzOmTAiBqOEY
Rt8uW2gYkPwKLrE05445RfVUoERaAjaE06eRMOWS6slnngHmmnRJbf3PcoALiJkT
ediqGEj0/N1HMB31V5tS
=vSF3
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Highlights include:
- Larger virtual address space on 64-bit server CPUs. By default we
use a 128TB virtual address space, but a process can request access
to the full 512TB by passing a hint to mmap().
- Support for the new Power9 "XIVE" interrupt controller.
- TLB flushing optimisations for the radix MMU on Power9.
- Support for CAPI cards on Power9, using the "Coherent Accelerator
Interface Architecture 2.0".
- The ability to configure the mmap randomisation limits at build and
runtime.
- Several small fixes and cleanups to the kprobes code, as well as
support for KPROBES_ON_FTRACE.
- Major improvements to handling of system reset interrupts,
correctly treating them as NMIs, giving them a dedicated stack and
using a new hypervisor call to trigger them, all of which should
aid debugging and robustness.
- Many fixes and other minor enhancements.
Thanks to: Alastair D'Silva, Alexey Kardashevskiy, Alistair Popple,
Andrew Donnellan, Aneesh Kumar K.V, Anshuman Khandual, Anton
Blanchard, Balbir Singh, Ben Hutchings, Benjamin Herrenschmidt,
Bhupesh Sharma, Chris Packham, Christian Zigotzky, Christophe Leroy,
Christophe Lombard, Daniel Axtens, David Gibson, Gautham R. Shenoy,
Gavin Shan, Geert Uytterhoeven, Guilherme G. Piccoli, Hamish Martin,
Hari Bathini, Kees Cook, Laurent Dufour, Madhavan Srinivasan, Mahesh J
Salgaonkar, Mahesh Salgaonkar, Masami Hiramatsu, Matt Brown, Matthew
R. Ochs, Michael Neuling, Naveen N. Rao, Nicholas Piggin, Oliver
O'Halloran, Pan Xinhui, Paul Mackerras, Rashmica Gupta, Russell
Currey, Sukadev Bhattiprolu, Thadeu Lima de Souza Cascardo, Tobin C.
Harding, Tyrel Datwyler, Uma Krishnan, Vaibhav Jain, Vipin K Parashar,
Yang Shi"
* tag 'powerpc-4.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (214 commits)
powerpc/64s: Power9 has no LPCR[VRMASD] field so don't set it
powerpc/powernv: Fix TCE kill on NVLink2
powerpc/mm/radix: Drop support for CPUs without lockless tlbie
powerpc/book3s/mce: Move add_taint() later in virtual mode
powerpc/sysfs: Move #ifdef CONFIG_HOTPLUG_CPU out of the function body
powerpc/smp: Document irq enable/disable after migrating IRQs
powerpc/mpc52xx: Don't select user-visible RTAS_PROC
powerpc/powernv: Document cxl dependency on special case in pnv_eeh_reset()
powerpc/eeh: Clean up and document event handling functions
powerpc/eeh: Avoid use after free in eeh_handle_special_event()
cxl: Mask slice error interrupts after first occurrence
cxl: Route eeh events to all drivers in cxl_pci_error_detected()
cxl: Force context lock during EEH flow
powerpc/64: Allow CONFIG_RELOCATABLE if COMPILE_TEST
powerpc/xmon: Teach xmon oops about radix vectors
powerpc/mm/hash: Fix off-by-one in comment about kernel contexts ids
powerpc/pseries: Enable VFIO
powerpc/powernv: Fix iommu table size calculation hook for small tables
powerpc/powernv: Check kzalloc() return value in pnv_pci_table_alloc
powerpc: Add arch/powerpc/tools directory
...
Commit f4ea6dcb08 ("powerpc/mm: Enable mappings above 128TB") increased
the task size on book3s, and introduced a mechanism to dynamically
control whether a task uses these larger addresses. While the change to
the task size itself was ifdef-protected to only apply on book3s, the
change to STACK_TOP_USER64 was not. On book3e, this had the effect of
trying to use addresses up to 128TiB for the stack despite a 64TiB task
size limit -- which broke 64-bit userspace producing the following errors:
Starting init: /sbin/init exists but couldn't execute it (error -14)
Starting init: /bin/sh exists but couldn't execute it (error -14)
Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
Fixes: f4ea6dcb08 ("powerpc/mm: Enable mappings above 128TB")
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Scott Wood <oss@buserror.net>
This patch allows the use of IRQ to notify the change of GPIO status
on MPC8xx CPM IO ports. This then allows to associate IRQs to GPIOs
in the Device Tree.
Ex:
CPM1_PIO_C: gpio-controller@960 {
#gpio-cells = <2>;
compatible = "fsl,cpm1-pario-bank-c";
reg = <0x960 0x10>;
fsl,cpm1-gpio-irq-mask = <0x0fff>;
interrupts = <1 2 6 9 10 11 14 15 23 24 26 31>;
interrupt-parent = <&CPM_PIC>;
gpio-controller;
};
The property 'fsl,cpm1-gpio-irq-mask' defines which of the 16 GPIOs
have the associated interrupts defined in the 'interrupts' property.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
Pull livepatch updates from Jiri Kosina:
- a per-task consistency model is being added for architectures that
support reliable stack dumping (extending this, currently rather
trivial set, is currently in the works).
This extends the nature of the types of patches that can be applied
by live patching infrastructure. The code stems from the design
proposal made [1] back in November 2014. It's a hybrid of SUSE's
kGraft and RH's kpatch, combining advantages of both: it uses
kGraft's per-task consistency and syscall barrier switching combined
with kpatch's stack trace switching. There are also a number of
fallback options which make it quite flexible.
Most of the heavy lifting done by Josh Poimboeuf with help from
Miroslav Benes and Petr Mladek
[1] https://lkml.kernel.org/r/20141107140458.GA21774@suse.cz
- module load time patch optimization from Zhou Chengming
- a few assorted small fixes
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
livepatch: add missing printk newlines
livepatch: Cancel transition a safe way for immediate patches
livepatch: Reduce the time of finding module symbols
livepatch: make klp_mutex proper part of API
livepatch: allow removal of a disabled patch
livepatch: add /proc/<pid>/patch_state
livepatch: change to a per-task consistency model
livepatch: store function sizes
livepatch: use kstrtobool() in enabled_store()
livepatch: move patching functions into patch.c
livepatch: remove unnecessary object loaded check
livepatch: separate enabled and patched states
livepatch/s390: add TIF_PATCH_PENDING thread flag
livepatch/s390: reorganize TIF thread flag bits
livepatch/powerpc: add TIF_PATCH_PENDING thread flag
livepatch/x86: add TIF_PATCH_PENDING thread flag
livepatch: create temporary klp_update_patch_state() stub
x86/entry: define _TIF_ALLWORK_MASK flags explicitly
stacktrace/x86: add function for detecting reliable stack traces
Pull networking updates from David Millar:
"Here are some highlights from the 2065 networking commits that
happened this development cycle:
1) XDP support for IXGBE (John Fastabend) and thunderx (Sunil Kowuri)
2) Add a generic XDP driver, so that anyone can test XDP even if they
lack a networking device whose driver has explicit XDP support
(me).
3) Sparc64 now has an eBPF JIT too (me)
4) Add a BPF program testing framework via BPF_PROG_TEST_RUN (Alexei
Starovoitov)
5) Make netfitler network namespace teardown less expensive (Florian
Westphal)
6) Add symmetric hashing support to nft_hash (Laura Garcia Liebana)
7) Implement NAPI and GRO in netvsc driver (Stephen Hemminger)
8) Support TC flower offload statistics in mlxsw (Arkadi Sharshevsky)
9) Multiqueue support in stmmac driver (Joao Pinto)
10) Remove TCP timewait recycling, it never really could possibly work
well in the real world and timestamp randomization really zaps any
hint of usability this feature had (Soheil Hassas Yeganeh)
11) Support level3 vs level4 ECMP route hashing in ipv4 (Nikolay
Aleksandrov)
12) Add socket busy poll support to epoll (Sridhar Samudrala)
13) Netlink extended ACK support (Johannes Berg, Pablo Neira Ayuso,
and several others)
14) IPSEC hw offload infrastructure (Steffen Klassert)"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2065 commits)
tipc: refactor function tipc_sk_recv_stream()
tipc: refactor function tipc_sk_recvmsg()
net: thunderx: Optimize page recycling for XDP
net: thunderx: Support for XDP header adjustment
net: thunderx: Add support for XDP_TX
net: thunderx: Add support for XDP_DROP
net: thunderx: Add basic XDP support
net: thunderx: Cleanup receive buffer allocation
net: thunderx: Optimize CQE_TX handling
net: thunderx: Optimize RBDR descriptor handling
net: thunderx: Support for page recycling
ipx: call ipxitf_put() in ioctl error path
net: sched: add helpers to handle extended actions
qed*: Fix issues in the ptp filter config implementation.
qede: Fix concurrency issue in PTP Tx path processing.
stmmac: Add support for SIMATIC IOT2000 platform
net: hns: fix ethtool_get_strings overflow in hns driver
tcp: fix wraparound issue in tcp_lp
bpf, arm64: fix jit branch offset related to ldimm64
bpf, arm64: implement jiting of BPF_XADD
...
Pull x86 mm updates from Ingo Molnar:
"The main x86 MM changes in this cycle were:
- continued native kernel PCID support preparation patches to the TLB
flushing code (Andy Lutomirski)
- various fixes related to 32-bit compat syscall returning address
over 4Gb in applications, launched from 64-bit binaries - motivated
by C/R frameworks such as Virtuozzo. (Dmitry Safonov)
- continued Intel 5-level paging enablement: in particular the
conversion of x86 GUP to the generic GUP code. (Kirill A. Shutemov)
- x86/mpx ABI corner case fixes/enhancements (Joerg Roedel)
- ... plus misc updates, fixes and cleanups"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits)
mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash
x86/mm: Fix flush_tlb_page() on Xen
x86/mm: Make flush_tlb_mm_range() more predictable
x86/mm: Remove flush_tlb() and flush_tlb_current_task()
x86/vm86/32: Switch to flush_tlb_mm_range() in mark_screen_rdonly()
x86/mm/64: Fix crash in remove_pagetable()
Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation"
x86/boot/e820: Remove a redundant self assignment
x86/mm: Fix dump pagetables for 4 levels of page tables
x86/mpx, selftests: Only check bounds-vs-shadow when we keep shadow
x86/mpx: Correctly report do_mpx_bt_fault() failures to user-space
Revert "x86/mm/numa: Remove numa_nodemask_from_meminfo()"
x86/espfix: Add support for 5-level paging
x86/kasan: Extend KASAN to support 5-level paging
x86/mm: Add basic defines/helpers for CONFIG_X86_5LEVEL=y
x86/paravirt: Add 5-level support to the paravirt code
x86/mm: Define virtual memory map for 5-level paging
x86/asm: Remove __VIRTUAL_MASK_SHIFT==47 assert
x86/boot: Detect 5-level paging support
x86/mm/numa: Remove numa_nodemask_from_meminfo()
...
Pull x86 asm updates from Ingo Molnar:
"The main changes in this cycle were:
- unwinder fixes and enhancements
- improve ftrace interaction with the unwinder
- optimize the code footprint of WARN() and related debugging
constructs
- ... plus misc updates, cleanups and fixes"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
x86/unwind: Dump all stacks in unwind_dump()
x86/unwind: Silence more entry-code related warnings
x86/ftrace: Fix ebp in ftrace_regs_caller that screws up unwinder
x86/unwind: Remove unused 'sp' parameter in unwind_dump()
x86/unwind: Prepend hex mask value with '0x' in unwind_dump()
x86/unwind: Properly zero-pad 32-bit values in unwind_dump()
x86/unwind: Ensure stack pointer is aligned
debug: Avoid setting BUGFLAG_WARNING twice
x86/unwind: Silence entry-related warnings
x86/unwind: Read stack return address in update_stack_state()
x86/unwind: Move common code into update_stack_state()
debug: Fix __bug_table[] in arch linker scripts
debug: Add _ONCE() logic to report_bug()
x86/debug: Define BUG() again for !CONFIG_BUG
x86/debug: Implement __WARN() using UD0
x86/ftrace: Use Makefile logic instead of #ifdef for compiling ftrace_*.o
x86/ftrace: Add -mfentry support to x86_32 with DYNAMIC_FTRACE set
x86/ftrace: Clean up ftrace_regs_caller
x86/ftrace: Add stack frame pointer to ftrace_caller
x86/ftrace: Move the ftrace specific code out of entry_32.S
...
Pull uaccess unification updates from Al Viro:
"This is the uaccess unification pile. It's _not_ the end of uaccess
work, but the next batch of that will go into the next cycle. This one
mostly takes copy_from_user() and friends out of arch/* and gets the
zero-padding behaviour in sync for all architectures.
Dealing with the nocache/writethrough mess is for the next cycle;
fortunately, that's x86-only. Same for cleanups in iov_iter.c (I am
sold on access_ok() in there, BTW; just not in this pile), same for
reducing __copy_... callsites, strn*... stuff, etc. - there will be a
pile about as large as this one in the next merge window.
This one sat in -next for weeks. -3KLoC"
* 'work.uaccess' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (96 commits)
HAVE_ARCH_HARDENED_USERCOPY is unconditional now
CONFIG_ARCH_HAS_RAW_COPY_USER is unconditional now
m32r: switch to RAW_COPY_USER
hexagon: switch to RAW_COPY_USER
microblaze: switch to RAW_COPY_USER
get rid of padding, switch to RAW_COPY_USER
ia64: get rid of copy_in_user()
ia64: sanitize __access_ok()
ia64: get rid of 'segment' argument of __do_{get,put}_user()
ia64: get rid of 'segment' argument of __{get,put}_user_check()
ia64: add extable.h
powerpc: get rid of zeroing, switch to RAW_COPY_USER
esas2r: don't open-code memdup_user()
alpha: fix stack smashing in old_adjtimex(2)
don't open-code kernel_setsockopt()
mips: switch to RAW_COPY_USER
mips: get rid of tail-zeroing in primitives
mips: make copy_from_user() zero tail explicitly
mips: clean and reorder the forest of macros...
mips: consolidate __invoke_... wrappers
...
* pci/resource-mmap:
ia64: Use generic pci_mmap_resource_range()
ia64: Remove redundant checks for WC in pci_mmap_page_range()
ia64: Remove redundant valid_mmap_phys_addr_range() from pci_mmap_page_range()
PCI: Add I/O BAR support to generic pci_mmap_resource_range()
x86/PCI: Use generic pci_mmap_resource_range()
unicore32/PCI: Use generic pci_mmap_resource_range()
sh/PCI: Use generic pci_mmap_resource_range()
parisc: Use generic pci_mmap_resource_range()
mn10300/PCI: Use generic pci_mmap_resource_range()
MIPS: PCI: Use generic pci_mmap_resource_range()
cris/PCI: Use generic pci_mmap_resource_range()
ARM/PCI: Use generic pci_mmap_resource_range()
PCI: Add pci_mmap_resource_range() and use it for ARM64
PCI: Add BAR index argument to pci_mmap_page_range()
PCI: Use BAR index in sysfs attr->private instead of resource pointer
PCI: Add arch_can_pci_mmap_io() on architectures which can mmap() I/O space
PCI: Move multiple declarations of pci_mmap_page_range() to <linux/pci.h>
PCI: Add arch_can_pci_mmap_wc() macro
xtensa/PCI: Do not mmap PCI BARs to userspace as write-through
PCI: Only allow WC mmap on prefetchable resources
PCI: Fix another sanity check bug in /proc/pci mmap
PCI: Fix pci_mmap_fits() for HAVE_PCI_RESOURCE_TO_USER platforms
Michal Suchánek noticed a comment in book3s/64/mmu-hash.h about the context ids
we use for the kernel was inconsistent with the code and other comments in the
same file.
It should read 1-4 not 1-5.
While we're touching it, update "address" to "addresses" which makes more sense
as it's referring to more than one address below.
Reported-by: Michal Suchánek <msuchanek@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Have the NMI IPI code use this op when the platform defines it.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add a simple NMI IPI system that handles concurrency and reentrancy.
The platform does not have to implement a true non-maskable interrupt,
the default is to simply use the debugger break IPI message. This has
now been co-opted for a general IPI message, and users (debugger and
crash) have been reimplemented on top of the NMI system.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Incorporate incremental fixes from Nick]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The system reset interrupt is used for crash/debug situations, so it is
desirable to have as little impact on the normal state of the system as
possible.
Currently it uses the current kernel stack to process the exception.
This stores into the stack which may be involved with the crash. The
stack pointer may be corrupted, or it may have overflowed.
Avoid or minimise these problems by creating a dedicated NMI stack for
the system reset interrupt to use.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In preparation for using a dedicated stack for system reset interrupts,
prevent a nested system reset from recovering, in order to simplify
code that is called in crash/debug path. This allows a system reset
interrupt to just use the base stack pointer.
Keep an in_nmi nesting counter similarly to the in_mce counter. Consider
the interrrupt non-recoverable if it is taken inside another system
reset.
Interrupt nesting could be allowed similarly to MCE, but system reset
is a special case that's not for normal operation, so simplicity wins
until there is requirement for nested system reset interrupts.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The system reset interrupt can occur when MSR_EE=0, and it currently
uses the PACA_EXGEN save area.
Some PACA_EXGEN interrupts have a window where MSR_RI=1 and MSR_EE=0
when the save area is still in use. A system reset interrupt in this
window can lead to undetected corruption when the save area gets
overwritten.
This patch introduces PACA_EXNMI save area for system reset exceptions,
which closes this corruption window. It's also helpful to retain the
EXGEN state for debugging situations, even if not considering the
recoverability aspect.
This patch also moves the PACA_EXMC area down to a less frequently used
part of the paca with the new save area.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This code is common to a few exceptions, and another user will be added.
This causes a trivial change to generated code:
- 604: std r9,416(r1)
- 608: mfspr r11,314
- 60c: std r11,368(r1)
- 610: mfspr r12,315
+ 604: mfspr r11,314
+ 608: mfspr r12,315
+ 60c: std r9,416(r1)
+ 610: std r11,368(r1)
machine_check_powernv_early could also use this, but that requires non
trivial changes to generated code, so that's for another patch.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Subsequent patches will add more non-RI variant exceptions, so
create a macro for it rather than open-code it.
This does not change generated instructions.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This merges in the powerpc topic/xive branch to bring in the code for
the in-kernel XICS interrupt controller emulation to use the new XIVE
(eXternal Interrupt Virtualization Engine) hardware in the POWER9 chip
directly, rather than via a XICS emulation in firmware.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
On some targets, _PAGE_RW is 0 and this is _PAGE_RO which is used.
There is also _PAGE_SHARED that is missing.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch makes KVM capable of using the XIVE interrupt controller
to provide the standard PAPR "XICS" style hypercalls. It is necessary
for proper operations when the host uses XIVE natively.
This has been lightly tested on an actual system, including PCI
pass-through with a TG3 device.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Cleanup pr_xxx(), unsplit pr_xxx() strings, etc., fix build
failures by adding KVM_XIVE which depends on KVM_XICS and XIVE, and
adding empty stubs for the kvm_xive_xxx() routines, fixup subject,
integrate fixes from Paul for building PR=y HV=n]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc expects IRQs to already be (soft) disabled when switch_mm() is
called, as made clear in the commit message of 9c1e105238 ("powerpc: Allow
perf_counters to access user memory at interrupt time").
Aside from any race conditions that might exist between switch_mm() and an IRQ,
there is also an unconditional hard_irq_disable() in switch_slb(). If that isn't
followed at some point by an IRQ enable then interrupts will remain disabled
until we return to userspace.
It is true that when switch_mm() is called from the scheduler IRQs are off, but
not when it's called by use_mm(). Looking closer we see that last year in commit
f98db6013c ("sched/core: Add switch_mm_irqs_off() and use it in the scheduler")
this was made more explicit by the addition of switch_mm_irqs_off() which is now
called by the scheduler, vs switch_mm() which is used by use_mm().
Arguably it is a bug in use_mm() to call switch_mm() in a different context than
it expects, but fixing that will take time.
This was discovered recently when vhost started throwing warnings such as:
BUG: sleeping function called from invalid context at kernel/mutex.c:578
in_atomic(): 0, irqs_disabled(): 1, pid: 10768, name: vhost-10760
no locks held by vhost-10760/10768.
irq event stamp: 10
hardirqs last enabled at (9): _raw_spin_unlock_irq+0x40/0x80
hardirqs last disabled at (10): switch_slb+0x2e4/0x490
softirqs last enabled at (0): copy_process+0x5e8/0x1260
softirqs last disabled at (0): (null)
Call Trace:
show_stack+0x88/0x390 (unreliable)
dump_stack+0x30/0x44
__might_sleep+0x1c4/0x2d0
mutex_lock_nested+0x74/0x5c0
cgroup_attach_task_all+0x5c/0x180
vhost_attach_cgroups_work+0x58/0x80 [vhost]
vhost_worker+0x24c/0x3d0 [vhost]
kthread+0xec/0x100
ret_from_kernel_thread+0x5c/0xd4
Prior to commit 04b96e5528 ("vhost: lockless enqueuing") (Aug 2016) the
vhost_worker() would do a spin_unlock_irq() not long after calling use_mm(),
which had the effect of reenabling IRQs. Since that commit removed the locking
in vhost_worker() the body of the vhost_worker() loop now runs with interrupts
off causing the warnings.
This patch addresses the problem by making the powerpc code mirror the x86 code,
ie. we disable interrupts in switch_mm(), and optimise the scheduler case by
defining switch_mm_irqs_off().
Cc: stable@vger.kernel.org # v4.7+
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[mpe: Flesh out/rewrite change log, add stable]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Although most of these kprobes patches are powerpc specific, there's a couple
that touch generic code (with Acks). At the moment there's one conflict with
acme's tree, but it's not too bad. Still just in case some other conflicts show
up, we've put these in a topic branch so another tree could merge some or all of
it if necessary.
kprobe_lookup_name() is specific to the kprobe subsystem and may not always
return the function entry point (in a subsequent patch for KPROBES_ON_FTRACE).
For looking up function entry points, introduce a separate helper and use it
in optprobes.c
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Allow kprobes to be placed on ftrace _mcount() call sites. This optimization
avoids the use of a trap, by riding on ftrace infrastructure.
This depends on HAVE_DYNAMIC_FTRACE_WITH_REGS which depends on MPROFILE_KERNEL,
which is only currently enabled on powerpc64le with newer toolchains.
Based on the x86 code by Masami.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Blacklist all the exception common/OOL handlers as the kernel stack is not yet
setup, which means we can't take a trap at this point.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Introduce __head_end to mark end of the early fixed sections and use it to
blacklist all exception handlers from kprobes.
mpe: We do not need to do anything special for relocatable kernels, where the
exception vectors are split from the main kernel, as the split vectors are
already excluded by the check for kernel_text_address().
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
[mpe: Move __head_end outside #ifdef 64-bit to unbreak the 32-bit build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The idle workaround does not need to load PACATOC, and it does not
need to be called within a nested function that requires LR to be
saved.
Load the PACATOC at entry to the idle wakeup. It does not matter which
PACA this comes from, so it's okay to call before the workaround. Then
apply the workaround to get the right PACA.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If not all threads were in winkle, full state loss recovery is not
necessary and can be avoided. A previous patch removed this optimisation
due to some complexity with the implementation. Re-implement it by
counting the number of threads in winkle with the per-core idle state.
Only restore full state loss if all threads were in winkle.
This has a small window of false positives right before threads execute
winkle and just after they wake up, when the winkle count does not
reflect the true number of threads in winkle. This is not a significant
problem in comparison with even the minimum winkle duration. For
correctness, a false positive is not a problem (only false negatives
would be).
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In preparation for adding more bits to the core idle state word, move
the lock bit up, and unlock by flipping the lock bit rather than masking
off all but the thread bits.
Add branch hints for atomic operations while we're here.
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The ISA specifies power save wakeup due to a machine check exception can
cause a machine check interrupt (rather than the usual system reset
interrupt).
The machine check handler copes with this by doing low level machine
check recovery without restoring full state from idle, then queues up a
machine check event for logging, then directly executes the same idle
instruction it woke from. This minimises the work done before recovery
is performed.
The problem is that it requires machine specific instructions and
knowledge of the book3s idle code. Currently it only has code to handle
POWER8 idle, so POWER9 crashes when trying to execute the P8 idle
instructions which don't exist in ISAv3.0B.
cpu 0x0: Vector: e40 (Emulation Assist) at [c0000000008f3810]
pc: c000000000008380: machine_check_handle_early+0x130/0x2f0
lr: c00000000053a098: stop_loop+0x68/0xd0
sp: c0000000008f3a90
msr: 9000000000081001
current = 0xc0000000008a1080
paca = 0xc00000000ffd0000 softe: 0 irq_happened: 0x01
pid = 0, comm = swapper/0
Instead of going to sleep after recovery, do the usual idle wakeup and
state restoration by calling into the normal idle wakeup path. This
reuses the normal idle wakeup paths.
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The POWER8 idle code has a neat trick of programming the power on engine
to restore a low bit into HSPRG0, so idle wakeup code can test and see
if it has been programmed this way and therefore lost all state. Restore
time can be reduced if winkle has not been reached.
However this messes with our r13 PACA pointer, and requires HSPRG0 to be
written to. It also optimizes the slowest and most uncommon case at the
expense of another SPR write in the common nap state wakeup.
Remove this complexity and assume winkle sleeps always require a state
restore. This speedup could be made entirely contained within the winkle
idle code by counting per-core winkles and setting a thread bitmap when
all have gone to winkle.
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The system reset idle handler system_reset_idle_common is relocated, so
relocation is not required to branch to kvm_start_guest. The superfluous
relocation does not result in incorrect code, but it does not compile
outside of exception-64s.S (with fixed section definitions).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Both conflict were simple overlapping changes.
In the kaweth case, Eric Dumazet's skb_cow() bug fix overlapped the
conversion of the driver in net-next to use in-netdev stats.
Signed-off-by: David S. Miller <davem@davemloft.net>
The default implementation of ioremap_cache() is aliased to ioremap().
On powerpc ioremap() creates cache-inhibited mappings by default which
is almost certainly not what you wanted.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The macro is now pretty long and ugly on powerpc. In the light of further
changes needed here, convert it to a __weak variant to be over-ridden with a
nicer looking function.
Suggested-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Hypervisor Virtualization and Directed Hypervisor Doorbell interrupt handlers
use the macro EXC_VIRT_OOL_MASKABLE_HV for their relocation-on handlers, which
calls MASKABLE_RELON_EXCEPTION_HV_OOL, which uses the *real mode* interrupt
prolog. This means we needlessly rfid from virtual mode to virtual mode.
For POWER8 it only affects doorbell IPIs. Context switch microbenchmark between
threads with snooze disabled (which causes IPI) gets about 3% faster, about 370
cycles. Should be more important on POWER9 with global doorbells and HVI for
host interrupts.
Use the RELON variant instead to reduce overhead.
Fixes: 1707dd1613 ("powerpc: Save CFAR before branching in interrupt entry paths")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fold some more detail into the change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
without passing them to user space which saves time on switching
to user space and back.
This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
KVM tries to handle a TCE request in the real mode, if failed
it passes the request to the virtual mode to complete the operation.
If it a virtual mode handler fails, the request is passed to
the user space; this is not expected to happen though.
To avoid dealing with page use counters (which is tricky in real mode),
this only accelerates SPAPR TCE IOMMU v2 clients which are required
to pre-register the userspace memory. The very first TCE request will
be handled in the VFIO SPAPR TCE driver anyway as the userspace view
of the TCE table (iommu_table::it_userspace) is not allocated till
the very first mapping happens and we cannot call vmalloc in real mode.
If we fail to update a hardware IOMMU table unexpected reason, we just
clear it and move on as there is nothing really we can do about it -
for example, if we hot plug a VFIO device to a guest, existing TCE tables
will be mirrored automatically to the hardware and there is no interface
to report to the guest about possible failures.
This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
and associates a physical IOMMU table with the SPAPR TCE table (which
is a guest view of the hardware IOMMU table). The iommu_table object
is cached and referenced so we do not have to look up for it in real mode.
This does not implement the UNSET counterpart as there is no use for it -
once the acceleration is enabled, the existing userspace won't
disable it unless a VFIO container is destroyed; this adds necessary
cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
space.
This adds real mode version of WARN_ON_ONCE() as the generic version
causes problems with rcu_sched. Since we testing what vmalloc_to_phys()
returns in the code, this also adds a check for already existing
vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().
This finally makes use of vfio_external_user_iommu_id() which was
introduced quite some time ago and was considered for removal.
Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This reworks helpers for checking TCE update parameters in way they
can be used in KVM.
This should cause no behavioral change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The guest view TCE tables are per KVM anyway (not per VCPU) so pass kvm*
there. This will be used in the following patches where we will be
attaching VFIO containers to LIOBNs via ioctl() to KVM (rather than
to VCPU).
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This merges in the commits in the topic/ppc-kvm branch of the powerpc
tree to get the changes to arch/powerpc which subsequent patches will
rely on.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
PR KVM page fault handler performs eaddr to pte translation for a guest,
however kvmppc_mmu_book3s_64_xlate() does not preserve WIMG bits
(storage control) in the kvmppc_pte struct. If PR KVM is running as
a second level guest under HV KVM, and PR KVM tries inserting HPT entry,
this fails in HV KVM if it already has this mapping.
This preserves WIMG bits between kvmppc_mmu_book3s_64_xlate() and
kvmppc_mmu_map_page().
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
For completeness, this adds emulation of the lfiwax and lfiwzx
instructions. With this, all floating-point load and store instructions
as of Power ISA V2.07 are emulated.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds emulation for the following integer loads and stores,
thus enabling them to be used in a guest for accessing emulated
MMIO locations.
- lhaux
- lwaux
- lwzux
- ldu
- lwa
- stdux
- stwux
- stdu
- ldbrx
- stdbrx
Previously, most of these would cause an emulation failure exit to
userspace, though ldu and lwa got treated incorrectly as ld, and
stdu got treated incorrectly as std.
This also tidies up some of the formatting and updates the comment
listing instructions that still need to be implemented.
With this, all integer loads and stores that are defined in the Power
ISA v2.07 are emulated, except for those that are permitted to trap
when used on cache-inhibited or write-through mappings (and which do
in fact trap on POWER8), that is, lmw/stmw, lswi/stswi, lswx/stswx,
lq/stq, and l[bhwdq]arx/st[bhwdq]cx.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds missing stdx emulation for emulated MMIO accesses by KVM
guests. This allows the Mellanox mlx5_core driver from recent kernels
to work when MMIO emulation is enforced by userspace.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This patch provides the MMIO load/store emulation for instructions
of 'double & vector unsigned char & vector signed char & vector
unsigned short & vector signed short & vector unsigned int & vector
signed int & vector double '.
The instructions that this adds emulation for are:
- ldx, ldux, lwax,
- lfs, lfsx, lfsu, lfsux, lfd, lfdx, lfdu, lfdux,
- stfs, stfsx, stfsu, stfsux, stfd, stfdx, stfdu, stfdux, stfiwx,
- lxsdx, lxsspx, lxsiwax, lxsiwzx, lxvd2x, lxvw4x, lxvdsx,
- stxsdx, stxsspx, stxsiwx, stxvd2x, stxvw4x
[paulus@ozlabs.org - some cleanups, fixes and rework, make it
compile for Book E, fix build when PR KVM is built in]
Signed-off-by: Bin Lu <lblulb@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This provides functions that can be used for generating interrupts
indicating that a given functional unit (floating point, vector, or
VSX) is unavailable. These functions will be used in instruction
emulation code.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Override pcibios_default_alignment() to set default alignment to PAGE_SIZE
for all PCI devices on PowerNV platform. Thus sub-page BARs would not
share a page and could be mapped into guest when VFIO passthrough them.
Signed-off-by: Yongji Xie <elohimes@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Power9 DD1 does not implement SAO. Although it's not widely used, its presence
or absence is visible to user space via arch_validate_prot() so it's moderately
important that we get the value right.
Fixes: 7dccfbc325 ("powerpc/book3s: Add a cpu table entry for different POWER9 revs")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Power9 does not implement the icswx instruction. This CPU feature is not visible
to userspace and is only used in the CONFIG_PPC_ICSWX code, which is generally
not enabled, and can only be triggered by other code using icswx, which should
not happen on Power9 systems in the first place. So impact should be minimal.
Fixes: c3ab300ea5 ("powerpc: Add POWER9 cputable entry")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Threshold feature when used with MMCRA [Threshold Event Counter Event],
MMCRA[Threshold Start event] and MMCRA[Threshold End event] will update
MMCRA[Threashold Event Counter Exponent] and MMCRA[Threshold Event
Counter Multiplier] with the corresponding threshold event count values.
Patch to export MMCRA[TECX/TECM] to userspace in 'weight' field of
struct perf_sample_data.
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The LDST field and DATA_SRC in SIER identifies the memory hierarchy level
(eg: L1, L2 etc), from which a data-cache miss for a marked instruction
was satisfied. Use the 'perf_mem_data_src' object to export this
hierarchy level to user space.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This is relatively esoteric, and knowing that we don't have it makes life
easier in some cases rather than just an eventual -EINVAL from
pci_mmap_page_range().
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
We can declare it <linux/pci.h> even on platforms where it isn't going to
be defined. There's no need to have it littered through the various
<asm/pci.h> files.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>