Commit Graph

5648 Commits

Author SHA1 Message Date
Christophe Leroy 1b5c0967ab powerpc/40x: Remove support for IBM 403GCX
CONFIG_403GCX is not user selectable and is not
selected by any platform.

Remove it.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/635f8f5ce9d1f761b3bd8dc3e8ddad500cea26c4.1590079968.git.christophe.leroy@csgroup.eu
2020-05-28 23:24:35 +10:00
Christophe Leroy 4e1df545e2 powerpc/pgtable: Drop PTE_ATOMIC_UPDATES
40x was the last user of PTE_ATOMIC_UPDATES.

Drop everything related to PTE_ATOMIC_UPDATES.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/dbe8438fd1ed3e500132c8ab70269d4e6cc84531.1590079968.git.christophe.leroy@csgroup.eu
2020-05-28 23:24:35 +10:00
Christophe Leroy 2c74e2586b powerpc/40x: Rework 40x PTE access and TLB miss
Commit 1bc54c0311 ("powerpc: rework 4xx PTE access and TLB miss")
reworked 44x PTE access to avoid atomic pte updates, and
left 8xx, 40x and fsl booke with atomic pte updates.
Commit 6cfd8990e2 ("powerpc: rework FSL Book-E PTE access and TLB
miss") removed atomic pte updates on fsl booke.
It went away on 8xx with commit ddfc20a3b9 ("powerpc/8xx: Remove
PTE_ATOMIC_UPDATES").

40x is the last platform setting PTE_ATOMIC_UPDATES.

Rework PTE access and TLB miss to remove PTE_ATOMIC_UPDATES for 40x:
- Always handle DSI as a fault.
- Bail out of TLB miss handler when CONFIG_SWAP is set and
_PAGE_ACCESSED is not set.
- Bail out of ITLB miss handler when _PAGE_EXEC is not set.
- Only set WR bit when both _PAGE_RW and _PAGE_DIRTY are set.
- Remove _PAGE_HWWRITE
- Don't require PTE_ATOMIC_UPDATES anymore

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/99a0fcd337ef67088140d1647d75fea026a70413.1590079968.git.christophe.leroy@csgroup.eu
2020-05-28 23:24:34 +10:00
Michal Simek 7ade8495dc powerpc: Remove Xilinx PPC405/PPC440 support
The latest Xilinx design tools called ISE and EDK has been released in
October 2013. New tool doesn't support any PPC405/PPC440 new designs.
These platforms are no longer supported and tested.

PowerPC 405/440 port is orphan from 2013 by
commit cdeb89943b ("MAINTAINERS: Fix incorrect status tag") and
commit 19624236cc ("MAINTAINERS: Update Grant's email address and maintainership")
that's why it is time to remove the support fot these platforms.

Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/8c593895e2cb57d232d85ce4d8c3a1aa7f0869cc.1590079968.git.christophe.leroy@csgroup.eu
2020-05-28 23:24:34 +10:00
Nicholas Piggin 18594f9b8c powerpc/64s/radix: Don't prefetch DAR in update_mmu_cache
The idea behind this prefetch was to kick off a page table walk before
returning from the fault, getting some pipelining advantage.

But this never showed up any noticable performance advantage, and in
fact with KUAP the prefetches are actually blocked and cause some
kind of micro-architectural fault. Removing this improves page fault
microbenchmark performance by about 9%.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Keep the early return in update_mmu_cache()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200504122907.49304-1-npiggin@gmail.com
2020-05-28 23:24:34 +10:00
Tianjia Zhang 8c99d34578 KVM: PPC: Clean up redundant 'kvm_run' parameters
In the current kvm version, 'kvm_run' has been included in the 'kvm_vcpu'
structure. For historical reasons, many kvm-related function parameters
retain the 'kvm_run' and 'kvm_vcpu' parameters at the same time. This
patch does a unified cleanup of these remaining redundant parameters.

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2020-05-27 11:39:31 +10:00
Tianjia Zhang 2610a57f64 KVM: PPC: Remove redundant kvm_run from vcpu_arch
The 'kvm_run' field already exists in the 'vcpu' structure, which
is the same structure as the 'kvm_run' in the 'vcpu_arch' and
should be deleted.

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2020-05-27 11:39:31 +10:00
Michael Ellerman 16ef9767e4 powerpc: Add ppc_inst_as_u64()
The code patching code wants to get the value of a struct ppc_inst as
a u64 when the instruction is prefixed, so we can pass the u64 down to
__put_user_asm() and write it with a single store.

The optprobes code wants to load a struct ppc_inst as an immediate
into a register so it is useful to have it as a u64 to use the
existing helper function.

Currently this is a bit awkward because the value differs based on the
CPU endianness, so add a helper to do the conversion.

This fixes the usage in arch_prepare_optimized_kprobe() which was
previously incorrect on big endian.

Fixes: 650b55b707 ("powerpc: Add prefixed instructions to instruction data type")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Jordan Niethe <jniethe5@gmail.com>
Link: https://lore.kernel.org/r/20200526072630.2487363-1-mpe@ellerman.id.au
2020-05-26 23:36:57 +10:00
Michael Ellerman c5ff46d69c powerpc: Add ppc_inst_next()
In a few places we want to calculate the address of the next
instruction. Previously that was simple, we just added 4 bytes, or if
using a u32 * we incremented that pointer by 1.

But prefixed instructions make it more complicated, we need to advance
by either 4 or 8 bytes depending on the actual instruction. We also
can't do pointer arithmetic using struct ppc_inst, because it is
always 8 bytes in size on 64-bit, even though we might only need to
advance by 4 bytes.

So add a ppc_inst_next() helper which calculates the location of the
next instruction, if the given instruction was located at the given
address. Note the instruction doesn't need to actually be at the
address in memory.

Although it would seem natural for the value to be passed by value,
that makes it too easy to write a loop that will read off the end of a
page, eg:

	for (; src < end; src = ppc_inst_next(src, *src),
			  dest = ppc_inst_next(dest, *dest))

As noticed by Christophe and Jordan, if end is the exact end of a
page, and the next page is not mapped, this will fault, because *dest
will read 8 bytes, 4 bytes into the next page.

So value is passed by reference, so the helper can be careful to use
ppc_inst_read() on it.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Jordan Niethe <jniethe5@gmail.com>
Link: https://lore.kernel.org/r/20200522133318.1681406-1-mpe@ellerman.id.au
2020-05-26 23:36:51 +10:00
Michael Ellerman baddc87d68 Merge branch 'fixes' into next
Merge our fixes branch from this cycle. It contains several important
fixes we need in next for testing purposes, and also some that will
conflict with upcoming changes.
2020-05-26 22:56:03 +10:00
Michael Ellerman bb5f33c069 Merge "Use hugepages to map kernel mem on 8xx" into next
Merge Christophe's large series to use huge pages for the linear
mapping on 8xx.

From his cover letter:

The main purpose of this big series is to:
- reorganise huge page handling to avoid using mm_slices.
- use huge pages to map kernel memory on the 8xx.

The 8xx supports 4 page sizes: 4k, 16k, 512k and 8M.
It uses 2 Level page tables, PGD having 1024 entries, each entry
covering 4M address space. Then each page table has 1024 entries.

At the time being, page sizes are managed in PGD entries, implying
the use of mm_slices as it can't mix several pages of the same size
in one page table.

The first purpose of this series is to reorganise things so that
standard page tables can also handle 512k pages. This is done by
adding a new _PAGE_HUGE flag which will be copied into the Level 1
entry in the TLB miss handler. That done, we have 2 types of pages:
- PGD entries to regular page tables handling 4k/16k and 512k pages
- PGD entries to hugepd tables handling 8M pages.

There is no need to mix 8M pages with other sizes, because a 8M page
will use more than what a single PGD covers.

Then comes the second purpose of this series. At the time being, the
8xx has implemented special handling in the TLB miss handlers in order
to transparently map kernel linear address space and the IMMR using
huge pages by building the TLB entries in assembly at the time of the
exception.

As mm_slices is only for user space pages, and also because it would
anyway not be convenient to slice kernel address space, it was not
possible to use huge pages for kernel address space. But after step
one of the series, it is now more flexible to use huge pages.

This series drop all assembly 'just in time' handling of huge pages
and use huge pages in page tables instead.

Once the above is done, then comes icing on the cake:
- Use huge pages for KASAN shadow mapping
- Allow pinned TLBs with strict kernel rwx
- Allow pinned TLBs with debug pagealloc

Then, last but not least, those modifications for the 8xx allows the
following improvement on book3s/32:
- Mapping KASAN shadow with BATs
- Allowing BATs with debug pagealloc

All this allows to considerably simplify TLB miss handlers and associated
initialisation. The overhead of reading page tables is negligible
compared to the reduction of the miss handlers.

While we were at touching pte_update(), some cleanup was done
there too.

Tested widely on 8xx and 832x. Boot tested on QEMU MAC99.
2020-05-26 22:54:27 +10:00
Christophe Leroy 7974c47326 powerpc/32s: Implement dedicated kasan_init_region()
Implement a kasan_init_region() dedicated to book3s/32 that
allocates KASAN regions using BATs.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/709e821602b48a1d7c211a9b156da26db98c3e9d.1589866984.git.christophe.leroy@csgroup.eu
2020-05-26 22:22:23 +10:00
Christophe Leroy 34536d7806 powerpc/8xx: Add a function to early map kernel via huge pages
Add a function to early map kernel memory using huge pages.

For 512k pages, just use standard page table and map in using 512k
pages.

For 8M pages, create a hugepd table and populate the two PGD
entries with it.

This function can only be used to create page tables at startup. Once
the regular SLAB allocation functions replace memblock functions,
this function cannot allocate new pages anymore. However it can still
update existing mappings with new protections.

hugepd_none() macro is moved into asm/hugetlb.h to be usable outside
of mm/hugetlbpage.c

early_pte_alloc_kernel() is made visible.

_PAGE_HUGE flag is now displayed by ptdump.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
[mpe: Change ptdump display to use "huge"]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/68325bcd3b6f93127f7810418a2352c3519066d6.1589866984.git.christophe.leroy@csgroup.eu
2020-05-26 22:22:22 +10:00
Christophe Leroy 1251288e64 powerpc/8xx: Remove now unused TLB miss functions
The code to setup linear and IMMR mapping via huge TLB entries is
not called anymore. Remove it.

Also remove the handling of removed code exits in the perf driver.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/75750d25849cb8e73ca519866bb892d7eb9649c0.1589866984.git.christophe.leroy@csgroup.eu
2020-05-26 22:22:22 +10:00
Christophe Leroy f76c8f6d25 powerpc/8xx: Add function to set pinned TLBs
Pinned TLBs cannot be modified when the MMU is enabled.

Create a function to rewrite the pinned TLB entries with MMU off.

To set pinned TLB, we have to turn off MMU, disable pinning,
do a TLB flush (Either with tlbie and tlbia) then reprogam
the TLB entries, enable pinning and turn on MMU.

If using tlbie, it cleared entries in both instruction and data
TLB regardless whether pinning is disabled or not.
If using tlbia, it clears all entries of the TLB which has
disabled pinning.

To make it easy, just clear all entries in both TLBs, and
reprogram them.

The function takes two arguments, the top of the memory to
consider and whether data is RO under _sinittext.
When DEBUG_PAGEALLOC is set, the top is the end of kernel rodata.
Otherwise, that's the top of physical RAM.

Everything below _sinittext is set RX, over _sinittext that's RW.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c17806014bb1c06513ad1e1d510faea31984b177.1589866984.git.christophe.leroy@csgroup.eu
2020-05-26 22:22:21 +10:00
Christophe Leroy 555904d07e powerpc/8xx: MM_SLICE is not needed anymore
As the 8xx now manages 512k pages in standard page tables,
it doesn't need CONFIG_PPC_MM_SLICES anymore.

Don't select it anymore and remove all related code.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/98e8ccd424476ea73cced2b89ba38eb2ed8144fb.1589866984.git.christophe.leroy@csgroup.eu
2020-05-26 22:22:21 +10:00
Christophe Leroy d4870b89ac powerpc/8xx: Only 8M pages are hugepte pages now
512k pages are now standard pages, so only 8M pages
are hugepte.

No more handling of normal page tables through hugepd allocation
and freeing, and hugepte helpers can also be simplified.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/2c6135d57fb76eebf70673fbac3dc9e740767879.1589866984.git.christophe.leroy@csgroup.eu
2020-05-26 22:22:21 +10:00
Christophe Leroy b250c8c08c powerpc/8xx: Manage 512k huge pages as standard pages.
At the time being, 512k huge pages are handled through hugepd page
tables. The PMD entry is flagged as a hugepd pointer and it
means that only 512k hugepages can be managed in that 4M block.
However, the hugepd table has the same size as a normal page
table, and 512k entries can therefore be nested with normal pages.

On the 8xx, TLB loading is performed by software and allthough the
page tables are organised to match the L1 and L2 level defined by
the HW, all TLB entries have both L1 and L2 independent entries.
It means that even if two TLB entries are associated with the same
PMD entry, they can be loaded with different values in L1 part.

The L1 entry contains the page size (PS field):
- 00 for 4k and 16 pages
- 01 for 512k pages
- 11 for 8M pages

By adding a flag for hugepages in the PTE (_PAGE_HUGE) and copying it
into the lower bit of PS, we can then manage 512k pages with normal
page tables:
- PMD entry has PS=11 for 8M pages
- PMD entry has PS=00 for other pages.

As a PMD entry covers 4M areas, a PMD will either point to a hugepd
table having a single entry to an 8M page, or the PMD will point to
a standard page table which will have either entries to 4k or 16k or
512k pages. For 512k pages, as the L1 entry will not know it is a
512k page before the PTE is read, there will be 128 entries in the
PTE as if it was 4k pages. But when loading the TLB, it will be
flagged as a 512k page.

Note that we can't use pmd_ptr() in asm/nohash/32/pgtable.h because
it is not defined yet.

In ITLB miss, we keep the possibility to opt it out as when kernel
text is pinned and no user hugepages are used, we can save several
instruction by not using r11.

In DTLB miss, that's just one instruction so it's not worth bothering
with it.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/002819e8e166bf81d24b24782d98de7c40905d8f.1589866984.git.christophe.leroy@csgroup.eu
2020-05-26 22:22:21 +10:00
Christophe Leroy a891c43b97 powerpc/8xx: Prepare handlers for _PAGE_HUGE for 512k pages.
Prepare ITLB handler to handle _PAGE_HUGE when CONFIG_HUGETLBFS
is enabled. This means that the L1 entry has to be kept in r11
until L2 entry is read, in order to insert _PAGE_HUGE into it.

Also move pgd_offset helpers before pte_update() as they
will be needed there in next patch.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/21fd1de8fba781bededa9474a5a9374aefb1f849.1589866984.git.christophe.leroy@csgroup.eu
2020-05-26 22:22:21 +10:00
Christophe Leroy d3efcd38c0 powerpc/8xx: Drop CONFIG_8xx_COPYBACK option
CONFIG_8xx_COPYBACK was there to help disabling copyback cache mode
for debuging hardware. But nobody will design new boards with 8xx now.

All 8xx platforms select it, so make it the default and remove
the option.

Also remove the Mx_RESETVAL values which are pretty useless and hide
the real value while reading code.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/bcc968cda075516eb76e2f25e09821f582c566b4.1589866984.git.christophe.leroy@csgroup.eu
2020-05-26 22:22:20 +10:00
Christophe Leroy b12c07a4bb powerpc/mm: Reduce hugepd size for 8M hugepages on 8xx
Commit 55c8fc3f49 ("powerpc/8xx: reintroduce 16K pages with HW
assistance") redefined pte_t as a struct of 4 pte_basic_t, because
in 16K pages mode there are four identical entries in the page table.
But hugepd entries for 8M pages require only one entry of size
pte_basic_t. So there is no point in creating a cache for 4 entries
page tables.

Calculate PTE_T_ORDER using the size of pte_basic_t instead of pte_t.

Define specific huge_pte helpers (set_huge_pte_at(), huge_pte_clear(),
huge_ptep_set_wrprotect()) to write the pte in a single entry instead
of using set_pte_at() which writes 4 identical entries in 16k pages
mode. Also make sure that __ptep_set_access_flags() properly handle
the huge_pte case.

Define set_pte_filter() inline otherwise GCC doesn't inline it anymore
because it is now used twice, and that gives a pretty suboptimal code
because of pte_t being a struct of 4 entries.

Those functions are also used for 512k pages which only require one
entry as well allthough replicating it four times was harmless as 512k
pages entries are spread every 128 bytes in the table.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/43050d1a0c2d6e1541cab9c1126fc80bc7015ebd.1589866984.git.christophe.leroy@csgroup.eu
2020-05-26 22:22:20 +10:00
Christophe Leroy 6ad41bfbc9 powerpc/mm: Create a dedicated pte_update() for 8xx
pte_update() is a bit special for the 8xx. At the time
being, that's an #ifdef inside the nohash/32 pte_update().

As we are going to make it even more special in the coming
patches, create a dedicated version for pte_update() for 8xx.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/a103be0099ac2360f8c44f4a1a63cc03713a1360.1589866984.git.christophe.leroy@csgroup.eu
2020-05-26 22:22:20 +10:00
Christophe Leroy 06f5252487 powerpc/mm: Standardise pte_update() prototype between PPC32 and PPC64
PPC64 takes 3 additional parameters compared to PPC32:
- mm
- address
- huge

These 3 parameters will be needed in order to perform different
action depending on the page size on the 8xx.

Make pte_update() prototype identical for PPC32 and PPC64.

This allows dropping an #ifdef in huge_ptep_get_and_clear().

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/38111acf6841047a8addde37c63e92d611ee38c2.1589866984.git.christophe.leroy@csgroup.eu
2020-05-26 22:22:20 +10:00
Christophe Leroy c7fa77016e powerpc/mm: Standardise __ptep_test_and_clear_young() params between PPC32 and PPC64
On PPC32, __ptep_test_and_clear_young() takes the mm->context.id

In preparation of standardising pte_update() params between PPC32 and
PPC64, __ptep_test_and_clear_young() need mm instead of mm->context.id

Replace context param by mm.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/0a65470e50a14373b7c2291184514aa982462255.1589866984.git.christophe.leroy@csgroup.eu
2020-05-26 22:22:20 +10:00
Christophe Leroy 1c1bf29488 powerpc/mm: Refactor pte_update() on book3s/32
When CONFIG_PTE_64BIT is set, pte_update() operates on
'unsigned long long'
When CONFIG_PTE_64BIT is not set, pte_update() operates on
'unsigned long'

In asm/page.h, we have pte_basic_t which is 'unsigned long long'
when CONFIG_PTE_64BIT is set and 'unsigned long' otherwise.

Refactor pte_update() using pte_basic_t.

While we are at it, drop the comment on 44x which is not applicable
to book3s version of pte_update().

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c78912bc8613fb249c3d80aeb1062796b5c49400.1589866984.git.christophe.leroy@csgroup.eu
2020-05-26 22:22:20 +10:00
Christophe Leroy 2db99aeb63 powerpc/mm: Refactor pte_update() on nohash/32
When CONFIG_PTE_64BIT is set, pte_update() operates on
'unsigned long long'
When CONFIG_PTE_64BIT is not set, pte_update() operates on
'unsigned long'

In asm/page.h, we have pte_basic_t which is 'unsigned long long'
when CONFIG_PTE_64BIT is set and 'unsigned long' otherwise.

Refactor pte_update() using pte_basic_t.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/590d67994a2847cd9fe088f7d974499e3a18b6ac.1589866984.git.christophe.leroy@csgroup.eu
2020-05-26 22:22:20 +10:00
Christophe Leroy fadaac67c9 powerpc/mm: PTE_ATOMIC_UPDATES is only for 40x
Only 40x still uses PTE_ATOMIC_UPDATES.
40x cannot not select CONFIG_PTE64_BIT.

Drop handling of PTE_ATOMIC_UPDATES:
- In nohash/64
- In nohash/32 for CONFIG_PTE_64BIT

Keep PTE_ATOMIC_UPDATES only for nohash/32 for !CONFIG_PTE_64BIT

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/d6f8e1f46583f1842de24581a68b0496feb15516.1589866984.git.christophe.leroy@csgroup.eu
2020-05-26 22:22:20 +10:00
Christophe Leroy 925ac141d1 powerpc/mm: Allocate static page tables for fixmap
Allocate static page tables for the fixmap area. This allows
setting mappings through page tables before memblock is ready.
That's needed to use early_ioremap() early and to use standard
page mappings with fixmap.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/4f4b1412d34de6801b8e925cb88fc69d056ff536.1589866984.git.christophe.leroy@csgroup.eu
2020-05-26 22:22:19 +10:00
Linus Torvalds c8347bbf19 powerpc fixes for 5.7 #5
A revert of a recent change to the PTE bits for 32-bit BookS, which broke swap.
 
 And a "fix" to disable STRICT_KERNEL_RWX for 64-bit in Kconfig, as it's causing
 crashes for some people.
 
 Thanks to:
   Christophe Leroy, Rui Salvaterra.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl7H1aMTHG1wZUBlbGxl
 cm1hbi5pZC5hdQAKCRBR6+o8yOGlgG3sD/9L72cw3cI2TcDTw+OEv2S2TyXrZZc3
 09WyUeerEv8mK8pk1eH8jVk6BqQK9bVZaq/3zYxLr5vnL1m+CZ4Qr8eGy5AcV3AC
 HXiGKhLbEh4btuoN3NwJ6fvEzA85dMTWsowGpgW8JgX1o7rtJmro0XW9EndhZGd2
 WWMBDsWo+RaKODej0c0Bz3TAOVgvxalE1SSLq63Q1sRoPhAZAJ0l8K3ED/EgC+tb
 v/VUi3fQNJngIzlMBc0sNOPp7NgcnDXoozAkW5c2Bp7YURbzeU0oXmsMAxQnyzee
 MP4MY1fAHI3CYdQ7QVRRDpQsTc84bAXVD+te+zhUJejaNm3mWLojRVieYT98eZXi
 iCi4Q0aSuAh3H8rxaYgi9ZemUkSKn+5pLu4kIAyMkBtnTB50E1YqUXVxfPcqk48N
 Y3Fkd6AyZ2/HyxS3bBVAubT/+GxK8HgQNGUBaF7iS50QKd6fl8EKjEBK1tVbYrTj
 xH7lXJpBnLCIj2ygZE1mBLxG8UTLGTfdnpxVNfVkNsLZK4tdsMaQ/llOzVA1uBOY
 twaRAhJkC0RHKHak1KNIQ8gh6HPjqwfg+P6SXHvT347YlTbsKgZei9wHtnZy4lsD
 CAnSImfgJMbzXCoULSoQbgXW0PloRZ1Zz1+WdfxmNjcNsRSqBNoaS1CaPKr7f8to
 a5JEWrUY1D49YQ==
 =yBu+
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.7-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:

 - a revert of a recent change to the PTE bits for 32-bit BookS, which
   broke swap.

 - a "fix" to disable STRICT_KERNEL_RWX for 64-bit in Kconfig, as it's
   causing crashes for some people.

Thanks to Christophe Leroy and Rui Salvaterra.

* tag 'powerpc-5.7-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/64s: Disable STRICT_KERNEL_RWX
  Revert "powerpc/32s: reorder Linux PTE bits to better match Hash PTE bits."
2020-05-22 08:51:39 -07:00
Christophe Leroy ec97d022f6 powerpc/kasan: Declare kasan_init_region() weak
In order to alloc sub-arches to alloc KASAN regions using optimised
methods (Huge pages on 8xx, BATs on BOOK3S, ...), declare
kasan_init_region() weak.

Also make kasan_init_shadow_page_tables() accessible from outside,
so that it can be called from the specific kasan_init_region()
functions if needed.

And populate remaining KASAN address space only once performed
the region mapping, to allow 8xx to allocate hugepd instead of
standard page tables for mapping via 8M hugepages.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/3c1ce419fa1b5a4171b92d7fb16455ca17e1b96d.1589866984.git.christophe.leroy@csgroup.eu
2020-05-20 23:41:03 +10:00
Christophe Leroy d2a91cef9b powerpc/kasan: Fix shadow pages allocation failure
Doing kasan pages allocation in MMU_init is too early, kernel doesn't
have access yet to the entire memory space and memblock_alloc() fails
when the kernel is a bit big.

Do it from kasan_init() instead.

Fixes: 2edb16efc8 ("powerpc/32: Add KASAN support")
Cc: stable@vger.kernel.org
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c24163ee5d5f8cdf52fefa45055ceb35435b8f15.1589866984.git.christophe.leroy@csgroup.eu
2020-05-20 23:41:02 +10:00
Christophe Leroy 3a66a24f60 powerpc/kasan: Fix issues by lowering KASAN_SHADOW_END
At the time being, KASAN_SHADOW_END is 0x100000000, which
is 0 in 32 bits representation.

This leads to a couple of issues:
- kasan_remap_early_shadow_ro() does nothing because the comparison
k_cur < k_end is always false.
- In ptdump, address comparison for markers display fails and the
marker's name is printed at the start of the KASAN area instead of
being printed at the end.

However, there is no need to shadow the KASAN shadow area itself,
so the KASAN shadow area can stop shadowing memory at the start
of itself.

With a PAGE_OFFSET set to 0xc0000000, KASAN shadow area is then going
from 0xf8000000 to 0xff000000.

Fixes: cbd18991e2 ("powerpc/mm: Fix an Oops in kasan_mmu_init()")
Cc: stable@vger.kernel.org
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/ae1a3c0d19a37410c209c3fc453634cfcc0ee318.1589866984.git.christophe.leroy@csgroup.eu
2020-05-20 23:41:01 +10:00
Qian Cai c2e929b18c powerpc/64s/pgtable: fix an undefined behaviour
Booting a power9 server with hash MMU could trigger an undefined
behaviour because pud_offset(p4d, 0) will do,

0 >> (PAGE_SHIFT:16 + PTE_INDEX_SIZE:8 + H_PMD_INDEX_SIZE:10)

Fix it by converting pud_index() and friends to static inline
functions.

UBSAN: shift-out-of-bounds in arch/powerpc/mm/ptdump/ptdump.c:282:15
shift exponent 34 is too large for 32-bit type 'int'
CPU: 6 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc4-next-20200303+ #13
Call Trace:
dump_stack+0xf4/0x164 (unreliable)
ubsan_epilogue+0x18/0x78
__ubsan_handle_shift_out_of_bounds+0x160/0x21c
walk_pagetables+0x2cc/0x700
walk_pud at arch/powerpc/mm/ptdump/ptdump.c:282
(inlined by) walk_pagetables at arch/powerpc/mm/ptdump/ptdump.c:311
ptdump_check_wx+0x8c/0xf0
mark_rodata_ro+0x48/0x80
kernel_init+0x74/0x194
ret_from_kernel_thread+0x5c/0x74

Suggested-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Link: https://lore.kernel.org/r/20200306044852.3236-1-cai@lca.pw
2020-05-20 23:39:56 +10:00
Nicholas Piggin 9384e552aa powerpc/64s: Fix early_init_mmu section mismatch
Christian reports:

  MODPOST vmlinux.o
  WARNING: modpost: vmlinux.o(.text.unlikely+0x1a0): Section mismatch in
  reference from the function .early_init_mmu() to the function
  .init.text:.radix__early_init_mmu()
  The function .early_init_mmu() references
  the function __init .radix__early_init_mmu().
  This is often because .early_init_mmu lacks a __init
  annotation or the annotation of .radix__early_init_mmu is wrong.

  WARNING: modpost: vmlinux.o(.text.unlikely+0x1ac): Section mismatch in
  reference from the function .early_init_mmu() to the function
  .init.text:.hash__early_init_mmu()
  The function .early_init_mmu() references
  the function __init .hash__early_init_mmu().
  This is often because .early_init_mmu lacks a __init
  annotation or the annotation of .hash__early_init_mmu is wrong.

The compiler is uninlining early_init_mmu and not putting it in an init
section because there is no annotation. Add it.

Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Link: https://lore.kernel.org/r/20200429070247.1678172-1-npiggin@gmail.com
2020-05-20 23:39:56 +10:00
Michael Ellerman 787a2b682d Merge branch 'topic/ppc-kvm' into next
Merge our topic branch shared with the kvm-ppc tree.

This brings in one commit that touches the XIVE interrupt controller
logic across core and KVM code.
2020-05-20 23:38:13 +10:00
Michael Ellerman 217ba7dcce Merge branch 'topic/uaccess-ppc' into next
Merge our uaccess-ppc topic branch. It is based on the uaccess topic
branch that we're sharing with Viro.

This includes the addition of user_[read|write]_access_begin(), as
well as some powerpc specific changes to our uaccess routines that
would conflict badly if merged separately.
2020-05-20 23:37:33 +10:00
Christophe Leroy 40bb0e9042 Revert "powerpc/32s: reorder Linux PTE bits to better match Hash PTE bits."
This reverts commit 697ece78f8.

The implementation of SWAP on powerpc requires page protection
bits to not be one of the least significant PTE bits.

Until the SWAP implementation is changed and this requirement voids,
we have to keep at least _PAGE_RW outside of the 3 last bits.

For now, revert to previous PTE bits order. A further rework
may come later.

Fixes: 697ece78f8 ("powerpc/32s: reorder Linux PTE bits to better match Hash PTE bits.")
Reported-by: Rui Salvaterra <rsalvaterra@gmail.com>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/b34706f8de87f84d135abb5f3ede6b6f16fb1f41.1589969799.git.christophe.leroy@csgroup.eu
2020-05-20 22:35:52 +10:00
Paolo Bonzini 9d5272f5e3 Merge tag 'noinstr-x86-kvm-2020-05-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into HEAD 2020-05-20 03:40:09 -04:00
Ravi Bangoria 29da4f91c0 powerpc/watchpoint: Don't allow concurrent perf and ptrace events
With Book3s DAWR, ptrace and perf watchpoints on powerpc behaves
differently. Ptrace watchpoint works in one-shot mode and generates
signal before executing instruction. It's ptrace user's job to
single-step the instruction and re-enable the watchpoint. OTOH, in
case of perf watchpoint, kernel emulates/single-steps the instruction
and then generates event. If perf and ptrace creates two events with
same or overlapping address ranges, it's ambiguous to decide who
should single-step the instruction. Because of this issue, don't
allow perf and ptrace watchpoint at the same time if their address
range overlaps.

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Michael Neuling <mikey@neuling.org>
Link: https://lore.kernel.org/r/20200514111741.97993-15-ravi.bangoria@linux.ibm.com
2020-05-19 00:14:45 +10:00
Ravi Bangoria 74c6881019 powerpc/watchpoint: Prepare handler to handle more than one watchpoint
Currently we assume that we have only one watchpoint supported by hw.
Get rid of that assumption and use dynamic loop instead. This should
make supporting more watchpoints very easy.

With more than one watchpoint, exception handler needs to know which
DAWR caused the exception, and hw currently does not provide it. So
we need sw logic for the same. To figure out which DAWR caused the
exception, check all different combinations of user specified range,
DAWR address range, actual access range and DAWRX constrains. For ex,
if user specified range and actual access range overlaps but DAWRX is
configured for readonly watchpoint and the instruction is store, this
DAWR must not have caused exception.

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Reviewed-by: Michael Neuling <mikey@neuling.org>
[mpe: Unsplit multi-line printk() strings, fix some sparse warnings]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200514111741.97993-14-ravi.bangoria@linux.ibm.com
2020-05-19 00:14:37 +10:00
Ravi Bangoria e68ef121c1 powerpc/watchpoint: Use builtin ALIGN*() macros
Currently we calculate hw aligned start and end addresses manually.
Replace them with builtin ALIGN_DOWN() and ALIGN() macros.

So far end_addr was inclusive but this patch makes it exclusive (by
avoiding -1) for better readability.

Suggested-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Michael Neuling <mikey@neuling.org>
Link: https://lore.kernel.org/r/20200514111741.97993-13-ravi.bangoria@linux.ibm.com
2020-05-19 00:11:05 +10:00
Ravi Bangoria 303e6a9ddc powerpc/watchpoint: Convert thread_struct->hw_brk to an array
So far powerpc hw supported only one watchpoint. But Power10 is
introducing 2nd DAWR. Convert thread_struct->hw_brk into an array.

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Michael Neuling <mikey@neuling.org>
Link: https://lore.kernel.org/r/20200514111741.97993-10-ravi.bangoria@linux.ibm.com
2020-05-19 00:11:05 +10:00
Ravi Bangoria c291913273 powerpc/watchpoint: Get watchpoint count dynamically while disabling them
Instead of disabling only one watchpoint, get num of available
watchpoints dynamically and disable all of them.

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Michael Neuling <mikey@neuling.org>
Link: https://lore.kernel.org/r/20200514111741.97993-8-ravi.bangoria@linux.ibm.com
2020-05-19 00:11:04 +10:00
Ravi Bangoria 4a8a9379f2 powerpc/watchpoint: Provide DAWR number to __set_breakpoint
Introduce new parameter 'nr' to __set_breakpoint() which indicates
which DAWR should be programed. Also convert current_brk variable
to an array.

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Michael Neuling <mikey@neuling.org>
Link: https://lore.kernel.org/r/20200514111741.97993-7-ravi.bangoria@linux.ibm.com
2020-05-19 00:11:04 +10:00
Ravi Bangoria a18b834625 powerpc/watchpoint: Provide DAWR number to set_dawr
Introduce new parameter 'nr' to set_dawr() which indicates which DAWR
should be programed.

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Michael Neuling <mikey@neuling.org>
Link: https://lore.kernel.org/r/20200514111741.97993-6-ravi.bangoria@linux.ibm.com
2020-05-19 00:11:04 +10:00
Ravi Bangoria a6ba44e879 powerpc/watchpoint: Introduce function to get nr watchpoints dynamically
So far we had only one watchpoint, so we have hardcoded HBP_NUM to 1.
But Power10 is introducing 2nd DAWR and thus kernel should be able to
dynamically find actual number of watchpoints supported by hw it's
running on. Introduce function for the same. Also convert HBP_NUM macro
to HBP_NUM_MAX, which will now represent maximum number of watchpoints
supported by Powerpc.

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Michael Neuling <mikey@neuling.org>
Link: https://lore.kernel.org/r/20200514111741.97993-4-ravi.bangoria@linux.ibm.com
2020-05-19 00:11:04 +10:00
Ravi Bangoria 4a4ec2289a powerpc/watchpoint: Add SPRN macros for second DAWR
Power10 is introducing second DAWR. Add SPRN_ macros for the same.

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Michael Neuling <mikey@neuling.org>
Link: https://lore.kernel.org/r/20200514111741.97993-3-ravi.bangoria@linux.ibm.com
2020-05-19 00:11:04 +10:00
Ravi Bangoria 09f82b063a powerpc/watchpoint: Rename current DAWR macros
Power10 is introducing second DAWR. Use real register names from ISA
for current macros:
  s/SPRN_DAWR/SPRN_DAWR0/
  s/SPRN_DAWRX/SPRN_DAWRX0/

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Michael Neuling <mikey@neuling.org>
Link: https://lore.kernel.org/r/20200514111741.97993-2-ravi.bangoria@linux.ibm.com
2020-05-19 00:11:03 +10:00
Jordan Niethe 50b80a12e4 powerpc sstep: Add support for prefixed load/stores
This adds emulation support for the following prefixed integer
load/stores:
  * Prefixed Load Byte and Zero (plbz)
  * Prefixed Load Halfword and Zero (plhz)
  * Prefixed Load Halfword Algebraic (plha)
  * Prefixed Load Word and Zero (plwz)
  * Prefixed Load Word Algebraic (plwa)
  * Prefixed Load Doubleword (pld)
  * Prefixed Store Byte (pstb)
  * Prefixed Store Halfword (psth)
  * Prefixed Store Word (pstw)
  * Prefixed Store Doubleword (pstd)
  * Prefixed Load Quadword (plq)
  * Prefixed Store Quadword (pstq)

the follow prefixed floating-point load/stores:
  * Prefixed Load Floating-Point Single (plfs)
  * Prefixed Load Floating-Point Double (plfd)
  * Prefixed Store Floating-Point Single (pstfs)
  * Prefixed Store Floating-Point Double (pstfd)

and for the following prefixed VSX load/stores:
  * Prefixed Load VSX Scalar Doubleword (plxsd)
  * Prefixed Load VSX Scalar Single-Precision (plxssp)
  * Prefixed Load VSX Vector [0|1]  (plxv, plxv0, plxv1)
  * Prefixed Store VSX Scalar Doubleword (pstxsd)
  * Prefixed Store VSX Scalar Single-Precision (pstxssp)
  * Prefixed Store VSX Vector [0|1] (pstxv, pstxv0, pstxv1)

Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Reviewed-by: Balamuruhan S <bala24@linux.ibm.com>
[mpe: Use CONFIG_PPC64 not __powerpc64__, use get_op()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200506034050.24806-30-jniethe5@gmail.com
2020-05-19 00:11:03 +10:00
Jordan Niethe 650b55b707 powerpc: Add prefixed instructions to instruction data type
For powerpc64, redefine the ppc_inst type so both word and prefixed
instructions can be represented. On powerpc32 the type will remain the
same. Update places which had assumed instructions to be 4 bytes long.

Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
[mpe: Rework the get_user_inst() macros to be parameterised, and don't
      assign to the dest if an error occurred. Use CONFIG_PPC64 not
      __powerpc64__ in a few places. Address other comments from
      Christophe. Fix some sparse complaints.]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200506034050.24806-24-jniethe5@gmail.com
2020-05-19 00:10:39 +10:00
Jordan Niethe b691505ef9 powerpc: Define new SRR1 bits for a ISA v3.1
Add the BOUNDARY SRR1 bit definition for when the cause of an
alignment exception is a prefixed instruction that crosses a 64-byte
boundary. Add the PREFIXED SRR1 bit definition for exceptions caused
by prefixed instructions.

Bit 35 of SRR1 is called SRR1_ISI_N_OR_G. This name comes from it
being used to indicate that an ISI was due to the access being no-exec
or guarded. ISA v3.1 adds another purpose. It is also set if there is
an access in a cache-inhibited location for prefixed instruction.
Rename from SRR1_ISI_N_OR_G to SRR1_ISI_N_G_OR_CIP.

Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
Link: https://lore.kernel.org/r/20200506034050.24806-23-jniethe5@gmail.com
2020-05-19 00:10:38 +10:00
Alistair Popple 2aa6195e43 powerpc: Enable Prefixed Instructions
Prefix instructions have their own FSCR bit which needs to enabled via
a CPU feature. The kernel will save the FSCR for problem state but it
needs to be enabled initially.

If prefixed instructions are made unavailable by the [H]FSCR, attempting
to use them will cause a facility unavailable exception. Add "PREFIX" to
the facility_strings[].

Currently there are no prefixed instructions that are actually emulated
by emulate_instruction() within facility_unavailable_exception().
However, when caused by a prefixed instructions the SRR1 PREFIXED bit is
set. Prepare for dealing with emulated prefixed instructions by checking
for this bit.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Link: https://lore.kernel.org/r/20200506034050.24806-22-jniethe5@gmail.com
2020-05-19 00:10:38 +10:00
Jordan Niethe 622cf6f436 powerpc: Introduce a function for reporting instruction length
Currently all instructions have the same length, but in preparation for
prefixed instructions introduce a function for returning instruction
length.

Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
Link: https://lore.kernel.org/r/20200506034050.24806-18-jniethe5@gmail.com
2020-05-19 00:10:38 +10:00
Jordan Niethe 5249385ad7 powerpc: Define and use get_user_instr() et. al.
Define specialised get_user_instr(), __get_user_instr() and
__get_user_instr_inatomic() macros for reading instructions from user
and/or kernel space.

Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
[mpe: Squash in addition of get_user_instr() & __user annotations]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200506034050.24806-17-jniethe5@gmail.com
2020-05-19 00:10:37 +10:00
Jordan Niethe 95b980a00d powerpc: Add a probe_kernel_read_inst() function
Introduce a probe_kernel_read_inst() function to use in cases where
probe_kernel_read() is used for getting an instruction. This will be
more useful for prefixed instructions.

Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
[mpe: Don't write to *inst on error]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200506034050.24806-15-jniethe5@gmail.com
2020-05-19 00:10:37 +10:00
Jordan Niethe 7ba68b2172 powerpc: Add a probe_user_read_inst() function
Introduce a probe_user_read_inst() function to use in cases where
probe_user_read() is used for getting an instruction. This will be
more useful for prefixed instructions.

Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
[mpe: Don't write to *inst on error, fold in __user annotations]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200506034050.24806-14-jniethe5@gmail.com
2020-05-19 00:10:37 +10:00
Jordan Niethe f8faaffaa7 powerpc: Use a function for reading instructions
Prefixed instructions will mean there are instructions of different
length. As a result dereferencing a pointer to an instruction will not
necessarily give the desired result. Introduce a function for reading
instructions from memory into the instruction data type.

Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
Link: https://lore.kernel.org/r/20200506034050.24806-13-jniethe5@gmail.com
2020-05-19 00:10:37 +10:00
Jordan Niethe 94afd069d9 powerpc: Use a datatype for instructions
Currently unsigned ints are used to represent instructions on powerpc.
This has worked well as instructions have always been 4 byte words.

However, ISA v3.1 introduces some changes to instructions that mean
this scheme will no longer work as well. This change is Prefixed
Instructions. A prefixed instruction is made up of a word prefix
followed by a word suffix to make an 8 byte double word instruction.
No matter the endianness of the system the prefix always comes first.
Prefixed instructions are only planned for powerpc64.

Introduce a ppc_inst type to represent both prefixed and word
instructions on powerpc64 while keeping it possible to exclusively
have word instructions on powerpc32.

Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
[mpe: Fix compile error in emulate_spe()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200506034050.24806-12-jniethe5@gmail.com
2020-05-19 00:10:37 +10:00
Jordan Niethe 217862d9b9 powerpc: Introduce functions for instruction equality
In preparation for an instruction data type that can not be directly
used with the '==' operator use functions for checking equality.

Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Balamuruhan S <bala24@linux.ibm.com>
Link: https://lore.kernel.org/r/20200506034050.24806-11-jniethe5@gmail.com
2020-05-19 00:10:37 +10:00
Jordan Niethe aabd2233b6 powerpc: Use a function for byte swapping instructions
Use a function for byte swapping instructions in preparation of a more
complicated instruction type.

Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Balamuruhan S <bala24@linux.ibm.com>
Link: https://lore.kernel.org/r/20200506034050.24806-10-jniethe5@gmail.com
2020-05-19 00:10:37 +10:00
Jordan Niethe 8094892d1a powerpc: Use a function for getting the instruction op code
In preparation for using a data type for instructions that can not be
directly used with the '>>' operator use a function for getting the op
code of an instruction.

Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200506034050.24806-9-jniethe5@gmail.com
2020-05-19 00:10:37 +10:00
Jordan Niethe 777e26f0ed powerpc: Use an accessor for instructions
In preparation for introducing a more complicated instruction type to
accommodate prefixed instructions use an accessor for getting an
instruction as a u32.

Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200506034050.24806-8-jniethe5@gmail.com
2020-05-19 00:10:36 +10:00
Jordan Niethe 7534625128 powerpc: Use a macro for creating instructions from u32s
In preparation for instructions having a more complex data type start
using a macro, ppc_inst(), for making an instruction out of a u32.  A
macro is used so that instructions can be used as initializer elements.
Currently this does nothing, but it will allow for creating a data type
that can represent prefixed instructions.

Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
[mpe: Change include guard to _ASM_POWERPC_INST_H]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
Link: https://lore.kernel.org/r/20200506034050.24806-7-jniethe5@gmail.com
2020-05-19 00:10:36 +10:00
Jordan Niethe 7c95d8893f powerpc: Change calling convention for create_branch() et. al.
create_branch(), create_cond_branch() and translate_branch() return the
instruction that they create, or return 0 to signal an error. Separate
these concerns in preparation for an instruction type that is not just
an unsigned int.  Fill the created instruction to a pointer passed as
the first parameter to the function and use a non-zero return value to
signify an error.

Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
Link: https://lore.kernel.org/r/20200506034050.24806-6-jniethe5@gmail.com
2020-05-19 00:10:36 +10:00
Nicholas Piggin f2d7f62e4a powerpc: Implement ftrace_enabled() helpers
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Link: https://lore.kernel.org/r/20200508043408.886394-13-npiggin@gmail.com
2020-05-19 00:10:34 +10:00
Nicholas Piggin 7368b38b21 powerpc/pseries/ras: Avoid calling rtas_token() in NMI paths
In the interest of reducing code and possible failures in the
machine check and system reset paths, grab the "ibm,nmi-interlock"
token at init time.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200508043408.886394-6-npiggin@gmail.com
2020-05-18 21:58:45 +10:00
Linus Torvalds befc42e5dd powerpc fixes for 5.7 #4
A fix for unrecoverable SLB faults in the interrupt exit path, introduced by the
 recent rewrite of interrupt exit in C.
 
 Four fixes for our KUAP (Kernel Userspace Access Prevention) support on 64-bit.
 These are all fairly minor with the exception of the change to evaluate the
 get/put_user() arguments before we enable user access, which reduces the amount
 of code we run with user access enabled.
 
 A fix for our secure boot IMA rules, if enforcement of module signatures is
 enabled at runtime rather than build time.
 
 A fix to our 32-bit VDSO clock_getres() which wasn't falling back to the syscall
 for unknown clocks.
 
 A build fix for CONFIG_PPC_KUAP_DEBUG on 32-bit BookS, and another for 40x.
 
 Thanks to:
   Christophe Leroy, Hugh Dickins, Nicholas Piggin, Aurelien Jarno, Mimi Zohar,
   Nayna Jain.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl6/z6ATHG1wZUBlbGxl
 cm1hbi5pZC5hdQAKCRBR6+o8yOGlgKM8D/9ibff532jpCewfwNNXztn3SxRpboQJ
 MgSpnqBlhjoezu84oPwGTPSBYiV/m4hVs5+vUVELKN1cvSmGFkRMoxI80MD3sPEj
 DWoyKmRpG2Uz6tSbU1H/1fA2MEHmG+saYE1g8k0S2m03mRluD4hcSW7jgHiTbFvc
 dC2i/fpf9SloHjEAu/3Wq/j3/6PI8pEoOvZozeLdusipLQgH7zODcW7noV1vF/pK
 BG4me8EVd7QImRRsJrPSlq/292wlSG6G+AwIlEYQbEv4rMqVNzc1WMPlBndHC27B
 ROT+84Ol1eX+jmbq+NTGpZFoWoozmZfTyEhVIQy8IMEiJjhsrh8nBLPP/KiDXjIA
 yujFgP4xfeLbYU+jydBrxun+Kd/DQxbnAIgJ3rJZdbGG2anqt9oWK+88+Y7rnnAY
 wjong1gQVPGr+y0Hj/6ahnXEs2En7W6dLw/EOHDRzazusFRj50a4JGAB1y2WAKHd
 yfa/6KWTUOvMvkkldPiIV1Uz+LFiUUGcHcnx67Fi19lj15W1JzTiHXqs15Ka0n8E
 X2n247d5rD55cS7owqfwBtso5zMlFITuPvr0lMv6hTP0bs0OuTiWCh+od3UeGKyy
 Rxa7piq3/8yB16cqEgbnQmcP0rkUD0WNGQamc2/02buD+GLV6DbgSSSDixUKyVJi
 0/vi7nqaKVbIsg==
 =XkA+
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.7-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:

 - A fix for unrecoverable SLB faults in the interrupt exit path,
   introduced by the recent rewrite of interrupt exit in C.

 - Four fixes for our KUAP (Kernel Userspace Access Prevention) support
   on 64-bit. These are all fairly minor with the exception of the
   change to evaluate the get/put_user() arguments before we enable user
   access, which reduces the amount of code we run with user access
   enabled.

 - A fix for our secure boot IMA rules, if enforcement of module
   signatures is enabled at runtime rather than build time.

 - A fix to our 32-bit VDSO clock_getres() which wasn't falling back to
   the syscall for unknown clocks.

 - A build fix for CONFIG_PPC_KUAP_DEBUG on 32-bit BookS, and another
   for 40x.

Thanks to: Christophe Leroy, Hugh Dickins, Nicholas Piggin, Aurelien
Jarno, Mimi Zohar, Nayna Jain.

* tag 'powerpc-5.7-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/40x: Make more space for system call exception
  powerpc/vdso32: Fallback on getres syscall when clock is unknown
  powerpc/32s: Fix build failure with CONFIG_PPC_KUAP_DEBUG
  powerpc/ima: Fix secure boot rules in ima arch policy
  powerpc/64s/kuap: Restore AMR in fast_interrupt_return
  powerpc/64s/kuap: Restore AMR in system reset exception
  powerpc/64/kuap: Move kuap checks out of MSR[RI]=0 regions of exit code
  powerpc/64s: Fix unrecoverable SLB crashes due to preemption check
  powerpc/uaccess: Evaluate macro arguments once, before user access is allowed
2020-05-16 13:34:45 -07:00
Michael Ellerman 24ac99e97f powerpc: Drop unneeded cast in task_pt_regs()
There's no need to cast in task_pt_regs() as tsk->thread.regs should
already be a struct pt_regs. If someone's using task_pt_regs() on
something that's not a task but happens to have a thread.regs then
we'll deal with them later.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200428123152.73566-1-mpe@ellerman.id.au
2020-05-15 11:58:55 +10:00
Michael Ellerman 7ffa8b7dc1 powerpc/64: Don't initialise init_task->thread.regs
Aneesh increased the size of struct pt_regs by 16 bytes and started
seeing this WARN_ON:

  smp: Bringing up secondary CPUs ...
  ------------[ cut here ]------------
  WARNING: CPU: 0 PID: 0 at arch/powerpc/kernel/process.c:455 giveup_all+0xb4/0x110
  Modules linked in:
  CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.7.0-rc2-gcc-8.2.0-1.g8f6a41f-default+ #318
  NIP:  c00000000001a2b4 LR: c00000000001a29c CTR: c0000000031d0000
  REGS: c0000000026d3980 TRAP: 0700   Not tainted  (5.7.0-rc2-gcc-8.2.0-1.g8f6a41f-default+)
  MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 48048224  XER: 00000000
  CFAR: c000000000019cc8 IRQMASK: 1
  GPR00: c00000000001a264 c0000000026d3c20 c0000000026d7200 800000000280b033
  GPR04: 0000000000000001 0000000000000000 0000000000000077 30206d7372203164
  GPR08: 0000000000002000 0000000002002000 800000000280b033 3230303030303030
  GPR12: 0000000000008800 c0000000031d0000 0000000000800050 0000000002000066
  GPR16: 000000000309a1a0 000000000309a4b0 000000000309a2d8 000000000309a890
  GPR20: 00000000030d0098 c00000000264da40 00000000fd620000 c0000000ff798080
  GPR24: c00000000264edf0 c0000001007469f0 00000000fd620000 c0000000020e5e90
  GPR28: c00000000264edf0 c00000000264d200 000000001db60000 c00000000264d200
  NIP [c00000000001a2b4] giveup_all+0xb4/0x110
  LR [c00000000001a29c] giveup_all+0x9c/0x110
  Call Trace:
  [c0000000026d3c20] [c00000000001a264] giveup_all+0x64/0x110 (unreliable)
  [c0000000026d3c90] [c00000000001ae34] __switch_to+0x104/0x480
  [c0000000026d3cf0] [c000000000e0b8a0] __schedule+0x320/0x970
  [c0000000026d3dd0] [c000000000e0c518] schedule_idle+0x38/0x70
  [c0000000026d3df0] [c00000000019c7c8] do_idle+0x248/0x3f0
  [c0000000026d3e70] [c00000000019cbb8] cpu_startup_entry+0x38/0x40
  [c0000000026d3ea0] [c000000000011bb0] rest_init+0xe0/0xf8
  [c0000000026d3ed0] [c000000002004820] start_kernel+0x990/0x9e0
  [c0000000026d3f90] [c00000000000c49c] start_here_common+0x1c/0x400

Which was unexpected. The warning is checking the thread.regs->msr
value of the task we are switching from:

  usermsr = tsk->thread.regs->msr;
  ...
  WARN_ON((usermsr & MSR_VSX) && !((usermsr & MSR_FP) && (usermsr & MSR_VEC)));

ie. if MSR_VSX is set then both of MSR_FP and MSR_VEC are also set.

Dumping tsk->thread.regs->msr we see that it's: 0x1db60000

Which is not a normal looking MSR, in fact the only valid bit is
MSR_VSX, all the other bits are reserved in the current definition of
the MSR.

We can see from the oops that it was swapper/0 that we were switching
from when we hit the warning, ie. init_task. So its thread.regs points
to the base (high addresses) in init_stack.

Dumping the content of init_task->thread.regs, with the members of
pt_regs annotated (the 16 bytes larger version), we see:

  0000000000000000 c000000002780080    gpr[0]     gpr[1]
  0000000000000000 c000000002666008    gpr[2]     gpr[3]
  c0000000026d3ed0 0000000000000078    gpr[4]     gpr[5]
  c000000000011b68 c000000002780080    gpr[6]     gpr[7]
  0000000000000000 0000000000000000    gpr[8]     gpr[9]
  c0000000026d3f90 0000800000002200    gpr[10]    gpr[11]
  c000000002004820 c0000000026d7200    gpr[12]    gpr[13]
  000000001db60000 c0000000010aabe8    gpr[14]    gpr[15]
  c0000000010aabe8 c0000000010aabe8    gpr[16]    gpr[17]
  c00000000294d598 0000000000000000    gpr[18]    gpr[19]
  0000000000000000 0000000000001ff8    gpr[20]    gpr[21]
  0000000000000000 c00000000206d608    gpr[22]    gpr[23]
  c00000000278e0cc 0000000000000000    gpr[24]    gpr[25]
  000000002fff0000 c000000000000000    gpr[26]    gpr[27]
  0000000002000000 0000000000000028    gpr[28]    gpr[29]
  000000001db60000 0000000004750000    gpr[30]    gpr[31]
  0000000002000000 000000001db60000    nip        msr
  0000000000000000 0000000000000000    orig_r3    ctr
  c00000000000c49c 0000000000000000    link       xer
  0000000000000000 0000000000000000    ccr        softe
  0000000000000000 0000000000000000    trap       dar
  0000000000000000 0000000000000000    dsisr      result
  0000000000000000 0000000000000000    ppr        kuap
  0000000000000000 0000000000000000    pad[2]     pad[3]

This looks suspiciously like stack frames, not a pt_regs. If we look
closely we can see return addresses from the stack trace above,
c000000002004820 (start_kernel) and c00000000000c49c (start_here_common).

init_task->thread.regs is setup at build time in processor.h:

  #define INIT_THREAD  { \
  	.ksp = INIT_SP, \
  	.regs = (struct pt_regs *)INIT_SP - 1, /* XXX bogus, I think */ \

The early boot code where we setup the initial stack is:

  LOAD_REG_ADDR(r3,init_thread_union)

  /* set up a stack pointer */
  LOAD_REG_IMMEDIATE(r1,THREAD_SIZE)
  add	r1,r3,r1
  li	r0,0
  stdu	r0,-STACK_FRAME_OVERHEAD(r1)

Which creates a stack frame of size 112 bytes (STACK_FRAME_OVERHEAD).
Which is far too small to contain a pt_regs.

So the result is init_task->thread.regs is pointing at some stack
frames on the init stack, not at a pt_regs.

We have gotten away with this for so long because with pt_regs at its
current size the MSR happens to point into the first frame, at a
location that is not written to by the early asm. With the 16 byte
expansion the MSR falls into the second frame, which is used by the
compiler, and collides with a saved register that tends to be
non-zero.

As far as I can see this has been wrong since the original merge of
64-bit ppc support, back in 2002.

Conceptually swapper should have no regs, it never entered from
userspace, and in fact that's what we do on 32-bit. It's also
presumably what the "bogus" comment is referring to.

So I think the right fix is to just not-initialise regs at all. I'm
slightly worried this will break some code that isn't prepared for a
NULL regs, but we'll have to see.

Remove the comment in head_64.S which refers to us setting up the
regs (even though we never did), and is otherwise not really accurate
any more.

Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200428123130.73078-1-mpe@ellerman.id.au
2020-05-15 11:58:54 +10:00
Nicholas Piggin 4e0e45b07d powerpc: Use trap metadata to prevent double restart rather than zeroing trap
It's not very nice to zero trap for this, because then system calls no
longer have trap_is_syscall(regs) invariant, and we can't distinguish
between sc and scv system calls (in a later patch).

Take one last unused bit from the low bits of the pt_regs.trap word
for this instead. There is not a really good reason why it should be
in trap as opposed to another field, but trap has some concept of
flags and it exists. Ideally I think we would move trap to 2-byte
field and have 2 more bytes available independently.

Add a selftests case for this, which can be seen to fail if
trap_norestart() is changed to return false.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Make them static inlines]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200507121332.2233629-4-mpe@ellerman.id.au
2020-05-15 11:58:54 +10:00
Nicholas Piggin 912237ea16 powerpc: trap_is_syscall() helper to hide syscall trap number
A new system call interrupt will be added with a new trap number.
Hide the explicit 0xc00 test behind an accessor to reduce churn
in callers.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Make it a static inline]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200507121332.2233629-3-mpe@ellerman.id.au
2020-05-15 11:58:54 +10:00
Nicholas Piggin db30144b5c powerpc: Use set_trap() and avoid open-coding trap masking
The pt_regs.trap field keeps 4 low bits for some metadata about the
trap or how it was handled, which is masked off in order to test the
architectural trap number.

Add a set_trap() accessor to set this, equivalent to TRAP() for
returning it. This is actually not quite the equivalent of TRAP()
because it always clears the low bits, which may be harmless if
it can only be updated via ptrace syscall, but it seems dangerous.

In fact settting TRAP from ptrace doesn't seem like a great idea
so maybe it's better deleted.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Make it a static inline rather than a shouty macro]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200507121332.2233629-2-mpe@ellerman.id.au
2020-05-15 11:58:54 +10:00
Nicholas Piggin feb9df3462 powerpc/64s: Always has full regs, so remove remnant checks
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200507121332.2233629-1-mpe@ellerman.id.au
2020-05-15 11:58:53 +10:00
Davidlohr Bueso da4ad88cab kvm: Replace vcpu->swait with rcuwait
The use of any sort of waitqueue (simple or regular) for
wait/waking vcpus has always been an overkill and semantically
wrong. Because this is per-vcpu (which is blocked) there is
only ever a single waiting vcpu, thus no need for any sort of
queue.

As such, make use of the rcuwait primitive, with the following
considerations:

  - rcuwait already provides the proper barriers that serialize
  concurrent waiter and waker.

  - Task wakeup is done in rcu read critical region, with a
  stable task pointer.

  - Because there is no concurrency among waiters, we need
  not worry about rcuwait_wait_event() calls corrupting
  the wait->task. As a consequence, this saves the locking
  done in swait when modifying the queue. This also applies
  to per-vcore wait for powerpc kvm-hv.

The x86 tscdeadline_latency test mentioned in 8577370fb0
("KVM: Use simple waitqueue for vcpu->wq") shows that, on avg,
latency is reduced by around 15-20% with this change.

Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: kvmarm@lists.cs.columbia.edu
Cc: linux-mips@vger.kernel.org
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Message-Id: <20200424054837.5138-6-dave@stgolabs.net>
[Avoid extra logic changes. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13 12:14:56 -04:00
Willy Tarreau 7fd3463188 floppy: use symbolic register names in the powerpc port
Now we can use FD_STATUS and FD_DATA instead of 4 or 5, let's do
this, and also use STATUS_DMA and STATUS_READY for the status bits.

Link: https://lore.kernel.org/r/20200331094054.24441-6-w@1wt.eu
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Denis Efremov <efremov@linux.com>
2020-05-12 19:34:53 +03:00
Willy Tarreau e72e8bf1c9 floppy: split the base port from the register in I/O accesses
Currently we have architecture-specific fd_inb() and fd_outb() functions
or macros, taking just a port which is in fact made of a base address and
a register. The base address is FDC-specific and derived from the local or
global "fdc" variable through the FD_IOPORT macro used in the base address
calculation.

This change splits this by explicitly passing the FDC's base address and
the register separately to fd_outb() and fd_inb(). It affects the
following archs:
  - x86, alpha, mips, powerpc, parisc, arm, m68k:
    simple remap of port -> base+reg

  - sparc32: use of reg only, since the base address was already masked
    out and the FDC controller is known from a static struct.

  - sparc64: like x86 for PCI, like sparc32 for 82077

Some archs use inline functions and others macros. This was not
unified in order to minimize the number of changes to review. For the
same reason checkpatch still spews a few warnings about things that
were already there before.

The parisc still uses hard-coded register values and could be cleaned up
by taking the register definitions.

The sparc per-controller inb/outb functions could further be refined
to explicitly take an FDC register instead of a port in argument but it
was not needed yet and may be cleaned later.

Link: https://lore.kernel.org/r/20200331094054.24441-2-w@1wt.eu
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Ian Molton <spyro@f2s.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Helge Deller <deller@gmx.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: x86@kernel.org
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Denis Efremov <efremov@linux.com>
2020-05-12 19:34:52 +03:00
Christophe Leroy 4cdb2da654 powerpc: Remove _ALIGN_UP(), _ALIGN_DOWN() and _ALIGN()
These three powerpc macros have been replaced by
equivalent generic macros and are not used anymore.

Remove them.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/bb0a6081f7b95ee64ca20f92483e5b9661cbacb2.1587407777.git.christophe.leroy@c-s.fr
2020-05-11 23:15:16 +10:00
Christophe Leroy d3f3d3bf76 powerpc: Replace _ALIGN() by ALIGN()
_ALIGN() is specific to powerpc
ALIGN() is generic and does the same

Replace _ALIGN() by ALIGN()

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Link: https://lore.kernel.org/r/4006d9c8e69f8eaccee954899f6b5fb76240d00b.1587407777.git.christophe.leroy@c-s.fr
2020-05-11 23:15:16 +10:00
Christophe Leroy b711531641 powerpc: Replace _ALIGN_UP() by ALIGN()
_ALIGN_UP() is specific to powerpc
ALIGN() is generic and does the same

Replace _ALIGN_UP() by ALIGN()

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Link: https://lore.kernel.org/r/8a6d7e45f7904c73a0af539642d3962e2a3c7268.1587407777.git.christophe.leroy@c-s.fr
2020-05-11 23:15:15 +10:00
Christophe Leroy e96d904ede powerpc: Replace _ALIGN_DOWN() by ALIGN_DOWN()
_ALIGN_DOWN() is specific to powerpc
ALIGN_DOWN() is generic and does the same

Replace _ALIGN_DOWN() by ALIGN_DOWN()

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Link: https://lore.kernel.org/r/3911a86d6b5bfa7ad88cd7c82416fbe6bb47e793.1587407777.git.christophe.leroy@c-s.fr
2020-05-11 23:15:15 +10:00
Michael Ellerman e2a8b49e79 powerpc/uaccess: Don't use "m<>" constraint
The "m<>" constraint breaks compilation with GCC 4.6.x era compilers.

The use of the constraint allows the compiler to use update-form
instructions, however in practice current compilers never generate
those forms for any of the current uses of __put_user_asm_goto().

We anticipate that GCC 4.6 will be declared unsupported for building
the kernel in the not too distant future. So for now just switch to
the "m" constraint.

Fixes: 334710b149 ("powerpc/uaccess: Implement unsafe_put_user() using 'asm goto'")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Segher Boessenkool <segher@kernel.crashing.org>
Link: https://lore.kernel.org/r/20200507123324.2250024-1-mpe@ellerman.id.au
2020-05-08 13:30:42 +10:00
Cédric Le Goater b1f9be9392 powerpc/xive: Enforce load-after-store ordering when StoreEOI is active
When an interrupt has been handled, the OS notifies the interrupt
controller with a EOI sequence. On a POWER9 system using the XIVE
interrupt controller, this can be done with a load or a store
operation on the ESB interrupt management page of the interrupt. The
StoreEOI operation has less latency and improves interrupt handling
performance but it was deactivated during the POWER9 DD2.0 timeframe
because of ordering issues. We use the LoadEOI today but we plan to
reactivate StoreEOI in future architectures.

There is usually no need to enforce ordering between ESB load and
store operations as they should lead to the same result. E.g. a store
trigger and a load EOI can be executed in any order. Assuming the
interrupt state is PQ=10, a store trigger followed by a load EOI will
return a Q bit. In the reverse order, it will create a new interrupt
trigger from HW. In both cases, the handler processing interrupts is
notified.

In some cases, the XIVE_ESB_SET_PQ_10 load operation is used to
disable temporarily the interrupt source (mask/unmask). When the
source is reenabled, the OS can detect if interrupts were received
while the source was disabled and reinject them. This process needs
special care when StoreEOI is activated. The ESB load and store
operations should be correctly ordered because a XIVE_ESB_STORE_EOI
operation could leave the source enabled if it has not completed
before the loads.

For those cases, we enforce Load-after-Store ordering with a special
load operation offset. To avoid performance impact, this ordering is
only enforced when really needed, that is when interrupt sources are
temporarily disabled with the XIVE_ESB_SET_PQ_10 load. It should not
be needed for other loads.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200220081506.31209-1-clg@kaod.org
2020-05-07 22:58:31 +10:00
Christophe Leroy 4833ce06e6 powerpc/32s: Fix build failure with CONFIG_PPC_KUAP_DEBUG
gpr2 is not a parametre of kuap_check(), it doesn't exist.

Use gpr instead.

Fixes: a68c31fc01 ("powerpc/32s: Implement Kernel Userspace Access Protection")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/ea599546f2a7771bde551393889e44e6b2632332.1587368807.git.christophe.leroy@c-s.fr
2020-05-07 17:25:54 +10:00
Michael Ellerman 1f12096aca Merge the lockless page table walk rework into next
This merges the lockless page table walk rework series from Aneesh.
Because it touches powerpc KVM code we are sharing it with the kvm-ppc
tree in our topic/ppc-kvm branch.

This is the cover letter from Aneesh:

Avoid IPI while updating page table entries.

Problem Summary:
Slow termination of KVM guest with large guest RAM config due to a
large number of IPIs that were caused by clearing level 1 PTE
entries (THP) entries. This is shown in the stack trace below.

- qemu-system-ppc  [kernel.vmlinux]            [k] smp_call_function_many
   - smp_call_function_many
      - 36.09% smp_call_function_many
           serialize_against_pte_lookup
           radix__pmdp_huge_get_and_clear
           zap_huge_pmd
           unmap_page_range
           unmap_vmas
           unmap_region
           __do_munmap
           __vm_munmap
           sys_munmap
          system_call
           __munmap
           qemu_ram_munmap
           qemu_anon_ram_free
           reclaim_ramblock
           call_rcu_thread
           qemu_thread_start
           start_thread
           __clone

Why we need to do IPI when clearing PMD entries:
This was added as part of commit: 13bd817bb8 ("powerpc/thp: Serialize pmd clear against a linux page table walk")

serialize_against_pte_lookup makes sure that all parallel lockless
page table walk completes before we convert a PMD pte entry to regular
pmd entry. We end up doing that conversion in the below scenarios

1) __split_huge_zero_page_pmd
2) do_huge_pmd_wp_page_fallback
3) MADV_DONTNEED running parallel to page faults.

local_irq_disable and lockless page table walk:

The lockless page table walk work with the assumption that we can
dereference the page table contents without holding a lock. For this
to work, we need to make sure we read the page table contents
atomically and page table pages are not going to be freed/released
while we are walking the table pages. We can achieve by using a rcu
based freeing for page table pages or if the architecture implements
broadcast tlbie, we can block the IPI as we walk the page table pages.

To support both the above framework, lockless page table walk is done
with irq disabled instead of rcu_read_lock()

We do have two interface for lockless page table walk, gup fast and
__find_linux_pte. This patch series makes __find_linux_pte table walk
safe against the conversion of PMD PTE to regular PMD.

gup fast:

gup fast is already safe against THP split because kernel now
differentiate between a pmd split and a compound page split. gup fast
can run parallel to a pmd split and we prevent a parallel gup fast to
a hugepage split, by freezing the page refcount and failing the
speculative page ref increment.

Similar to how gup is safe against parallel pmd split, this patch
series updates the __find_linux_pte callers to be safe against a
parallel pmd split. We do that by enforcing the following rules.

1) Don't reload the pte value, because that can be updated in
   parallel.
2) Code should be able to work with a stale PTE value and not the
   recent one. ie, the pte value that we are looking at may not be the
   latest value in the page table.
3) Before looking at pte value check for _PAGE_PTE bit. We now do this
as part of pte_present() check.

Performance:

This speeds up Qemu guest RAM del/unplug time as below
128 core, 496GB guest:

Without patch:
  munmap start: timer = 13162 ms, PID=7684
  munmap finish: timer = 95312 ms, PID=7684 - delta = 82150 ms

With patch (upto removing IPI)
  munmap start: timer = 196449 ms, PID=6681
  munmap finish: timer = 196488 ms, PID=6681 - delta = 39ms

With patch (with adding the tlb invalidate in pmdp_huge_get_and_clear_full)
  munmap start: timer = 196345 ms, PID=6879
  munmap finish: timer = 196714 ms, PID=6879 - delta = 369ms

Link: https://lore.kernel.org/r/20200505071729.54912-1-aneesh.kumar@linux.ibm.com
2020-05-06 15:53:24 +10:00
Aneesh Kumar K.V 75358ea359 powerpc/mm/book3s64: Fix MADV_DONTNEED and parallel page fault race
MADV_DONTNEED holds mmap_sem in read mode and that implies a
parallel page fault is possible and the kernel can end up with a level 1 PTE
entry (THP entry) converted to a level 0 PTE entry without flushing
the THP TLB entry.

Most architectures including POWER have issues with kernel instantiating a level
0 PTE entry while holding level 1 TLB entries.

The code sequence I am looking at is

down_read(mmap_sem)                         down_read(mmap_sem)

zap_pmd_range()
 zap_huge_pmd()
  pmd lock held
  pmd_cleared
  table details added to mmu_gather
  pmd_unlock()
                                         insert a level 0 PTE entry()

tlb_finish_mmu().

Fix this by forcing a tlb flush before releasing pmd lock if this is
not a fullmm invalidate. We can safely skip this invalidate for
task exit case (fullmm invalidate) because in that case we are sure
there can be no parallel fault handlers.

This do change the Qemu guest RAM del/unplug time as below

128 core, 496GB guest:

Without patch:
munmap start: timer = 196449 ms, PID=6681
munmap finish: timer = 196488 ms, PID=6681 - delta = 39ms

With patch:
munmap start: timer = 196345 ms, PID=6879
munmap finish: timer = 196714 ms, PID=6879 - delta = 369ms

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200505071729.54912-23-aneesh.kumar@linux.ibm.com
2020-05-05 21:20:16 +10:00
Aneesh Kumar K.V 0e11df9649 powerpc/kvm/book3s: Use pte_present instead of opencoding _PAGE_PRESENT check
This adds _PAGE_PTE check and makes sure we validate the pte value returned via
find_kvm_host_pte.

NOTE: this also considers _PAGE_INVALID to the software valid bit.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200505071729.54912-20-aneesh.kumar@linux.ibm.com
2020-05-05 21:20:16 +10:00
Aneesh Kumar K.V 35528876a9 powerpc/kvm/book3s: Add helper for host page table walk
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200505071729.54912-13-aneesh.kumar@linux.ibm.com
2020-05-05 21:20:15 +10:00
Aneesh Kumar K.V 6cdf30375f powerpc/kvm/book3s: Use kvm helpers to walk shadow or secondary table
update kvmppc_hv_handle_set_rc to use find_kvm_nested_guest_pte and
find_kvm_secondary_pte

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200505071729.54912-12-aneesh.kumar@linux.ibm.com
2020-05-05 21:20:15 +10:00
Aneesh Kumar K.V 4b99412ed6 powerpc/kvm/book3s: Add helper to walk partition scoped linux page table.
The locking rules for walking partition scoped table is different from process
scoped table. Hence add a helper for secondary linux page table walk and also
add check whether we are holding the right locks.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200505071729.54912-10-aneesh.kumar@linux.ibm.com
2020-05-05 21:20:15 +10:00
Aneesh Kumar K.V 7900757ce1 powerpc/hash64: Restrict page table lookup using init_mm with __flush_hash_table_range
This is only used with init_mm currently. Walking init_mm is much simpler
because we don't need to handle concurrent page table like other mm_context

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200505071729.54912-5-aneesh.kumar@linux.ibm.com
2020-05-05 21:20:14 +10:00
Aneesh Kumar K.V ec4abf1e70 powerpc/mm/hash64: use _PAGE_PTE when checking for pte_present
This makes the pte_present check stricter by checking for additional _PAGE_PTE
bit. A level 1 pte pointer (THP pte) can be switched to a pointer to level 0 pte
page table page by following two operations.

1) THP split.
2) madvise(MADV_DONTNEED) in parallel to page fault.

A lockless page table walk need to make sure we can handle such changes
gracefully.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200505071729.54912-4-aneesh.kumar@linux.ibm.com
2020-05-05 21:20:14 +10:00
Aneesh Kumar K.V fe4a6856cb powerpc/pkeys: Avoid using lockless page table walk
Fetch pkey from vma instead of linux page table. Also document the fact that in
some cases the pkey returned in siginfo won't be the same as the one we took
keyfault on. Even with linux page table walk, we can end up in a similar scenario.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200505071729.54912-2-aneesh.kumar@linux.ibm.com
2020-05-05 21:20:13 +10:00
Hari Bathini 02c04e374e powerpc/fadump: use static allocation for reserved memory ranges
At times, memory ranges have to be looked up during early boot, when
kernel couldn't be initialized for dynamic memory allocation. In fact,
reserved-ranges look up is needed during FADump memory reservation.
Without accounting for reserved-ranges in reserving memory for FADump,
MPIPL boot fails with memory corruption issues. So, extend memory
ranges handling to support static allocation and populate reserved
memory ranges during early boot.

Fixes: dda9dbfeeb ("powerpc/fadump: consider reserved ranges while releasing memory")
Cc: stable@vger.kernel.org
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/158737294432.26700.4830263187856221314.stgit@hbathini.in.ibm.com
2020-05-04 22:29:58 +10:00
Michael Ellerman 0094368e3b powerpc/64s: Fix unrecoverable SLB crashes due to preemption check
Hugh reported that his trusty G5 crashed after a few hours under load
with an "Unrecoverable exception 380".

The crash is in interrupt_return() where we check lazy_irq_pending(),
which calls get_paca() and with CONFIG_DEBUG_PREEMPT=y that goes to
check_preemption_disabled() via debug_smp_processor_id().

As Nick explained on the list:

  Problem is MSR[RI] is cleared here, ready to do the last few things
  for interrupt return where we're not allowed to take any other
  interrupts.

  SLB interrupts can happen just about anywhere aside from kernel
  text, global variables, and stack. When that hits, it appears to be
  unrecoverable due to RI=0.

The problematic access is in preempt_count() which is:

	return READ_ONCE(current_thread_info()->preempt_count);

Because of THREAD_INFO_IN_TASK, current_thread_info() just points to
current, so the access is to somewhere in kernel memory, but not on
the stack or in .data, which means it can cause an SLB miss. If we
take an SLB miss with RI=0 it is fatal.

The easiest solution is to add a version of lazy_irq_pending() that
doesn't do the preemption check and call it from the interrupt return
path.

Fixes: 68b34588e2 ("powerpc/64/sycall: Implement syscall entry/exit logic in C")
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200502143316.929341-1-mpe@ellerman.id.au
2020-05-04 09:18:06 +10:00
Christophe Leroy 4fe5cda9f8 powerpc/uaccess: Implement user_read_access_begin and user_write_access_begin
Add support for selective read or write user access with
user_read_access_begin/end and user_write_access_begin/end.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/6c83af0f0809ef2a955c39ac622767f6cbede035.1585898438.git.christophe.leroy@c-s.fr
2020-05-01 12:37:15 +10:00
Christophe Leroy 17bc43367f powerpc/uaccess: Implement unsafe_copy_to_user() as a simple loop
At the time being, unsafe_copy_to_user() is based on
raw_copy_to_user() which calls __copy_tofrom_user().

__copy_tofrom_user() is a big optimised function to copy big amount
of data. It aligns destinations to cache line in order to use
dcbz instruction.

Today unsafe_copy_to_user() is called only from filldir().
It is used to mainly copy small amount of data like filenames,
so __copy_tofrom_user() is not fit.

Also, unsafe_copy_to_user() is used within user_access_begin/end
sections. In those section, it is preferable to not call functions.

Rewrite unsafe_copy_to_user() as a macro that uses __put_user_goto().
We first perform a loop of long, then we finish with necessary
complements.

unsafe_copy_to_user() might be used in the near future to copy
fixed-size data, like pt_regs structs during signal processing.
Having it as a macro allows GCC to optimise it for instead when
it knows the size in advance, it can unloop loops, drop complements
when the size is a multiple of longs, etc ...

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/fe952112c29bf6a0a2778c9e6bbb4f4afd2c4258.1587143308.git.christophe.leroy@c-s.fr
2020-04-30 20:30:40 +10:00
Christophe Leroy 334710b149 powerpc/uaccess: Implement unsafe_put_user() using 'asm goto'
unsafe_put_user() is designed to take benefit of 'asm goto'.

Instead of using the standard __put_user() approach and branch
based on the returned error, use 'asm goto' and make the
exception code branch directly to the error label. There is
no code anymore in the fixup section.

This change significantly simplifies functions using
unsafe_put_user()

Small exemple of the benefit with the following code:

struct test {
	u32 item1;
	u16 item2;
	u8 item3;
	u64 item4;
};

int set_test_to_user(struct test __user *test, u32 item1, u16 item2, u8 item3, u64 item4)
{
	unsafe_put_user(item1, &test->item1, failed);
	unsafe_put_user(item2, &test->item2, failed);
	unsafe_put_user(item3, &test->item3, failed);
	unsafe_put_user(item4, &test->item4, failed);
	return 0;
failed:
	return -EFAULT;
}

Before the patch:

00000be8 <set_test_to_user>:
 be8:	39 20 00 00 	li      r9,0
 bec:	90 83 00 00 	stw     r4,0(r3)
 bf0:	2f 89 00 00 	cmpwi   cr7,r9,0
 bf4:	40 9e 00 38 	bne     cr7,c2c <set_test_to_user+0x44>
 bf8:	b0 a3 00 04 	sth     r5,4(r3)
 bfc:	2f 89 00 00 	cmpwi   cr7,r9,0
 c00:	40 9e 00 2c 	bne     cr7,c2c <set_test_to_user+0x44>
 c04:	98 c3 00 06 	stb     r6,6(r3)
 c08:	2f 89 00 00 	cmpwi   cr7,r9,0
 c0c:	40 9e 00 20 	bne     cr7,c2c <set_test_to_user+0x44>
 c10:	90 e3 00 08 	stw     r7,8(r3)
 c14:	91 03 00 0c 	stw     r8,12(r3)
 c18:	21 29 00 00 	subfic  r9,r9,0
 c1c:	7d 29 49 10 	subfe   r9,r9,r9
 c20:	38 60 ff f2 	li      r3,-14
 c24:	7d 23 18 38 	and     r3,r9,r3
 c28:	4e 80 00 20 	blr
 c2c:	38 60 ff f2 	li      r3,-14
 c30:	4e 80 00 20 	blr

00000000 <.fixup>:
	...
  b8:	39 20 ff f2 	li      r9,-14
  bc:	48 00 00 00 	b       bc <.fixup+0xbc>
			bc: R_PPC_REL24	.text+0xbf0
  c0:	39 20 ff f2 	li      r9,-14
  c4:	48 00 00 00 	b       c4 <.fixup+0xc4>
			c4: R_PPC_REL24	.text+0xbfc
  c8:	39 20 ff f2 	li      r9,-14
  cc:	48 00 00 00 	b       cc <.fixup+0xcc>
  d0:	39 20 ff f2 	li      r9,-14
  d4:	48 00 00 00 	b       d4 <.fixup+0xd4>
			d4: R_PPC_REL24	.text+0xc18

00000000 <__ex_table>:
	...
			a0: R_PPC_REL32	.text+0xbec
			a4: R_PPC_REL32	.fixup+0xb8
			a8: R_PPC_REL32	.text+0xbf8
			ac: R_PPC_REL32	.fixup+0xc0
			b0: R_PPC_REL32	.text+0xc04
			b4: R_PPC_REL32	.fixup+0xc8
			b8: R_PPC_REL32	.text+0xc10
			bc: R_PPC_REL32	.fixup+0xd0
			c0: R_PPC_REL32	.text+0xc14
			c4: R_PPC_REL32	.fixup+0xd0

After the patch:

00000be8 <set_test_to_user>:
 be8:	90 83 00 00 	stw     r4,0(r3)
 bec:	b0 a3 00 04 	sth     r5,4(r3)
 bf0:	98 c3 00 06 	stb     r6,6(r3)
 bf4:	90 e3 00 08 	stw     r7,8(r3)
 bf8:	91 03 00 0c 	stw     r8,12(r3)
 bfc:	38 60 00 00 	li      r3,0
 c00:	4e 80 00 20 	blr
 c04:	38 60 ff f2 	li      r3,-14
 c08:	4e 80 00 20 	blr

00000000 <__ex_table>:
	...
			a0: R_PPC_REL32	.text+0xbe8
			a4: R_PPC_REL32	.text+0xc04
			a8: R_PPC_REL32	.text+0xbec
			ac: R_PPC_REL32	.text+0xc04
			b0: R_PPC_REL32	.text+0xbf0
			b4: R_PPC_REL32	.text+0xc04
			b8: R_PPC_REL32	.text+0xbf4
			bc: R_PPC_REL32	.text+0xc04
			c0: R_PPC_REL32	.text+0xbf8
			c4: R_PPC_REL32	.text+0xc04

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Segher Boessenkool <segher@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/23e680624680a9a5405f4b88740d2596d4b17c26.1587143308.git.christophe.leroy@c-s.fr
2020-04-30 20:30:40 +10:00
Nicholas Piggin d02f6b7dab powerpc/uaccess: Evaluate macro arguments once, before user access is allowed
get/put_user() can be called with nontrivial arguments. fs/proc/page.c
has a good example:

    if (put_user(stable_page_flags(ppage), out)) {

stable_page_flags() is quite a lot of code, including spin locks in
the page allocator.

Ensure these arguments are evaluated before user access is allowed.

This improves security by reducing code with access to userspace, but
it also fixes a PREEMPT bug with KUAP on powerpc/64s:
stable_page_flags() is currently called with AMR set to allow writes,
it ends up calling spin_unlock(), which can call preempt_schedule. But
the task switch code can not be called with AMR set (it relies on
interrupts saving the register), so this blows up.

It's fine if the code inside allow_user_access() is preemptible,
because a timer or IPI will save the AMR, but it's not okay to
explicitly cause a reschedule.

Fixes: de78a9c42a ("powerpc: Add a framework for Kernel Userspace Access Protection")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200407041245.600651-1-npiggin@gmail.com
2020-04-30 20:21:44 +10:00
Gautham R. Shenoy 6909f179ca powerpc/sysfs: Show idle_purr and idle_spurr for every CPU
On Pseries LPARs, to calculate utilization, we need to know the
[S]PURR ticks when the CPUs were busy or idle.

The total PURR and SPURR ticks are already exposed via the per-cpu
sysfs files "purr" and "spurr". This patch adds support for exposing
the idle PURR and SPURR ticks via new per-cpu sysfs files named
"idle_purr" and "idle_spurr".

This patch also adds helper functions to accurately read the values of
idle_purr and idle_spurr especially from an interrupt context between
when the interrupt has occurred between the pseries_idle_prolog() and
pseries_idle_epilog(). This will ensure that the idle purr/spurr
values corresponding to the latest idle period is accounted for before
these values are read.

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1586249263-14048-5-git-send-email-ego@linux.vnet.ibm.com
2020-04-30 12:35:26 +10:00
Gautham R. Shenoy dc8afce5f4 powerpc/pseries: Account for SPURR ticks on idle CPUs
On Pseries LPARs, to calculate utilization, we need to know the
[S]PURR ticks when the CPUs were busy or idle.

Via pseries_idle_prolog(), pseries_idle_epilog(), we track the idle
PURR ticks in the VPA variable "wait_state_cycles". This patch extends
the support to account for the idle SPURR ticks.

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1586249263-14048-4-git-send-email-ego@linux.vnet.ibm.com
2020-04-30 12:35:26 +10:00