Commit Graph

266 Commits

Author SHA1 Message Date
Nicholas Piggin 94ee42727c powerpc/64s/hash: Simplify slb_flush_and_rebolt()
slb_flush_and_rebolt() is misleading, it is called in virtual mode, so
it can not possibly change the stack, so it should not be touching the
shadow area. And since vmalloc is no longer bolted, it should not
change any bolted mappings at all.

Change the name to slb_flush_and_restore_bolted(), and have it just
load the kernel stack from what's currently in the shadow SLB area.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14 18:04:09 +11:00
Christophe Leroy 34eb138ed7 powerpc/mm: don't use _PAGE_EXEC for calling hash_preload()
The 'access' parameter of hash_preload() is either 0 or _PAGE_EXEC.
Among the two versions of hash_preload(), only the PPC64 one is
doing something with this 'access' parameter.

In order to remove the use of _PAGE_EXEC outside platform code,
'access' parameter is replaced by 'is_exec' which will be either
true of false, and the PPC64 version of hash_preload() creates
the access flag based on 'is_exec'.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14 18:04:09 +11:00
Aneesh Kumar K.V da7ad366b4 powerpc/mm/book3s: Update pmd_present to look at _PAGE_PRESENT bit
With this patch we use 0x8000000000000000UL (_PAGE_PRESENT) to indicate a valid
pgd/pud/pmd entry. We also switch the p**_present() to look at this bit.

With pmd_present, we have a special case. We need to make sure we consider a
pmd marked invalid during THP split as present. Right now we clear the
_PAGE_PRESENT bit during a pmdp_invalidate. Inorder to consider this special
case we add a new pte bit _PAGE_INVALID (mapped to _RPAGE_SW0). This bit is
only used with _PAGE_PRESENT cleared. Hence we are not really losing a pte bit
for this special case. pmd_present is also updated to look at _PAGE_INVALID.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-03 15:39:58 +10:00
Michael Ellerman 54be0b9c7c Revert "convert SLB miss handlers to C" and subsequent commits
This reverts commits:
  5e46e29e6a ("powerpc/64s/hash: convert SLB miss handlers to C")
  8fed04d0f6 ("powerpc/64s/hash: remove user SLB data from the paca")
  655deecf67 ("powerpc/64s/hash: SLB allocation status bitmaps")
  2e1626744e ("powerpc/64s/hash: provide arch_setup_exec hooks for hash slice setup")
  89ca4e126a ("powerpc/64s/hash: Add a SLB preload cache")

This series had a few bugs, and the fixes are not all trivial. So
revert most of it for now.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-03 15:32:49 +10:00
Nicholas Piggin 8fed04d0f6 powerpc/64s/hash: remove user SLB data from the paca
User SLB mappig data is copied into the PACA from the mm->context so
it can be accessed by the SLB miss handlers.

After the C conversion, SLB miss handlers now run with relocation on,
and user SLB misses are able to take recursive kernel SLB misses, so
the user SLB mapping data can be removed from the paca and accessed
directly.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-09-19 22:01:46 +10:00
Christophe Leroy 45ef5992e0 powerpc: remove unnecessary inclusion of asm/tlbflush.h
asm/tlbflush.h is only needed for:
- using functions xxx_flush_tlb_xxx()
- using MMU_NO_CONTEXT
- including asm-generic/pgtable.h

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-07-30 22:48:20 +10:00
Aneesh Kumar K.V 1531cff44b powerpc/mm/hash: Remove the superfluous bitwise operation when find hpte group
When computing the starting slot number for a hash page table group we used
to do this
hpte_group = ((hash & htab_hash_mask) * HPTES_PER_GROUP) & ~0x7UL;

Multiplying with 8 (HPTES_PER_GROUP) imply the last three bits are 0. Hence we
really don't need to clear then separately.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-07-24 22:03:17 +10:00
Nicholas Piggin 2bf1071a8d powerpc/64s: Remove POWER9 DD1 support
POWER9 DD1 was never a product. It is no longer supported by upstream
firmware, and it is not effectively supported in Linux due to lack of
testing.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>
[mpe: Remove arch_make_huge_pte() entirely]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-07-16 11:37:21 +10:00
Michael Ellerman 481c63acba Merge branch 'topic/ppc-kvm' into next
Merge in some commits we're sharing with the kvm-ppc tree.
2018-06-03 20:23:54 +10:00
Simon Guo eacbb218fb powerpc: Export tm_enable()/tm_disable/tm_abort() APIs
This patch exports tm_enable()/tm_disable/tm_abort() APIs, which
will be used for PR KVM transactional memory logic.

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-05-24 16:04:02 +10:00
Aneesh Kumar K.V 738f964555 powerpc/mm: Use page fragments for allocation page table at PMD level
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-05-15 22:29:12 +10:00
Aneesh Kumar K.V 8a6c697b99 powerpc/mm: Implement helpers for pagetable fragment support at PMD level
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-05-15 22:29:12 +10:00
Hari Bathini 8597538712 powerpc/fadump: Do not use hugepages when fadump is active
FADump capture kernel boots in restricted memory environment preserving
the context of previous kernel to save vmcore. Supporting hugepages in
such environment makes things unnecessarily complicated, as hugepages
need memory set aside for them. This means most of the capture kernel's
memory is used in supporting hugepages. In most cases, this results in
out-of-memory issues while booting FADump capture kernel. But hugepages
are not of much use in capture kernel whose only job is to save vmcore.
So, disabling hugepages support, when fadump is active, is a reliable
solution for the out of memory issues. Introducing a flag variable to
disable HugeTLB support when fadump is active.

Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-05-03 23:09:25 +10:00
Nicholas Piggin 471d7ff8b5 powerpc/64s: Remove POWER4 support
POWER4 has been broken since at least the change 49d09bf2a6
("powerpc/64s: Optimise MSR handling in exception handling"), which
requires mtmsrd L=1 support. This was introduced in ISA v2.01, and
POWER4 supports ISA v2.00.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-04-01 00:47:50 +11:00
Michael Ellerman f437c51748 Merge branch 'topic/paca' into next
Bring in yet another series that touches KVM code, and might need to
be merged into the kvm-ppc branch to resolve conflicts.

This required some changes in pnv_power9_force_smt4_catch/release()
due to the paca array becomming an array of pointers.
2018-03-31 09:09:36 +11:00
Aneesh Kumar K.V f384796c40 powerpc/mm: Add support for handling > 512TB address in SLB miss
For addresses above 512TB we allocate additional mmu contexts. To make
it all easy, addresses above 512TB are handled with IR/DR=1 and with
stack frame setup.

The mmu_context_t is also updated to track the new extended_ids. To
support upto 4PB we need a total 8 contexts.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Minor formatting tweaks and comment wording, switch BUG to WARN
      in get_ea_context().]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-31 00:10:38 +11:00
Nicholas Piggin 29ab6c4708 powerpc/mm: Pass node id into create_section_mapping
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Move __map_kernel_page_nid() inside #ifdef SPARSEMEM_VMEMMAP]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-31 00:07:10 +11:00
Paul Mackerras dbfcf3cb9c powerpc/64: Call H_REGISTER_PROC_TBL when running as a HPT guest on POWER9
On POWER9, since commit cc3d294013 ("powerpc/64: Enable use of radix
MMU under hypervisor on POWER9", 2017-01-30), we set both the radix and
HPT bits in the client-architecture-support (CAS) vector, which tells
the hypervisor that we can do either radix or HPT.  According to PAPR,
if we use this combination we are promising to do a H_REGISTER_PROC_TBL
hcall later on to let the hypervisor know whether we are doing radix
or HPT.  We currently do this call if we are doing radix but not if
we are doing HPT.  If the hypervisor is able to support both radix
and HPT guests, it would be entitled to defer allocation of the HPT
until the H_REGISTER_PROC_TBL call, and to fail any attempts to create
HPTEs until the H_REGISTER_PROC_TBL call.  Thus we need to do a
H_REGISTER_PROC_TBL call when we are doing HPT; otherwise we may
crash at boot time.

This adds the code to call H_REGISTER_PROC_TBL in this case, before
we attempt to create any HPT entries using H_ENTER.

Fixes: cc3d294013 ("powerpc/64: Enable use of radix MMU under hypervisor on POWER9")
Cc: stable@vger.kernel.org # v4.11+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-27 19:25:08 +11:00
Christophe Leroy 15472423ce powerpc/mm/slice: Allow up to 64 low slices
While the implementation of the "slices" address space allows
a significant amount of high slices, it limits the number of
low slices to 16 due to the use of a single u64 low_slices_psize
element in struct mm_context_t

On the 8xx, the minimum slice size is the size of the area
covered by a single PMD entry, ie 4M in 4K pages mode and 64M in
16K pages mode. This means we could have at least 64 slices.

In order to override this limitation, this patch switches the
handling of low_slices_psize to char array as done already for
high_slices_psize.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-06 09:21:23 +11:00
Aneesh Kumar K.V fae2211697 powerpc/mm: Fix crashes with 16G huge pages
To support memory keys, we moved the hash pte slot information to the
second half of the page table. This was ok with PTE entries at level
4 (PTE page) and level 3 (PMD). We already allocate larger page table
pages at those levels to accomodate extra details. For level 4 we
already have the extra space which was used to track 4k hash page
table entry details and at level 3 the extra space was allocated to
track the THP details.

With hugetlbfs PTE, we used this extra space at the PMD level to store
the slot details. But we also support hugetlbfs PTE at PUD level for
16GB pages and PUD level page didn't allocate extra space. This
resulted in memory corruption.

Fix this by allocating extra space at PUD level when HUGETLB is
enabled.

Fixes: bf9a95f9a6 ("powerpc: Free up four 64K PTE bits in 64K backed HPTE pages")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-02-13 22:37:47 +11:00
David Gibson 7339390d77 powerpc/pseries: Don't give a warning when HPT resizing isn't available
As of 438cc81a41 "powerpc/pseries: Automatically resize HPT for memory hot
add/remove" when running on the pseries platform, we always attempt to
use the PAPR extension to resize the hashed page table (HPT) when we add
or remove memory.

This is fine, but when the extension is available we'll give a harmless,
but scary warning.  This patch suppresses the warning in this case.  It
will still warn if the feature is supposed to be available, but didn't
work.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-22 05:48:34 +11:00
Aneesh Kumar K.V 10527e8081 powerpc/hash: Skip non initialized page size in init_hpte_page_sizes
One of the easiest way to test config with 4K HPTE is to disable 64K hardware
page size like below.

int __init htab_dt_scan_page_sizes(unsigned long node,

 		size -= 3; prop += 3;
 		base_idx = get_idx_from_shift(base_shift);
-		if (base_idx < 0) {
+		if (base_idx < 0 || base_idx == MMU_PAGE_64K) {
 			/* skip the pte encoding also */
 			prop += lpnum * 2; size -= lpnum * 2;

But then this results in error in other part of the code such as MPSS parsing
where we look at 4K base page size and 64K actual page size support.

This patch fix MPSS parsing by ignoring the actual page sizes marked
unsupported. In reality this can happen only with a corrupt device tree. But it
is good to tighten the error check.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-21 23:37:42 +11:00
Ram Pai 087003e9ef powerpc: introduce get_mm_addr_key() helper
get_mm_addr_key() helper returns the pkey associated with
an address corresponding to a given mm_struct.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20 22:59:05 +11:00
Ram Pai a6590ca55f powerpc: Program HPTE key protection bits
Map the PTE protection key bits to the HPTE key protection bits,
while creating HPTE  entries.

Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20 22:59:03 +11:00
Ram Pai 92e3da3cf1 powerpc: initial pkey plumbing
Basic  plumbing  to   initialize  the   pkey  system.
Nothing is enabled yet. A later patch will enable it
once all the infrastructure is in place.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
[mpe: Rework copyrights to use SPDX tags]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20 21:45:03 +11:00
Nicholas Piggin c610d65c0a powerpc/pseries: lift RTAS limit for hash
With the previous patch to switch to 64-bit mode after returning from
RTAS and before doing any memory accesses, the RMA limit need not be
clamped to 1GB to avoid RTAS bugs.

Keep the 1GB limit for older firmware (although this is more of a kernel
concern than RTAS), and remove it starting with POWER9.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-18 00:45:34 +11:00
Nicholas Piggin 1513c33d71 powerpc/powernv: Remove real mode access limit for early allocations
This removes the RMA limit on powernv platform, which constrains
early allocations such as PACAs and stacks. There are still other
restrictions that must be followed, such as bolted SLB limits, but
real mode addressing has no constraints.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-18 00:41:44 +11:00
Nicholas Piggin d4748276ae powerpc/64s: Improve local TLB flush for boot and MCE on POWER9
There are several cases outside the normal address space management
where a CPU's entire local TLB is to be flushed:

  1. Booting the kernel, in case something has left stale entries in
     the TLB (e.g., kexec).

  2. Machine check, to clean corrupted TLB entries.

One other place where the TLB is flushed, is waking from deep idle
states. The flush is a side-effect of calling ->cpu_restore with the
intention of re-setting various SPRs. The flush itself is unnecessary
because in the first case, the TLB should not acquire new corrupted
TLB entries as part of sleep/wake (though they may be lost).

This type of TLB flush is coded inflexibly, several times for each CPU
type, and they have a number of problems with ISA v3.0B:

- The current radix mode of the MMU is not taken into account, it is
  always done as a hash flushn For IS=2 (LPID-matching flush from host)
  and IS=3 with HV=0 (guest kernel flush), tlbie(l) is undefined if
  the R field does not match the current radix mode.

- ISA v3.0B hash must flush the partition and process table caches as
  well.

- ISA v3.0B radix must flush partition and process scoped translations,
  partition and process table caches, and also the page walk cache.

So consolidate the flushing code and implement it in C and inline asm
under the mm/ directory with the rest of the flush code. Add ISA v3.0B
cases for radix and hash, and use the radix flush in radix environment.

Provide a way for IS=2 (LPID flush) to specify the radix mode of the
partition. Have KVM pass in the radix mode of the guest.

Take out the flushes from early cputable/dt_cpu_ftrs detection hooks,
and move it later in the boot process after, the MMU registers are set
up and before relocation is first turned on.

The TLB flush is no longer called when restoring from deep idle states.
This was not be done as a separate step because booting secondaries
uses the same cpu_restore as idle restore, which needs the TLB flush.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-18 00:40:31 +11:00
Ram Pai a854868646 powerpc: use helper functions to get and set hash slots
replace redundant code in __hash_page_4K() and flush_hash_page()
with helper functions pte_get_hash_gslot() and pte_set_hidx()

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-20 18:57:25 +11:00
Ram Pai 9d2edb1848 powerpc: Free up four 64K PTE bits in 4K backed HPTE pages
Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6,
in the 4K backed HPTE pages.These bits continue to be used
for 64K backed HPTE pages in this patch, but will be freed
up in the next patch. The bit numbers are big-endian as
defined in the ISA3.0

The patch does the following change to the 4k HTPE backed
64K PTE's format.

H_PAGE_BUSY moves from bit 3 to bit 9 (B bit in the figure
		below)
V0 which occupied bit 4 is not used anymore.
V1 which occupied bit 5 is not used anymore.
V2 which occupied bit 6 is not used anymore.
V3 which occupied bit 7 is not used anymore.

Before the patch, the 4k backed 64k PTE format was as follows

 0 1 2 3 4  5  6  7  8 9 10...........................63
 : : : : :  :  :  :  : : :                            :
 v v v v v  v  v  v  v v v                            v

,-,-,-,-,--,--,--,--,-,-,-,-,-,------------------,-,-,-,
|x|x|x|B|V0|V1|V2|V3|x| | |x|x|................|x|x|x|x| <- primary pte
'_'_'_'_'__'__'__'__'_'_'_'_'_'________________'_'_'_'_'
|S|G|I|X|S |G |I |X |S|G|I|X|..................|S|G|I|X| <- secondary pte
'_'_'_'_'__'__'__'__'_'_'_'_'__________________'_'_'_'_'

After the patch, the 4k backed 64k PTE format is as follows

 0 1 2 3 4  5  6  7  8 9 10...........................63
 : : : : :  :  :  :  : : :                            :
 v v v v v  v  v  v  v v v                            v

,-,-,-,-,--,--,--,--,-,-,-,-,-,------------------,-,-,-,
|x|x|x| |  |  |  |  |x|B| |x|x|................|.|.|.|.| <- primary pte
'_'_'_'_'__'__'__'__'_'_'_'_'_'________________'_'_'_'_'
|S|G|I|X|S |G |I |X |S|G|I|X|..................|S|G|I|X| <- secondary pte
'_'_'_'_'__'__'__'__'_'_'_'_'__________________'_'_'_'_'

the four bits S,G,I,X (one quadruplet per 4k HPTE) that
cache the hash-bucket slot value, is initialized to
1,1,1,1 indicating -- an invalid slot. If a HPTE gets
cached in a 1111 slot(i.e 7th slot of secondary hash
bucket), it is released immediately. In other words,
even though 1111 is a valid slot value in the hash
bucket, we consider it invalid and release the slot and
the HPTE. This gives us the opportunity to determine
the validity of S,G,I,X bits based on its contents and
not on any of the bits V0,V1,V2 or V3 in the primary PTE

When we release a HPTE cached in the 1111 slot
we also release a legitimate slot in the primary
hash bucket and unmap its corresponding HPTE. This
is to ensure that we do get a HPTE cached in a slot
of the primary hash bucket, the next time we retry.

Though treating 1111 slot as invalid, reduces the
number of available slots in the hash bucket and may
have an effect on the performance, the probabilty of
hitting a 1111 slot is extermely low.

Compared to the current scheme, the above scheme
reduces the number of false hash table updates
significantly and has the added advantage of releasing
four valuable PTE bits for other purpose.

NOTE:even though bits 3, 4, 5, 6, 7 are not used when
the 64K PTE is backed by 4k HPTE, they continue to be
used if the PTE gets backed by 64k HPTE. The next
patch will decouple that aswell, and truely release the
bits.

This idea was jointly developed by Paul Mackerras,
Aneesh, Michael Ellermen and myself.

4K PTE format remains unchanged currently.

The patch does the following code changes
a) PTE flags are split between 64k and 4k header files.
b) __hash_page_4K() is reimplemented to reflect the
 above logic.

Acked-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-20 18:57:20 +11:00
Ram Pai 318995b4f5 powerpc: introduce pte_get_hash_gslot() helper
Introduce pte_get_hash_gslot()() which returns the global slot number of
the HPTE in the global hash table.

This function will come in handy as we work towards re-arranging the PTE
bits in the later patches.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-20 18:57:19 +11:00
Aneesh Kumar K.V 7f142661d4 powerpc/mm/hash: Add pr_fmt() to hash_utils64.c
Make the printks look a bit nicer by adding a prefix.

Radix config now do
 radix-mmu: Page sizes from device-tree:
 radix-mmu: Page size shift = 12 AP=0x0
 radix-mmu: Page size shift = 16 AP=0x5
 radix-mmu: Page size shift = 21 AP=0x1
 radix-mmu: Page size shift = 30 AP=0x2

This patch update hash config to do similar dmesg output. With the patch we have

 hash-mmu: Page sizes from device-tree:
 hash-mmu: base_shift=12: shift=12, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=0
 hash-mmu: base_shift=12: shift=16, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=7
 hash-mmu: base_shift=12: shift=24, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=56
 hash-mmu: base_shift=16: shift=16, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=1
 hash-mmu: base_shift=16: shift=24, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=8
 hash-mmu: base_shift=20: shift=20, sllp=0x0111, avpnm=0x00000000, tlbiel=0, penc=2
 hash-mmu: base_shift=24: shift=24, sllp=0x0100, avpnm=0x00000001, tlbiel=0, penc=0
 hash-mmu: base_shift=34: shift=34, sllp=0x0120, avpnm=0x000007ff, tlbiel=0, penc=3

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06 16:48:13 +11:00
Benjamin Herrenschmidt b426e4bd77 powerpc/mm: Use mm_is_thread_local() instread of open-coding
We open-code testing for the mm being local to the current CPU
in a few places. Use our existing helper instead.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-23 22:27:45 +10:00
Michael Ellerman 8434f0892e Merge branch 'topic/ppc-kvm' into next
Bring in the commit to rename find_linux_pte_or_hugepte() which touches
arch and KVM code, and might need to be merged with the kvmppc tree to
avoid conflicts.
2017-08-17 23:14:17 +10:00
Aneesh Kumar K.V 94171b19c3 powerpc/mm: Rename find_linux_pte_or_hugepte()
Add newer helpers to make the function usage simpler. It is always
recommended to use find_current_mm_pte() for walking the page table.
If we cannot use find_current_mm_pte(), it should be documented why
the said usage of __find_linux_pte() is safe against a parallel THP
split.

For now we have KVM code using __find_linux_pte(). This is because kvm
code ends up calling __find_linux_pte() in real mode with MSR_EE=0 but
with PACA soft_enabled = 1. We may want to fix that later and make
sure we keep the MSR_EE and PACA soft_enabled in sync. When we do that
we can switch kvm to use find_linux_pte().

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-17 23:13:46 +10:00
Aneesh Kumar K.V 79cc38ded1 powerpc/mm/hugetlb: Add support for reserving gigantic huge pages via kernel command line
With commit aa888a7497 ("hugetlb: support larger than MAX_ORDER") we added
support for allocating gigantic hugepages via kernel command line. Switch
ppc64 arch specific code to use that.

W.r.t FSL support, we now limit our allocation range using BOOTMEM_ALLOC_ACCESSIBLE.

We use the kernel command line to do reservation of hugetlb pages on powernv
platforms. On pseries hash mmu mode the supported gigantic huge page size is
16GB and that can only be allocated with hypervisor assist. For pseries the
command line option doesn't do the allocation. Instead pseries does gigantic
hugepage allocation based on hypervisor hint that is specified via
"ibm,expected#pages" property of the memory node.

Cc: Scott Wood <oss@buserror.net>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-16 14:56:12 +10:00
Michael Ellerman 63ee9b2ff9 powerpc/mm/book3s64: Make KERN_IO_START a variable
Currently KERN_IO_START is defined as:

 #define KERN_IO_START  (KERN_VIRT_START + (KERN_VIRT_SIZE >> 1))

Although it looks like a constant, both the components are actually
variables, to allow us to have a different value between Radix and
Hash with a single kernel.

However that still requires both Radix and Hash to place the kernel IO
region at the same location relative to the start and end of the
kernel virtual region (namely 1/2 way through it), and we'd like to
change that.

So split KERN_IO_START out into its own variable, and initialise it
for Radix and Hash. In the medium term we should be able to
reconsolidate this, by doing a more involved rearrangement of the
location of the regions.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-08 19:37:04 +10:00
Rui Teng 23493c1219 powerpc/mm: Fix check of multiple 16G pages from device tree
The offset of hugepage block will not be 16G, if the expected
page is more than one. Calculate the totol size instead of the
hardcode value.

Fixes: 4792adbac9 ("powerpc: Don't use a 16G page if beyond mem= limits")
Signed-off-by: Rui Teng <rui.teng@linux.vnet.ibm.com>
Tested-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-31 16:56:58 +10:00
Balbir Singh 0428491cba powerpc/mm: Trace tlbie(l) instructions
Add a trace point for tlbie(l) (Translation Lookaside Buffer Invalidate
Entry (Local)) instructions.

The tlbie instruction has changed over the years, so not all versions
accept the same operands. Use the ISA v3 field operands because they are
the most verbose, we may change them in future.

Example output:

  qemu-system-ppc-5371  [016]  1412.369519: tlbie:
  	tlbie with lpid 0, local 1, rb=67bd8900174c11c1, rs=0, ric=0 prs=0 r=0

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
[mpe: Add some missing trace_tlbie()s, reword change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-23 21:14:49 +10:00
Michael Ellerman 7644d5819c powerpc: Create asm/debugfs.h and move powerpc_debugfs_root there
powerpc_debugfs_root is the dentry representing the root of the
"powerpc" directory tree in debugfs.

Currently it sits in asm/debug.h, a long with some other things that
have "debug" in the name, but are otherwise unrelated.

Pull it out into a separate header, which also includes linux/debugfs.h,
and convert all the users to include debugfs.h instead of debug.h.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-11 07:46:03 +10:00
Oliver O'Halloran f6f9195ba0 powerpc/mm: Remove stale comment about the DART hole
The code to fix the problem it describes was removed in commit
c40785ad30 ("powerpc/dart: Use a cachable DART"), and it uses the
stupid comment style. Away it goooooooooooooes!

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-03 23:34:48 +10:00
Aneesh Kumar K.V 82228e362f powerpc/pseries: Skip using reserved virtual address range
Now that we use all the available virtual address range, we need to make
sure we don't generate VSID such that it overlaps with the reserved vsid
range. Reserved vsid range include the virtual address range used by the
adjunct partition and also the VRMA virtual segment. We find the context
value that can result in generating such a VSID and reserve it early in
boot.

We don't look at the adjunct range, because for now we disable the
adjunct usage in a Linux LPAR via CAS interface.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Rewrite hash__reserve_context_id(), move the rest into pseries]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-01 21:12:27 +11:00
Aneesh Kumar K.V 52b1e66587 powerpc/mm: Move copy_mm_to_paca to paca.c
We also update the function arg to struct mm_struct. Move this so that function
finds the definition of struct mm_struct. No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31 23:09:54 +11:00
Aneesh Kumar K.V 6aa59f5162 powerpc/mm: Move hash specific pte bits to be top bits of RPN
We don't support the full 57 bits of physical address and hence can
overload the top bits of RPN as hash specific pte bits.

Add a BUILD_BUG_ON() to enforce the relationship between H_PAGE_F_SECOND
and H_PAGE_F_GIX.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
[mpe: Move the BUILD_BUG_ON() into hash_utils_64.c and comment it]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31 23:09:52 +11:00
Ingo Molnar 589ee62844 sched/headers: Prepare to remove the <linux/mm_types.h> dependency from <linux/sched.h>
Update code that relied on sched.h including various MM types for them.

This will allow us to remove the <linux/mm_types.h> include from <linux/sched.h>.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:37 +01:00
David Gibson 438cc81a41 powerpc/pseries: Automatically resize HPT for memory hot add/remove
We've now implemented code in the pseries platform to use the new PAPR
interface to allow resizing the hash page table (HPT) at runtime.

This patch uses that interface to automatically attempt to resize the HPT
when memory is hot added or removed.  This tries to always keep the HPT at
a reasonable size for our current memory size.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-02-10 13:28:02 +11:00
David Gibson dbcf929c00 powerpc/pseries: Add support for hash table resizing
This adds support for using two hypercalls to change the size of the
main hash page table while running as a PAPR guest. For now these
hypercalls are only in experimental qemu versions.

The interface is two part: first H_RESIZE_HPT_PREPARE is used to
allocate and prepare the new hash table. This may be slow, but can be
done asynchronously. Then, H_RESIZE_HPT_COMMIT is used to switch to the
new hash table. This requires that no CPUs be concurrently updating the
HPT, and so must be run under stop_machine().

This also adds a debugfs file which can be used to manually control
HPT resizing or testing purposes.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Paul Mackerras <paulus@samba.org>
[mpe: Rename the debugfs file to "hpt_order"]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-02-10 13:27:55 +11:00
Reza Arbab 32b53c012e powerpc/mm: Fix memory hotplug BUG() on radix
Memory hotplug is leading to hash page table calls, even on radix:

  arch_add_memory
    create_section_mapping
      htab_bolt_mapping
        BUG_ON(!ppc_md.hpte_insert);

To fix, refactor {create,remove}_section_mapping() into hash__ and
radix__ variants. Leave the radix versions stubbed for now.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-17 10:05:43 +11:00
Linus Torvalds 7c0f6ba682 Replace <asm/uaccess.h> with <linux/uaccess.h> globally
This was entirely automated, using the script by Al:

  PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
  sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
        $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-24 11:46:01 -08:00
Linus Torvalds 93173b5bf2 Small release, the most interesting stuff is x86 nested virt improvements.
x86: userspace can now hide nested VMX features from guests; nested
 VMX can now run Hyper-V in a guest; support for AVX512_4VNNIW and
 AVX512_FMAPS in KVM; infrastructure support for virtual Intel GPUs.
 
 PPC: support for KVM guests on POWER9; improved support for interrupt
 polling; optimizations and cleanups.
 
 s390: two small optimizations, more stuff is in flight and will be
 in 4.11.
 
 ARM: support for the GICv3 ITS on 32bit platforms.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQExBAABCAAbBQJYTkP0FBxwYm9uemluaUByZWRoYXQuY29tAAoJEL/70l94x66D
 lZIH/iT1n9OQXcuTpYYnQhuCenzI3GZZOIMTbCvK2i5bo0FIJKxVn0EiAAqZSXvO
 nO185FqjOgLuJ1AD1kJuxzye5suuQp4HIPWWgNHcexLuy43WXWKZe0IQlJ4zM2Xf
 u31HakpFmVDD+Cd1qN3yDXtDrRQ79/xQn2kw7CWb8olp+pVqwbceN3IVie9QYU+3
 gCz0qU6As0aQIwq2PyalOe03sO10PZlm4XhsoXgWPG7P18BMRhNLTDqhLhu7A/ry
 qElVMANT7LSNLzlwNdpzdK8rVuKxETwjlc1UP8vSuhrwad4zM2JJ1Exk26nC2NaG
 D0j4tRSyGFIdx6lukZm7HmiSHZ0=
 =mkoB
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "Small release, the most interesting stuff is x86 nested virt
  improvements.

  x86:
   - userspace can now hide nested VMX features from guests
   - nested VMX can now run Hyper-V in a guest
   - support for AVX512_4VNNIW and AVX512_FMAPS in KVM
   - infrastructure support for virtual Intel GPUs.

  PPC:
   - support for KVM guests on POWER9
   - improved support for interrupt polling
   - optimizations and cleanups.

  s390:
   - two small optimizations, more stuff is in flight and will be in
     4.11.

  ARM:
   - support for the GICv3 ITS on 32bit platforms"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (94 commits)
  arm64: KVM: pmu: Reset PMSELR_EL0.SEL to a sane value before entering the guest
  KVM: arm/arm64: timer: Check for properly initialized timer on init
  KVM: arm/arm64: vgic-v2: Limit ITARGETSR bits to number of VCPUs
  KVM: x86: Handle the kthread worker using the new API
  KVM: nVMX: invvpid handling improvements
  KVM: nVMX: check host CR3 on vmentry and vmexit
  KVM: nVMX: introduce nested_vmx_load_cr3 and call it on vmentry
  KVM: nVMX: propagate errors from prepare_vmcs02
  KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT
  KVM: nVMX: load GUEST_EFER after GUEST_CR0 during emulated VM-entry
  KVM: nVMX: generate MSR_IA32_CR{0,4}_FIXED1 from guest CPUID
  KVM: nVMX: fix checks on CR{0,4} during virtual VMX operation
  KVM: nVMX: support restore of VMX capability MSRs
  KVM: nVMX: generate non-true VMX MSRs based on true versions
  KVM: x86: Do not clear RFLAGS.TF when a singlestep trap occurs.
  KVM: x86: Add kvm_skip_emulated_instruction and use it.
  KVM: VMX: Move skip_emulated_instruction out of nested_vmx_check_vmcs12
  KVM: VMX: Reorder some skip_emulated_instruction calls
  KVM: x86: Add a return value to kvm_emulate_cpuid
  KVM: PPC: Book3S: Move prototypes for KVM functions into kvm_ppc.h
  ...
2016-12-13 15:47:02 -08:00