AT instructions do a translation table walk and return the result, or
the fault in PAR_EL1. KVM uses these to find the IPA when the value is
not provided by the CPU in HPFAR_EL1.
If a translation table walk causes an external abort it is taken as an
exception, even if it was due to an AT instruction. (DDI0487F.a's D5.2.11
"Synchronous faults generated by address translation instructions")
While we previously made KVM resilient to exceptions taken due to AT
instructions, the device access causes mismatched attributes, and may
occur speculatively. Prevent this, by forbidding a walk through memory
described as device at stage2. Now such AT instructions will report a
stage2 fault.
Such a fault will cause KVM to restart the guest. If the AT instructions
always walk the page tables, but guest execution uses the translation cached
in the TLB, the guest can't make forward progress until the TLB entry is
evicted. This isn't a problem, as since commit 5dcd0fdbb4 ("KVM: arm64:
Defer guest entry when an asynchronous exception is pending"), KVM will
return to the host to process IRQs allowing the rest of the system to keep
running.
Cc: stable@vger.kernel.org # <v5.3: 5dcd0fdbb4 ("KVM: arm64: Defer guest entry when an asynchronous exception is pending")
Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
KVM doesn't expect any synchronous exceptions when executing, any such
exception leads to a panic(). AT instructions access the guest page
tables, and can cause a synchronous external abort to be taken.
The arm-arm is unclear on what should happen if the guest has configured
the hardware update of the access-flag, and a memory type in TCR_EL1 that
does not support atomic operations. B2.2.6 "Possible implementation
restrictions on using atomic instructions" from DDI0487F.a lists
synchronous external abort as a possible behaviour of atomic instructions
that target memory that isn't writeback cacheable, but the page table
walker may behave differently.
Make KVM robust to synchronous exceptions caused by AT instructions.
Add a get_user() style helper for AT instructions that returns -EFAULT
if an exception was generated.
While KVM's version of the exception table mixes synchronous and
asynchronous exceptions, only one of these can occur at each location.
Re-enter the guest when the AT instructions take an exception on the
assumption the guest will take the same exception. This isn't guaranteed
to make forward progress, as the AT instructions may always walk the page
tables, but guest execution may use the translation cached in the TLB.
This isn't a problem, as since commit 5dcd0fdbb4 ("KVM: arm64: Defer guest
entry when an asynchronous exception is pending"), KVM will return to the
host to process IRQs allowing the rest of the system to keep running.
Cc: stable@vger.kernel.org # <v5.3: 5dcd0fdbb4 ("KVM: arm64: Defer guest entry when an asynchronous exception is pending")
Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
KVM has a one instruction window where it will allow an SError exception
to be consumed by the hypervisor without treating it as a hypervisor bug.
This is used to consume asynchronous external abort that were caused by
the guest.
As we are about to add another location that survives unexpected exceptions,
generalise this code to make it behave like the host's extable.
KVM's version has to be mapped to EL2 to be accessible on nVHE systems.
The SError vaxorcism code is a one instruction window, so has two entries
in the extable. Because the KVM code is copied for VHE and nVHE, we end up
with four entries, half of which correspond with code that isn't mapped.
Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
vdso32 should only be installed if CONFIG_COMPAT_VDSO is enabled,
since it's not even supposed to be compiled otherwise, and arm64
builds without a 32bit crosscompiler will fail.
Fixes: 8d75785a81 ("ARM64: vdso32: Install vdso32 from vdso_install")
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Cc: stable@vger.kernel.org [5.4+]
Link: https://lore.kernel.org/r/20200827234012.19757-1-fllinden@amazon.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Commit 7c78f67e9b ("arm64: enable tlbi range instructions") breaks
LLVM's integrated assembler, because -Wa,-march is only passed to
external assemblers and therefore, the new instructions are not enabled
when IAS is used.
This change adds a common architecture version preamble, which can be
used in inline assembly blocks that contain instructions that require
a newer architecture version, and uses it to fix __TLBI_0 and __TLBI_1
with ARM64_TLB_RANGE.
Fixes: 7c78f67e9b ("arm64: enable tlbi range instructions")
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Link: https://github.com/ClangBuiltLinux/linux/issues/1106
Link: https://lore.kernel.org/r/20200827203608.1225689-1-samitolvanen@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Commit e49aa9a9bd22 ("mfd: core: Make a best effort attempt to match
devices with the correct of_nodes") changed the semantics for disabled
devices in mfd_add_device(). Instead of silently ignoring a disabled
child device, an error was returned. On receipt of the error
mfd_add_devices() the precedes to remove *all* child devices and
returns an all-failed error to the caller, which will inevitably fail
the parent device as well.
This patch reverts back to the old semantics and ignores child devices
which are disabled in Device Tree.
Fixes: e49aa9a9bd22 ("mfd: core: Make a best effort attempt to match devices with the correct of_nodes")
Reported-by: Icenowy Zheng <icenowy@aosc.io>
Tested-by: Icenowy Zheng <icenowy@aosc.io>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
The PSZ-HA* family of USB disk drives from Sony can't handle the
REPORT OPCODES command when using the UAS protocol. This patch adds
an appropriate quirks entry.
Reported-and-tested-by: Till Dörges <doerges@pre-sense.de>
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
CC: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20200826143229.GB400430@rowland.harvard.edu
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 3b5408b98e ("md/raid5: support config stripe_size by sysfs
entry") make stripe_size as a configurable value. It just requires
stripe_size as multiple of 4KB.
In fact, we should make sure stripe_size as power of two. Otherwise,
stripe_shift which is the result of ilog2 can not represent the real
stripe_size. Then, stripe_hash() and stripe_hash_locks_hash() may
get unexpected value.
Fixes: 3b5408b98e ("md/raid5: support config stripe_size by sysfs entry")
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
These events happen inline from submission, so there's no need to
bounce them through the original task. Just set them up for retry
and issue retry directly instead of going over task_work.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This normally isn't hit, as polling is mostly done on NVMe with deep
queue depths. But if we do run into request starvation, we need to
ensure that retries are properly serialized.
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fallthrough annotations for consecutive default and case labels
are not necessary.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
The fall through annotation comes after a return statement so it's not
reachable.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
After
b9cae27728 ("EDAC/ghes: Scan the system once on driver init")
and with CONFIG_DEBUG_TEST_DRIVER_REMOVE enabled, ghes_hw.dimms becomes
a NULL pointer after the second ->probe() (aka ghes_edac_register())
which the config option causes to be called.
This happens because the static variable which holds down whether
the system has been scanned already, doesn't get reset in
ghes_edac_unregister(). Then, on the second probe, ghes_scan_system()
doesn't get to enumerate the DIMMs, leading to ghes_hw.dimms remaining
NULL.
Clear the variable and rename it to something more descriptive so that a
second probe succeeds.
[ bp: Rewrite commit message. ]
Fixes: b9cae27728 ("EDAC/ghes: Scan the system once on driver init")
Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200827140450.1620-1-shiju.jose@huawei.com
The iwd daemon uses libell which sets up the skcipher operation with
two separate control messages. As the first control message is sent
without MSG_MORE, it is interpreted as an empty request.
While libell should be fixed to use MSG_MORE where appropriate, this
patch works around the bug in the kernel so that existing binaries
continue to work.
We will print a warning however.
A separate issue is that the new kernel code no longer allows the
control message to be sent twice within the same request. This
restriction is obviously incompatible with what iwd was doing (first
setting an IV and then sending the real control message). This
patch changes the kernel so that this is explicitly allowed.
Reported-by: Caleb Jorden <caljorden@hotmail.com>
Fixes: f3c802a1f3 ("crypto: algif_aead - Only wake up when...")
Cc: <stable@vger.kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
It should be considered an illegal operation if the ULP attempts to modify
a QP from another state to the current hardware state. Otherwise, the ULP
can modify some fields of QPC at any time. For example, for a QP in state
of RTS, modify it from RTR to RTS can change the PSN, which is always not
as expected.
Fixes: 9a4435375c ("IB/hns: Add driver files for hns RoCE driver")
Link: https://lore.kernel.org/r/1598353674-24270-1-git-send-email-liweihang@huawei.com
Signed-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Now that the query_pkey() isn't mandatory by the RDMA core, this callback
can be removed from the usnic provider. The libfabric userspace never
touches the pkey.
Link: https://lore.kernel.org/r/20200820125346.111902-1-kamalheib1@gmail.com
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Use cancel_work_sync() to ensure that the wq is not running and simply
assign NULL to ctx->cm_id to indicate if the work ran or not. Delete the
close_wq since flush_workqueue() is no longer needed.
Link: https://lore.kernel.org/r/20200818120526.702120-15-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
When a new connection is established the RDMA CM creates a new cm_id and
passes it through to the event handler. However inside the UCMA the new ID
is not assigned a ucma_context until the user retrieves the event from a
syscall.
This creates a weird edge condition where a cm_id's context can continue
to point at the listening_id that created it, and a number of additional
edge conditions on event list clean up related to destroying half created
IDs.
There is also a race condition in ucma_get_events() where the
cm_id->context is being assigned without holding the handler_mutex.
Simplify all of this by creating the ucma_context inside the event handler
itself and eliminating the edge case of a half created cm_id. All cm_id's
can be uniformly destroyed via __destroy_id() or via the close_work.
Link: https://lore.kernel.org/r/20200818120526.702120-14-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Since the backlog is now an atomic the file->mut is now only protecting
the event_list and ctx_list. Narrow its scope to make it clear
Link: https://lore.kernel.org/r/20200818120526.702120-13-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
All entry points to the rdma_cm from a ULP must be single threaded,
even this error unwinds. Add the missing locking.
Fixes: 7c11910783 ("RDMA/ucma: Put a lock around every call to the rdma_cm layer")
Link: https://lore.kernel.org/r/20200818120526.702120-11-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This value is locked under the file->mut, ensure it is held whenever
touching it.
The case in ucma_migrate_id() is a race, while in ucma_free_uctx() it is
already not possible for the write side to run, the movement is just for
clarity.
Fixes: 88314e4dda ("RDMA/cma: add support for rdma_migrate_id()")
Link: https://lore.kernel.org/r/20200818120526.702120-10-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
ctx->file is changed under the file->mut lock by ucma_migrate_id(), which
is impossible to lock correctly. Instead change ctx->file under the
handler_lock and ctx_table lock and revise all places touching ctx->file
to use this locking when reading ctx->file.
Link: https://lore.kernel.org/r/20200818120526.702120-9-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The only reader of destroying is inside a handler under the handler_mutex,
so directly use the handler_mutex when setting it instead of the larger
file->mut.
As the refcount could be zero here, and the cm_id already freed, and
additional refcount grab around the locking is required to touch the
cm_id.
Link: https://lore.kernel.org/r/20200818120526.702120-8-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
In almost all cases rdma_accept() is called under the handler_mutex by
ULPs from their handler callbacks. The one exception was ucma which did
not get the handler_mutex.
To improve the understand-ability of the locking scheme obtain the mutex
for ucma as well.
This improves how ucma works by allowing it to directly use handler_mutex
for some of its internal locking against the handler callbacks intead of
the global file->mut lock.
There does not seem to be a serious bug here, other than a DISCONNECT event
can be delivered concurrently with accept succeeding.
Link: https://lore.kernel.org/r/20200818120526.702120-7-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
It is not really necessary to keep a linked list of mcs associated with
each context when we can just scan the xarray to find the right things.
The removes another overloading of file->mut by relying on the xarray
locking for mc instead.
Link: https://lore.kernel.org/r/20200818120526.702120-6-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The store to ctx->cm_id was based on the idea that _ucma_find_context()
would not return the ctx until it was fully setup.
Without locking this doesn't work properly.
Split things so that the xarray is allocated with NULL to reserve the ID
and once everything is final set the cm_id and store.
Along the way this shows that the error unwind in ucma_get_event() if a
new ctx is created is wrong, fix it up.
Link: https://lore.kernel.org/r/20200818120526.702120-5-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
ucma_close() is open coding the tail end of ucma_destroy_id(), consolidate
this duplicated code into a function.
Link: https://lore.kernel.org/r/20200818120526.702120-4-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
During the file_operations release function it is already not possible
that write() can be running concurrently, remove the extra locking
around the ctx_list.
Link: https://lore.kernel.org/r/20200818120526.702120-3-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Both ucma_destroy_id() and ucma_close_id() (triggered from an event via a
wq) can drive the refcount to zero. ucma_get_ctx() was wrongly assuming
that the refcount can only go to zero from ucma_destroy_id() which also
removes it from the xarray.
Use refcount_inc_not_zero() instead.
Link: https://lore.kernel.org/r/20200818120526.702120-2-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
When DCT QPs work in RoCE LAG mode:
1. DCT creation is allowed only when it is supported
2. The "port" of a DCT QP is assigned in a round-robin way
Link: https://lore.kernel.org/r/20200818115245.700581-3-leon@kernel.org
Signed-off-by: Mark Zhang <markz@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
DCI QP supports tx_affinity as well.
Link: https://lore.kernel.org/r/20200818115245.700581-2-leon@kernel.org
Signed-off-by: Mark Zhang <markz@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The relation can't be invalid here, so if it turns out to be invalid,
just WARN_ON_ONCE() and return 0.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
"cpufreq_driver" is guaranteed to be valid here, no need to check it
here.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
cpuidle stop state implementation has minor optimizations for P10
where hardware preserves more SPR registers compared to P9. The
current P9 driver works for P10, although does few extra
save-restores. P9 driver can provide the required power management
features like SMT thread folding and core level power savings on a P10
platform.
Until the P10 stop driver is available, revert the commit which allows
for only P9 systems to utilize cpuidle and blocks all idle stop states
for P10. CPU idle states are enabled and tested on the P10 platform
with this fix.
This reverts commit 8747bf36f3.
Fixes: 8747bf36f3 ("powerpc/powernv/idle: Replace CPU feature check with PVR check")
Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>
Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200826082918.89306-1-psampat@linux.ibm.com
IMC trace-mode uses MSR[HV/PR] bits to set the cpumode for the
instruction pointer captured in each sample. The bits are fetched from
the third double word of the trace record. Reading third double word
from IMC trace record should use be64_to_cpu() along with READ_ONCE
inorder to fetch correct MSR[HV/PR] bits. Patch addresses this change.
Currently we are using PERF_RECORD_MISC_HYPERVISOR as cpumode if MSR
HV is 1 and PR is 0 which means the address is from host counter. But
using PERF_RECORD_MISC_HYPERVISOR for host counter data will fail to
resolve the address -> symbol during "perf report" because perf tools
side uses PERF_RECORD_MISC_KERNEL to represent the host counter data.
Therefore, fix the trace imc sample data to use
PERF_RECORD_MISC_KERNEL as cpumode for host kernel information.
Fixes: 77ca3951cc ("powerpc/perf: Add kernel support for new MSR[HV PR] bits in trace-imc")
Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1598424029-1662-1-git-send-email-atrajeev@linux.vnet.ibm.com
The bhrb_filter_map ("The Branch History Rolling Buffer") callback is
only defined in raw CPUs' power_pmu structs. The "architected" CPUs
use generic_compat_pmu, which does not have this callback, and crashes
occur if a user tries to enable branch stack for an event.
This add a NULL pointer check for bhrb_filter_map() which behaves as
if the callback returned an error.
This does not add the same check for config_bhrb() as the only caller
checks for cpuhw->bhrb_users which remains zero if bhrb_filter_map==0.
Fixes: be80e758d0 ("powerpc/perf: Add generic compat mode pmu driver")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200602025612.62707-1-aik@ozlabs.ru
The recent commit 01eb01877f ("powerpc/64s: Fix restore_math
unnecessarily changing MSR") changed some of the handling of floating
point/vector restore.
In particular it caused current->thread.fpexc_mode to be copied into
the current MSR (via msr_check_and_set()), rather than just into
regs->msr (which is moved into MSR on return to userspace).
This can lead to a crash in the kernel if we take a floating point
exception when restoring FPSCR:
Oops: Exception in kernel mode, sig: 8 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in:
CPU: 3 PID: 101213 Comm: ld64.so.2 Not tainted 5.9.0-rc1-00098-g18445bf405cb-dirty #9
NIP: c00000000000fbb4 LR: c00000000001a7ac CTR: c000000000183570
REGS: c0000016b7cfb3b0 TRAP: 0700 Not tainted (5.9.0-rc1-00098-g18445bf405cb-dirty)
MSR: 900000000290b933 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 44002444 XER: 00000000
CFAR: c00000000001a7a8 IRQMASK: 1
GPR00: c00000000001ae40 c0000016b7cfb640 c0000000011b7f00 c000001542a0f740
GPR04: c000001542a0f720 c000001542a0eb00 0000000000000900 c000001542a0eb00
GPR08: 000000000000000a 0000000000002000 9000000000009033 0000000000000000
GPR12: 0000000000004000 c0000017ffffd900 0000000000000001 c000000000df5a58
GPR16: c000000000e19c18 c0000000010e1123 0000000000000001 c000000000e1a638
GPR20: 0000000000000000 c0000000044b1d00 0000000000000000 c000001542a0f2a0
GPR24: 00000016c7fe0000 c000001542a0f720 c000000001c93da0 c000000000fe5f28
GPR28: c000001542a0f720 0000000000800000 c0000016b7cfbe90 0000000002802900
NIP load_fp_state+0x4/0x214
LR restore_math+0x17c/0x1f0
Call Trace:
0xc0000016b7cfb680 (unreliable)
__switch_to+0x330/0x460
__schedule+0x318/0x920
schedule+0x74/0x140
schedule_timeout+0x318/0x3f0
wait_for_completion+0xc8/0x210
call_usermodehelper_exec+0x234/0x280
do_coredump+0xedc/0x13c0
get_signal+0x1d4/0xbe0
do_notify_resume+0x1a0/0x490
interrupt_exit_user_prepare+0x1c4/0x230
interrupt_return+0x14/0x1c0
Instruction dump:
ebe10168 e88101a0 7c8ff120 382101e0 e8010010 7c0803a6 4e800020 790605c4
782905c4 7c0008a8 7c0008a8 c8030200 <fffe058e> 48000088 c8030000 c8230010
Fix it by only loading the fpexc_mode value into regs->msr.
Also add a comment to explain that although VSX is subject to the
value of fpexc_mode, we don't have to handle that separately because
we only allow VSX to be enabled if FP is also enabled.
Fixes: 01eb01877f ("powerpc/64s: Fix restore_math unnecessarily changing MSR")
Reported-by: Milton Miller <miltonm@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Link: https://lore.kernel.org/r/20200825093424.3967813-1-mpe@ellerman.id.au
Kernel entry sets PPR to HMT_MEDIUM by convention. The scv entry
path missed this.
Fixes: 7fa95f9ada ("powerpc/64s: system call support for scv/rfscv instructions")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200825075309.224184-1-npiggin@gmail.com
Fix malformed table warning in powerpc/syscall64-abi.rst by making
two tables and moving the headings.
Documentation/powerpc/syscall64-abi.rst:53: WARNING: Malformed table.
Text in column margin in table line 2.
=========== ============= ========================================
--- For the sc instruction, differences with the ELF ABI ---
r0 Volatile (System call number.)
r3 Volatile (Parameter 1, and return value.)
r4-r8 Volatile (Parameters 2-6.)
cr0 Volatile (cr0.SO is the return error condition.)
cr1, cr5-7 Nonvolatile
lr Nonvolatile
--- For the scv 0 instruction, differences with the ELF ABI ---
r0 Volatile (System call number.)
r3 Volatile (Parameter 1, and return value.)
r4-r8 Volatile (Parameters 2-6.)
=========== ============= ========================================
Fixes: 7fa95f9ada ("powerpc/64s: system call support for scv/rfscv instructions")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/e06de4d3-a36f-2745-9775-467e125436cc@infradead.org
The build is currently broken, if COMPILE_TEST=y and PPC_PMAC=n:
linux/drivers/video/fbdev/controlfb.c: In function ‘control_set_hardware’:
linux/drivers/video/fbdev/controlfb.c:276:2: error: implicit declaration of function ‘btext_update_display’
276 | btext_update_display(p->frame_buffer_phys + CTRLFB_OFF,
| ^~~~~~~~~~~~~~~~~~~~
Fix it by including btext.h whenever CONFIG_BOOTX_TEXT is enabled.
Fixes: a07a63b0e2 ("video: fbdev: controlfb: add COMPILE_TEST support")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Link: https://lore.kernel.org/r/20200821104910.3363818-1-mpe@ellerman.id.au
Several people reported that 5.8 broke the interrupt affinity setting
mechanism.
The consolidation of the entry code reused the regular exception entry code
for device interrupts and changed the way how the vector number is conveyed
from ptregs->orig_ax to a function argument.
The low level entry uses the hardware error code slot to push the vector
number onto the stack which is retrieved from there into a function
argument and the slot on stack is set to -1.
The reason for setting it to -1 is that the error code slot is at the
position where pt_regs::orig_ax is. A positive value in pt_regs::orig_ax
indicates that the entry came via a syscall. If it's not set to a negative
value then a signal delivery on return to userspace would try to restart a
syscall. But there are other places which rely on pt_regs::orig_ax being a
valid indicator for syscall entry.
But setting pt_regs::orig_ax to -1 has a nasty side effect vs. the
interrupt affinity setting mechanism, which was overlooked when this change
was made.
Moving interrupts on x86 happens in several steps. A new vector on a
different CPU is allocated and the relevant interrupt source is
reprogrammed to that. But that's racy and there might be an interrupt
already in flight to the old vector. So the old vector is preserved until
the first interrupt arrives on the new vector and the new target CPU. Once
that happens the old vector is cleaned up, but this cleanup still depends
on the vector number being stored in pt_regs::orig_ax, which is now -1.
That -1 makes the check for cleanup: pt_regs::orig_ax == new_vector
always false. As a consequence the interrupt is moved once, but then it
cannot be moved anymore because the cleanup of the old vector never
happens.
There would be several ways to convey the vector information to that place
in the guts of the interrupt handling, but on deeper inspection it turned
out that this check is pointless and a leftover from the old affinity model
of X86 which supported multi-CPU affinities. Under this model it was
possible that an interrupt had an old and a new vector on the same CPU, so
the vector match was required.
Under the new model the effective affinity of an interrupt is always a
single CPU from the requested affinity mask. If the affinity mask changes
then either the interrupt stays on the CPU and on the same vector when that
CPU is still in the new affinity mask or it is moved to a different CPU, but
it is never moved to a different vector on the same CPU.
Ergo the cleanup check for the matching vector number is not required and
can be removed which makes the dependency on pt_regs:orig_ax go away.
The remaining check for new_cpu == smp_processsor_id() is completely
sufficient. If it matches then the interrupt was successfully migrated and
the cleanup can proceed.
For paranoia sake add a warning into the vector assignment code to
validate that the assumption of never moving to a different vector on
the same CPU holds.
Fixes: 633260fa14 ("x86/irq: Convey vector as argument and not in ptregs")
Reported-by: Alex bykov <alex.bykov@scylladb.com>
Reported-by: Avi Kivity <avi@scylladb.com>
Reported-by: Alexander Graf <graf@amazon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Alexander Graf <graf@amazon.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87wo1ltaxz.fsf@nanos.tec.linutronix.de
There is a race when taking a CPU offline. Current code looks like this:
native_cpu_disable()
{
...
apic_soft_disable();
/*
* Any existing set bits for pending interrupt to
* this CPU are preserved and will be sent via IPI
* to another CPU by fixup_irqs().
*/
cpu_disable_common();
{
....
/*
* Race window happens here. Once local APIC has been
* disabled any new interrupts from the device to
* the old CPU are lost
*/
fixup_irqs(); // Too late to capture anything in IRR.
...
}
}
The fix is to disable the APIC *after* cpu_disable_common().
Testing was done with a USB NIC that provided a source of frequent
interrupts. A script migrated interrupts to a specific CPU and
then took that CPU offline.
Fixes: 60dcaad573 ("x86/hotplug: Silence APIC and NMI when CPU is dead")
Reported-by: Evan Green <evgreen@chromium.org>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Tested-by: Evan Green <evgreen@chromium.org>
Reviewed-by: Evan Green <evgreen@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/lkml/875zdarr4h.fsf@nanos.tec.linutronix.de/
Link: https://lore.kernel.org/r/1598501530-45821-1-git-send-email-ashok.raj@intel.com