Suppose we have an environment with a number of non-NPIV FCP devices
(virtual HBAs / FCP devices / zfcp "adapter"s) sharing the same physical
FCP channel (HBA port) and its I_T nexus. Plus a number of storage target
ports zoned to such shared channel. Now one target port logs out of the
fabric causing an RSCN. Zfcp reacts with an ADISC ELS and subsequent port
recovery depending on the ADISC result. This happens on all such FCP
devices (in different Linux images) concurrently as they all receive a copy
of this RSCN. In the following we look at one of those FCP devices.
Requests other than FSF_QTCB_FCP_CMND can be slow until they get a
response.
Depending on which requests are affected by slow responses, there are
different recovery outcomes. Here we want to fix failed recoveries on port
or adapter level by avoiding recovery requests that can be slow.
We need the cached N_Port_ID for the remote port "link" test with ADISC.
Just before sending the ADISC, we now intentionally forget the old cached
N_Port_ID. The idea is that on receiving an RSCN for a port, we have to
assume that any cached information about this port is stale. This forces a
fresh new GID_PN [FC-GS] nameserver lookup on any subsequent recovery for
the same port. Since we typically can still communicate with the nameserver
efficiently, we now reach steady state quicker: Either the nameserver still
does not know about the port so we stop recovery, or the nameserver already
knows the port potentially with a new N_Port_ID and we can successfully and
quickly perform open port recovery. For the one case, where ADISC returns
successfully, we re-initialize port->d_id because that case does not
involve any port recovery.
This also solves a problem if the storage WWPN quickly logs into the fabric
again but with a different N_Port_ID. Such as on virtual WWPN takeover
during target NPIV failover.
[https://www.redbooks.ibm.com/abstracts/redp5477.html] In that case the
RSCN from the storage FDISC was ignored by zfcp and we could not
successfully recover the failover. On some later failback on the storage,
we could have been lucky if the virtual WWPN got the same old N_Port_ID
from the SAN switch as we still had cached. Then the related RSCN
triggered a successful port reopen recovery. However, there is no
guarantee to get the same N_Port_ID on NPIV FDISC.
Even though NPIV-enabled FCP devices are not affected by this problem, this
code change optimizes recovery time for gone remote ports as a side effect.
The timely drop of cached N_Port_IDs prevents unnecessary slow open port
attempts.
While the problem might have been in code before v2.6.32 commit
799b76d09a ("[SCSI] zfcp: Decouple gid_pn requests from erp") this fix
depends on the gid_pn_work introduced with that commit, so we mark it as
culprit to satisfy fix dependencies.
Note: Point-to-point remote port is already handled separately and gets its
N_Port_ID from the cached peer_d_id. So resetting port->d_id in general
does not affect PtP.
Link: https://lore.kernel.org/r/20220118165803.3667947-1-maier@linux.ibm.com
Fixes: 799b76d09a ("[SCSI] zfcp: Decouple gid_pn requests from erp")
Cc: <stable@vger.kernel.org> #2.6.32+
Suggested-by: Benjamin Block <bblock@linux.ibm.com>
Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
Signed-off-by: Steffen Maier <maier@linux.ibm.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
According to the comment in check_fw_ready() we should not check the
IOP1_READY field in register SCRATCH_PAD_1 for 8008 or 8009 controllers.
However we check this very field in process_oq() for processing the highest
index interrupt vector. The highest interrupt vector is checked as the FW
is programmed to signal fatal errors through this irq.
Change that function to not check IOP1_READY for those mentioned
controllers, but do check ILA_READY in both cases.
The reason I assume that this was not hit earlier was because we always
allocated 64 MSI(X), and just did not pass the vector index check in
process_oq(), i.e. the handler never ran for vector index 63.
Link: https://lore.kernel.org/r/1642508105-95432-1-git-send-email-john.garry@huawei.com
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
If FCoE is not configured, libfc/libfcoe keeps on retrying FLOGI and after
3 retries driver does a context reset and tries fipvlan again. This leads
to context reset message flooding the logs. Hence ratelimit the message to
prevent flooding the logs.
Link: https://lore.kernel.org/r/20220117135311.6256-4-njavali@marvell.com
Signed-off-by: Saurav Kashyap <skashyap@marvell.com>
Signed-off-by: Nilesh Javali <njavali@marvell.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
devm_kstrdup() returns pointer to allocated string on success, NULL on
failure. So it is better to check the return value of it.
Link: https://lore.kernel.org/r/tencent_4257E15D4A94FF9020DDCC4BB9B21C041408@qq.com
Reviewed-by: Bean Huo <beanhuo@micron.com>
Signed-off-by: Xiaoke Wang <xkernel.wang@foxmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
iscsit_tpg_check_network_portal() has nested for_each loops and is supposed
to return true when a match is found. However, the tpg loop will still
continue after existing the tpg_np loop. If this tpg_np is not the last the
match value will be changed.
Break the outer loop after finding a match and make sure the np under each
tpg is unique.
Link: https://lore.kernel.org/r/20220111054742.19582-1-mingzhe.zou@easystack.cn
Signed-off-by: ZouMingzhe <mingzhe.zou@easystack.cn>
Reviewed-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
GFP_KERNEL/GFP_DMA can't be used under a spin lock. According the comment,
els_ios_lock is used to protect els ios list so we can move down the spin
lock to avoid using this flag under the lock.
Link: https://lore.kernel.org/r/20220111012441.3232527-1-yangyingliang@huawei.com
Fixes: 8f406ef728 ("scsi: elx: libefc: Extended link Service I/O handling")
Reported-by: Hulk Robot <hulkci@huawei.com>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The function performs a check on the "phy" input parameter, however, it
is used before the check.
Initialize the "dev" variable after the sanity check to avoid a possible
NULL pointer dereference.
Fixes: 5c82902844 ("drm/msm/dsi: Split PHY drivers to separate files")
Addresses-Coverity-ID: 1493860 ("Null pointer dereference")
Signed-off-by: José Expósito <jose.exposito89@gmail.com>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Link: https://lore.kernel.org/r/20220116181844.7400-1-jose.exposito89@gmail.com
Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
The function performs a check on the "ctx" input parameter, however, it
is used before the check.
Initialize the "base" variable after the sanity check to avoid a
possible NULL pointer dereference.
Fixes: 4259ff7ae5 ("drm/msm/dpu: add support for pcc color block in dpu driver")
Addresses-Coverity-ID: 1493866 ("Null pointer dereference")
Signed-off-by: José Expósito <jose.exposito89@gmail.com>
Link: https://lore.kernel.org/r/20220109192431.135949-1-jose.exposito89@gmail.com
Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
The reference taken by 'of_find_device_by_node()' must be released when
not needed anymore.
Add the corresponding 'put_device()' in the error handling path.
Fixes: e00012b256 ("drm/msm/hdmi: Make HDMI core get its PHY")
Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Link: https://lore.kernel.org/r/20220107085026.23831-1-linmq006@gmail.com
Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
If of_find_device_by_node() succeeds, dsi_get_phy() doesn't
a corresponding put_device(). Thus add put_device() to fix the exception
handling.
Fixes: ec31abf ("drm/msm/dsi: Separate PHY to another platform device")
Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Link: https://lore.kernel.org/r/20211230070943.18116-1-linmq006@gmail.com
Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
The code that uses variable mdss has been removed, So the declaration
and assignment of the variable can be removed.
Eliminate the following clang warning:
drivers/gpu/drm/msm/msm_drv.c:513:19: warning: variable 'mdss' set but
not used [-Wunused-but-set-variable]
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Fixes: 2027e5b341 ("drm/msm: Initialize MDSS irq domain at probe time")
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Link: https://lore.kernel.org/r/20211216031103.34146-1-yang.lee@linux.alibaba.com
Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Paweł Marciniak reports the following crash, observed when clearing
the chassis intrusion alarm.
BUG: kernel NULL pointer dereference, address: 0000000000000028
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 3 PID: 4815 Comm: bash Tainted: G S 5.16.2-200.fc35.x86_64 #1
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97 Extreme4, BIOS P2.60A 05/03/2018
RIP: 0010:clear_caseopen+0x5a/0x120 [nct6775]
Code: 68 70 e8 e9 32 b1 e3 85 c0 0f 85 d2 00 00 00 48 83 7c 24 ...
RSP: 0018:ffffabcb02803dd8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000000
RDX: ffff8e8808192880 RSI: 0000000000000000 RDI: ffff8e87c7509a68
RBP: 0000000000000000 R08: 0000000000000001 R09: 000000000000000a
R10: 000000000000000a R11: f000000000000000 R12: 000000000000001f
R13: ffff8e87c7509828 R14: ffff8e87c7509a68 R15: ffff8e88494527a0
FS: 00007f4db9151740(0000) GS:ffff8e8ebfec0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000028 CR3: 0000000166b66001 CR4: 00000000001706e0
Call Trace:
<TASK>
kernfs_fop_write_iter+0x11c/0x1b0
new_sync_write+0x10b/0x180
vfs_write+0x209/0x2a0
ksys_write+0x4f/0xc0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
The problem is that the device passed to clear_caseopen() is the hwmon
device, not the platform device, and the platform data is not set in the
hwmon device. Store the pointer to sio_data in struct nct6775_data and
get if from there if needed.
Fixes: 2e7b988696 ("hwmon: (nct6775) Use superio_*() function pointers in sio_data.")
Cc: Denis Pauk <pauk.denis@gmail.com>
Cc: Bernhard Seibold <mail@bernhard-seibold.de>
Reported-by: Paweł Marciniak <pmarciniak@lodz.home.pl>
Tested-by: Denis Pauk <pauk.denis@gmail.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEEK3kIWJt9yTYMP3ehqclaivrt76kFAmHu4c0THG1rbEBwZW5n
dXRyb25peC5kZQAKCRCpyVqK+u3vqcJOB/0eZ4URSNZ1sf1LWYbKs+DAtr08R6Hf
xmjyNsefFCFbTwLC2OESfv51b/eZR0Bt9ZxqfuYmS63TSUUwCTNHMj/sSvKqWX/e
LzsNezz5A/8rsLjhIZALVWgunjOZxq45oXtMzmv5kswSAEjy0TOQLo4zki3/YxtA
ULfNJ9zpKtzkFr7OEM5uNU8VN1e5ioMiOclHZVMFL20pR6QOS4lvG+P+Or5lmUAE
Hb/sChOF6yqgeKGk3ErBL5VregphxgPTYF5G7PlyOidYaB9VjCPjTzPlJ9/L2hdS
mmQB0Ev2ChgKgCuAFt/R/JtNiZ6/a2tzTxDYxjQcDbP+kRjUNb3Tgcyx
=iMRA
-----END PGP SIGNATURE-----
Merge tag 'linux-can-fixes-for-5.17-20220124' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2022-01-24
The first patch updates the email address of Brian Silverman from his
former employer to his private address.
The next patch fixes DT bindings information for the tcan4x5x SPI CAN
driver.
The following patch targets the m_can driver and fixes the
introduction of FIFO bulk read support.
Another patch for the tcan4x5x driver, which fixes the max register
value for the regmap config.
The last patch for the flexcan driver marks the RX mailbox support for
the MCF5441X as support.
* tag 'linux-can-fixes-for-5.17-20220124' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
can: flexcan: mark RX via mailboxes as supported on MCF5441X
can: tcan4x5x: regmap: fix max register value
can: m_can: m_can_fifo_{read,write}: don't read or write from/to FIFO if length is 0
dt-bindings: can: tcan4x5x: fix mram-cfg RX FIFO config
mailmap: update email address of Brian Silverman
====================
Link: https://lore.kernel.org/r/20220124175955.3464134-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Most flexcan IP cores support 2 RX modes:
- FIFO
- mailbox
The flexcan IP core on the MCF5441X cannot receive CAN RTR messages
via mailboxes. However the mailbox mode is more performant. The commit
| 1c45f5778a ("can: flexcan: add ethtool support to change rx-rtr setting during runtime")
added support to switch from FIFO to mailbox mode on these cores.
After testing the mailbox mode on the MCF5441X by Angelo Dureghello,
this patch marks it (without RTR capability) as supported. Further the
IP core overview table is updated, that RTR reception via mailboxes is
not supported.
Link: https://lore.kernel.org/all/20220121084425.3141218-1-mkl@pengutronix.de
Tested-by: Angelo Dureghello <angelo@kernel-space.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
The MRAM of the tcan4x5x has a size of 2K and starts at 0x8000. There
are no further registers in the tcan4x5x making 0x87fc the biggest
addressable register.
This patch fixes the max register value of the regmap config from
0x8ffc to 0x87fc.
Fixes: 6e1caaf8ed ("can: tcan4x5x: fix max register value")
Link: https://lore.kernel.org/all/20220119064011.2943292-1-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
In order to optimize FIFO access, especially on m_can cores attached
to slow busses like SPI, in patch
| e39381770e ("can: m_can: Disable IRQs on FIFO bus errors")
bulk read/write support has been added to the m_can_fifo_{read,write}
functions.
That change leads to the tcan driver to call
regmap_bulk_{read,write}() with a length of 0 (for CAN frames with 0
data length). regmap treats this as an error:
| tcan4x5x spi1.0 tcan4x5x0: FIFO write returned -22
This patch fixes the problem by not calling the
cdev->ops->{read,write)_fifo() in case of a 0 length read/write.
Fixes: e39381770e ("can: m_can: Disable IRQs on FIFO bus errors")
Link: https://lore.kernel.org/all/20220114155751.2651888-1-mkl@pengutronix.de
Cc: stable@vger.kernel.org
Cc: Matt Kline <matt@bitbashing.io>
Cc: Chandrasekar Ramakrishnan <rcsekar@samsung.com>
Reported-by: Michael Anochin <anochin@photo-meter.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This tcan4x5x only comes with 2K of MRAM, a RX FIFO with a dept of 32
doesn't fit into the MRAM. Use a depth of 16 instead.
Fixes: 4edd396a19 ("dt-bindings: can: tcan4x5x: Add DT bindings for TCAN4x5X driver")
Link: https://lore.kernel.org/all/20220119062951.2939851-1-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Brian Silverman's address at bluerivertech.com is not valid anymore,
use Brian's private email address instead.
Link: https://lore.kernel.org/all/20220110082359.2019735-1-mkl@pengutronix.de
Cc: Brian Silverman <bsilver16384@gmail.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
When starting a defrag, we should update the writeback index of the
inode's mapping in case it currently has a value beyond the start of the
range we are defragging. This can help performance and often result in
getting less extents after writeback - for e.g., if the current value
of the writeback index sits somewhere in the middle of a range that
gets dirty by the defrag, then after writeback we can get two smaller
extents instead of a single, larger extent.
We used to have this before the refactoring in 5.16, but it was removed
without any reason to do so. Originally it was added in kernel 3.1, by
commit 2a0f7f5769 ("Btrfs: fix recursive auto-defrag"), in order to
fix a loop with autodefrag resulting in dirtying and writing pages over
and over, but some testing on current code did not show that happening,
at least with the test described in that commit.
So add back the behaviour, as at the very least it is a nice to have
optimization.
Fixes: 7b508037d4 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
CC: stable@vger.kernel.org # 5.16
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A defrag operation can dirty a lot of pages, specially if operating on
the entire file or a large file range. Any task dirtying pages should
periodically call balance_dirty_pages_ratelimited(), as stated in that
function's comments, otherwise they can leave too many dirty pages in
the system. This is what we did before the refactoring in 5.16, and
it should have remained, just like in the buffered write path and
relocation. So restore that behaviour.
Fixes: 7b508037d4 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
CC: stable@vger.kernel.org # 5.16
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When defragging we can end up collecting a range for defrag that has
already pages under delalloc (dirty), as long as the respective extent
map for their range is not mapped to a hole, a prealloc extent or
the extent map is from an old generation.
Most of the time that is harmless from a functional perspective at
least, however it can result in a deadlock:
1) At defrag_collect_targets() we find an extent map that meets all
requirements but there's delalloc for the range it covers, and we add
its range to list of ranges to defrag;
2) The defrag_collect_targets() function is called at defrag_one_range(),
after it locked a range that overlaps the range of the extent map;
3) At defrag_one_range(), while the range is still locked, we call
defrag_one_locked_target() for the range associated to the extent
map we collected at step 1);
4) Then finally at defrag_one_locked_target() we do a call to
btrfs_delalloc_reserve_space(), which will reserve data and metadata
space. If the space reservations can not be satisfied right away, the
flusher might be kicked in and start flushing delalloc and wait for
the respective ordered extents to complete. If this happens we will
deadlock, because both flushing delalloc and finishing an ordered
extent, requires locking the range in the inode's io tree, which was
already locked at defrag_collect_targets().
So fix this by skipping extent maps for which there's already delalloc.
Fixes: eb793cf857 ("btrfs: defrag: introduce helper to collect target file extents")
CC: stable@vger.kernel.org # 5.16
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Mark the start_backtrace() as notrace and NOKPROBE_SYMBOL
because this function is called from ftrace and lockdep to
get the caller address via return_address(). The lockdep
is used in kprobes, it should also be NOKPROBE_SYMBOL.
Fixes: b07f349966 ("arm64: stacktrace: Move start_backtrace() out of the header")
Cc: <stable@vger.kernel.org> # 5.13.x
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/164301227374.1433152.12808232644267107415.stgit@devnote2
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
In the WIN10 version of the Synthetic Video protocol with Hyper-V,
Hyper-V reports a list of supported resolutions as part of the protocol
negotiation. The driver calculates the maximum width and height from
the list of resolutions, and uses those maximums to validate any screen
resolution specified in the video= option on the kernel boot line.
This method of validation is incorrect. For example, the list of
supported resolutions could contain 1600x1200 and 1920x1080, both of
which fit in an 8 Mbyte frame buffer. But calculating the max width
and height yields 1920 and 1200, and 1920x1200 resolution does not fit
in an 8 Mbyte frame buffer. Unfortunately, this resolution is accepted,
causing a kernel fault when the driver accesses memory outside the
frame buffer.
Instead, validate the specified screen resolution by calculating
its size, and comparing against the frame buffer size. Delete the
code for calculating the max width and height from the list of
resolutions, since these max values have no use. Also add the
frame buffer size to the info message to aid in understanding why
a resolution might be rejected.
Fixes: 67e7cdb482 ("video: hyperv: hyperv_fb: Obtain screen resolution from Hyper-V host")
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Acked-by: Helge Deller <deller@gmx.de>
Link: https://lore.kernel.org/r/1642360711-2335-1-git-send-email-mikelley@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
The async parameter of hva_to_pfn_remapped() is not used, so remove it.
Signed-off-by: Xianting Tian <xianting.tian@linux.alibaba.com>
Message-Id: <20220124020456.156386-1-xianting.tian@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Zero vmcs.HOST_IA32_SYSENTER_ESP when initializing *constant* host state
if and only if SYSENTER cannot be used, i.e. the kernel is a 64-bit
kernel and is not emulating 32-bit syscalls. As the name suggests,
vmx_set_constant_host_state() is intended for state that is *constant*.
When SYSENTER is used, SYSENTER_ESP isn't constant because stacks are
per-CPU, and the VMCS must be updated whenever the vCPU is migrated to a
new CPU. The logic in vmx_vcpu_load_vmcs() doesn't differentiate between
"never loaded" and "loaded on a different CPU", i.e. setting SYSENTER_ESP
on VMCS load also handles setting correct host state when the VMCS is
first loaded.
Because a VMCS must be loaded before it is initialized during vCPU RESET,
zeroing the field in vmx_set_constant_host_state() obliterates the value
that was written when the VMCS was loaded. If the vCPU is run before it
is migrated, the subsequent VM-Exit will zero out MSR_IA32_SYSENTER_ESP,
leading to a #DF on the next 32-bit syscall.
double fault: 0000 [#1] SMP
CPU: 0 PID: 990 Comm: stable Not tainted 5.16.0+ #97
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
EIP: entry_SYSENTER_32+0x0/0xe7
Code: <9c> 50 eb 17 0f 20 d8 a9 00 10 00 00 74 0d 25 ff ef ff ff 0f 22 d8
EAX: 000000a2 EBX: a8d1300c ECX: a8d13014 EDX: 00000000
ESI: a8f87000 EDI: a8d13014 EBP: a8d12fc0 ESP: 00000000
DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 EFLAGS: 00210093
CR0: 80050033 CR2: fffffffc CR3: 02c3b000 CR4: 00152e90
Fixes: 6ab8a4053f ("KVM: VMX: Avoid to rdmsrl(MSR_IA32_SYSENTER_ESP)")
Cc: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220122015211.1468758-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When we fail to expand inode from inline format to a normal format, we
restore inode to contain the original inline formatting but we forgot to
set i_lenAlloc back. The mismatch between i_lenAlloc and i_size was then
causing further problems such as warnings and lost data down the line.
Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
CC: stable@vger.kernel.org
Fixes: 7e49b6f248 ("udf: Convert UDF to new truncate calling sequence")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
udf_expand_file_adinicb() calls directly ->writepage to write data
expanded into a page. This however misses to setup inode for writeback
properly and so we can crash on inode->i_wb dereference when submitting
page for IO like:
BUG: kernel NULL pointer dereference, address: 0000000000000158
#PF: supervisor read access in kernel mode
...
<TASK>
__folio_start_writeback+0x2ac/0x350
__block_write_full_page+0x37d/0x490
udf_expand_file_adinicb+0x255/0x400 [udf]
udf_file_write_iter+0xbe/0x1b0 [udf]
new_sync_write+0x125/0x1c0
vfs_write+0x28e/0x400
Fix the problem by marking the page dirty and going through the standard
writeback path to write the page. Strictly speaking we would not even
have to write the page but we want to catch e.g. ENOSPC errors early.
Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
CC: stable@vger.kernel.org
Fixes: 52ebea749a ("writeback: make backing_dev_info host cgroup-specific bdi_writebacks")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
The tx_coalesce and mii_irq are not used at all now, so remove them.
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 49246466a9 ("fsnotify: move fsnotify_nameremove() hook out of
d_delete()") moved the fsnotify delete hook before d_delete() so fsnotify
will have access to a positive dentry.
This allowed a race where opening the deleted file via cached dentry
is now possible after receiving the IN_DELETE event.
To fix the regression in pseudo filesystems, convert d_delete() calls
to d_drop() (see commit 46c46f8df9 ("devpts_pty_kill(): don't bother
with d_delete()") and move the fsnotify hook after d_drop().
Add a missing fsnotify_unlink() hook in nfsdfs that was found during
the audit of fsnotify hooks in pseudo filesystems.
Note that the fsnotify hooks in simple_recursive_removal() follow
d_invalidate(), so they require no change.
Link: https://lore.kernel.org/r/20220120215305.282577-2-amir73il@gmail.com
Reported-by: Ivan Delalande <colona@arista.com>
Link: https://lore.kernel.org/linux-fsdevel/YeNyzoDM5hP5LtGW@visor/
Fixes: 49246466a9 ("fsnotify: move fsnotify_nameremove() hook out of d_delete()")
Cc: stable@vger.kernel.org # v5.3+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Apparently, there are some applications that use IN_DELETE event as an
invalidation mechanism and expect that if they try to open a file with
the name reported with the delete event, that it should not contain the
content of the deleted file.
Commit 49246466a9 ("fsnotify: move fsnotify_nameremove() hook out of
d_delete()") moved the fsnotify delete hook before d_delete() so fsnotify
will have access to a positive dentry.
This allowed a race where opening the deleted file via cached dentry
is now possible after receiving the IN_DELETE event.
To fix the regression, create a new hook fsnotify_delete() that takes
the unlinked inode as an argument and use a helper d_delete_notify() to
pin the inode, so we can pass it to fsnotify_delete() after d_delete().
Backporting hint: this regression is from v5.3. Although patch will
apply with only trivial conflicts to v5.4 and v5.10, it won't build,
because fsnotify_delete() implementation is different in each of those
versions (see fsnotify_link()).
A follow up patch will fix the fsnotify_unlink/rmdir() calls in pseudo
filesystem that do not need to call d_delete().
Link: https://lore.kernel.org/r/20220120215305.282577-1-amir73il@gmail.com
Reported-by: Ivan Delalande <colona@arista.com>
Link: https://lore.kernel.org/linux-fsdevel/YeNyzoDM5hP5LtGW@visor/
Fixes: 49246466a9 ("fsnotify: move fsnotify_nameremove() hook out of d_delete()")
Cc: stable@vger.kernel.org # v5.3+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Simplify code by using bitmap_weight() and bitmap_zero() instead of
hand-writing these functions.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When 'ping' changes to use PING socket instead of RAW socket by:
# sysctl -w net.ipv4.ping_group_range="0 100"
the selftests 'router_broadcast.sh' will fail, as such command
# ip vrf exec vrf-h1 ping -I veth0 198.51.100.255 -b
can't receive the response skb by the PING socket. It's caused by mismatch
of sk_bound_dev_if and dif in ping_rcv() when looking up the PING socket,
as dif is vrf-h1 if dif's master was set to vrf-h1.
This patch is to fix this regression by also checking the sk_bound_dev_if
against sdif so that the packets can stil be received even if the socket
is not bound to the vrf device but to the real iif.
Fixes: c319b4d76b ("net: ipv4: add IPPROTO_ICMP socket kind")
Reported-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If compiling the arm64 kernel with W=1 the following warning is produced:
| arch/arm64/kernel/vdso/vgettimeofday.c:9:5: error: no previous prototype for ‘__kernel_clock_gettime’ [-Werror=missing-prototypes]
| 9 | int __kernel_clock_gettime(clockid_t clock,
| | ^~~~~~~~~~~~~~~~~~~~~~
| arch/arm64/kernel/vdso/vgettimeofday.c:15:5: error: no previous prototype for ‘__kernel_gettimeofday’ [-Werror=missing-prototypes]
| 15 | int __kernel_gettimeofday(struct __kernel_old_timeval *tv,
| | ^~~~~~~~~~~~~~~~~~~~~
| arch/arm64/kernel/vdso/vgettimeofday.c:21:5: error: no previous prototype for ‘__kernel_clock_getres’ [-Werror=missing-prototypes]
| 21 | int __kernel_clock_getres(clockid_t clock_id,
| | ^~~~~~~~~~~~~~~~~~~~~
This patch removes "-Wmissing-prototypes" and "-Wmissing-declarations" compilers
flags from the compilation of vgettimeofday.c to make possible to build the
kernel with CONFIG_WERROR enabled.
Cc: Will Deacon <will@kernel.org>
Reported-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Tested-by: Marc Kleine-Budde <mkl@pengutronix.de>
Link: https://lore.kernel.org/r/20220121121234.47273-1-vincenzo.frascino@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
We encountered a crash in smc_setsockopt() and it is caused by
accessing smc->clcsock after clcsock was released.
BUG: kernel NULL pointer dereference, address: 0000000000000020
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 1 PID: 50309 Comm: nginx Kdump: loaded Tainted: G E 5.16.0-rc4+ #53
RIP: 0010:smc_setsockopt+0x59/0x280 [smc]
Call Trace:
<TASK>
__sys_setsockopt+0xfc/0x190
__x64_sys_setsockopt+0x20/0x30
do_syscall_64+0x34/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f16ba83918e
</TASK>
This patch tries to fix it by holding clcsock_release_lock and
checking whether clcsock has already been released before access.
In case that a crash of the same reason happens in smc_getsockopt()
or smc_switch_to_fallback(), this patch also checkes smc->clcsock
in them too. And the caller of smc_switch_to_fallback() will identify
whether fallback succeeds according to the return value.
Fixes: fd57770dd1 ("net/smc: wait for pending work before clcsock release_sock")
Link: https://lore.kernel.org/lkml/5dd7ffd1-28e2-24cc-9442-1defec27375e@linux.ibm.com/T/
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With previous bug fix, ->wait_capability flag is no longer needed and can
be removed.
Fixes: 249168ad07 ("ibmvnic: Make CRQ interrupt tasklet wait for all capabilities crqs")
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
Reviewed-by: Dany Madden <drt@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ibmvnic_tasklet() continuously spins waiting for responses to all
capability requests. It does this to avoid encountering an error
during initialization of the vnic. However if there is a bug in the
VIOS and we do not receive a response to one or more queries the
tasklet ends up spinning continuously leading to hard lock ups.
If we fail to receive a message from the VIOS it is reasonable to
timeout the login attempt rather than spin indefinitely in the tasklet.
Fixes: 249168ad07 ("ibmvnic: Make CRQ interrupt tasklet wait for all capabilities crqs")
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
Reviewed-by: Dany Madden <drt@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We use ->running_cap_crqs to determine when the ibmvnic_tasklet() should
send out the next protocol message type. i.e when we get back responses
to all our QUERY_CAPABILITY CRQs we send out REQUEST_CAPABILITY crqs.
Similiary, when we get responses to all the REQUEST_CAPABILITY crqs, we
send out the QUERY_IP_OFFLOAD CRQ.
We currently increment ->running_cap_crqs as we send out each CRQ and
have the ibmvnic_tasklet() send out the next message type, when this
running_cap_crqs count drops to 0.
This assumes that all the CRQs of the current type were sent out before
the count drops to 0. However it is possible that we send out say 6 CRQs,
get preempted and receive all the 6 responses before we send out the
remaining CRQs. This can result in ->running_cap_crqs count dropping to
zero before all messages of the current type were sent and we end up
sending the next protocol message too early.
Instead initialize the ->running_cap_crqs upfront so the tasklet will
only send the next protocol message after all responses are received.
Use the cap_reqs local variable to also detect any discrepancy (either
now or in future) in the number of capability requests we actually send.
Currently only send_query_cap() is affected by this behavior (of sending
next message early) since it is called from the worker thread (during
reset) and from application thread (during ->ndo_open()) and they can be
preempted. send_request_cap() is only called from the tasklet which
processes CRQ responses sequentially, is not be affected. But to
maintain the existing symmtery with send_query_capability() we update
send_request_capability() also.
Fixes: 249168ad07 ("ibmvnic: Make CRQ interrupt tasklet wait for all capabilities crqs")
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
Reviewed-by: Dany Madden <drt@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If auto-priority-failover (APF) is enabled and there are at least two
backing devices of different priorities, some resets like fail-over,
change-param etc can cause at least two back to back failovers. (Failover
from high priority backing device to lower priority one and then back
to the higher priority one if that is still functional).
Depending on the timimg of the two failovers it is possible to trigger
a "hard" reset and for the hard reset to fail due to failovers. When this
occurs, the driver assumes that the network is unstable and disables the
VNIC for a 60-second "settling time". This in turn can cause the ethtool
command to fail with "No such device" while the vnic automatically recovers
a little while later.
Given that it's possible to have two back to back failures, allow for extra
failures before disabling the vnic for the settling time.
Fixes: f15fde9d47 ("ibmvnic: delay next reset if hard reset fails")
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
Reviewed-by: Dany Madden <drt@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
During IP fragmentation we sanitize IP options. This means overwriting
options which should not be copied with NOPs. Only the first fragment
has the original, full options.
ip_fraglist_prepare() copies the IP header and options from previous
fragment to the next one. Commit 19c3401a91 ("net: ipv4: place control
buffer handling away from fragmentation iterators") moved sanitizing
options before ip_fraglist_prepare() which means options are sanitized
and then overwritten again with the old values.
Fixing this is not enough, however, nor did the sanitization work
prior to aforementioned commit.
ip_options_fragment() (which does the sanitization) uses ipcb->opt.optlen
for the length of the options. ipcb->opt of fragments is not populated
(it's 0), only the head skb has the state properly built. So even when
called at the right time ip_options_fragment() does nothing. This seems
to date back all the way to v2.5.44 when the fast path for pre-fragmented
skbs had been introduced. Prior to that ip_options_build() would have been
called for every fragment (in fact ever since v2.5.44 the fragmentation
handing in ip_options_build() has been dead code, I'll clean it up in
-next).
In the original patch (see Link) caixf mentions fixing the handling
for fragments other than the second one, but I'm not sure how _any_
fragment could have had their options sanitized with the code
as it stood.
Tested with python (MTU on lo lowered to 1000 to force fragmentation):
import socket
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.IPPROTO_IP, socket.IP_OPTIONS,
bytearray([7,4,5,192, 20|0x80,4,1,0]))
s.sendto(b'1'*2000, ('127.0.0.1', 1234))
Before:
IP (tos 0x0, ttl 64, id 1053, offset 0, flags [+], proto UDP (17), length 996, options (RR [bad length 4] [bad ptr 5] 192.148.4.1,,RA value 256))
localhost.36500 > localhost.search-agent: UDP, length 2000
IP (tos 0x0, ttl 64, id 1053, offset 968, flags [+], proto UDP (17), length 996, options (RR [bad length 4] [bad ptr 5] 192.148.4.1,,RA value 256))
localhost > localhost: udp
IP (tos 0x0, ttl 64, id 1053, offset 1936, flags [none], proto UDP (17), length 100, options (RR [bad length 4] [bad ptr 5] 192.148.4.1,,RA value 256))
localhost > localhost: udp
After:
IP (tos 0x0, ttl 96, id 42549, offset 0, flags [+], proto UDP (17), length 996, options (RR [bad length 4] [bad ptr 5] 192.148.4.1,,RA value 256))
localhost.51607 > localhost.search-agent: UDP, bad length 2000 > 960
IP (tos 0x0, ttl 96, id 42549, offset 968, flags [+], proto UDP (17), length 996, options (NOP,NOP,NOP,NOP,RA value 256))
localhost > localhost: udp
IP (tos 0x0, ttl 96, id 42549, offset 1936, flags [none], proto UDP (17), length 100, options (NOP,NOP,NOP,NOP,RA value 256))
localhost > localhost: udp
RA (20 | 0x80) is now copied as expected, RR (7) is "NOPed out".
Link: https://lore.kernel.org/netdev/20220107080559.122713-1-ooppublic@163.com/
Fixes: 19c3401a91 ("net: ipv4: place control buffer handling away from fragmentation iterators")
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: caixf <ooppublic@163.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>