code to use it (Yazen Ghannam)
- Remove a dead and unused TSEG region remapping workaround on AMD (Arvind Sankar)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XVlYACgkQEsHwGGHe
VUpxTA/9F0KsgSyTh66uX+aX5qkQ3WTBVgxbXGFrn5qPvwcALXabU8qObDWTSdwS
1YbiWDjKNBJX+dggWe/fcQgUZxu5DFkM4IKEW1V7MLJEcdfylcqCyc1YNpEI4ySn
ebw2Sy4/5iXGAvhz802/WoUU/o3A2uZwe0RFyodHGxof5027HkZhRHeYB27Htw+l
z0IsmiYOoPl/4mNuVgr/qieIFSw1SUE9kwjU8RvM6xVWmXWXpM68JHa9s+/51pFt
6BaOz485OyzWUCtSx3/++GEkU2d53bWYOuQ1zTLEiuaBfYC5n5T/kAcT4WJNK6Tf
tX7yrzmWm9ecykIxfkgMrhG57G38y2GMJcEg+dFQHeXC062fdHDg+oY6Ql2EkAm5
t5RIQ/cyOmQCLns31rHI/kwQ3RMKc/lfnL/z8lrlfWsC5o755yFJKttbfLJugbTo
3BO1fbs4xgQcgi0KoqXOUETrQtsOLtr9FJwvcArB94XXqcIPClE8Ir7n8T7FCuLr
9litSXIdn46EHwD6hD5QIk7y+Rxwk/jxZFys3eh90jcWDDZTaG2lz3if33RbZ1go
XBrS5X3HsMODGZlaMeUjrbFIz3e0Zyoo+RO/TX48w8nzivC6xSNxSNFgIZ1XTF5E
SLMGa6lEQ9mLiqRfgFjynNwSYOSlGv3euMkZaVPS3hnNmn+vZbI=
=RsCs
-----END PGP SIGNATURE-----
Merge tag 'x86_cpu_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpuid updates from Borislav Petkov:
"Only AMD-specific changes this time:
- Save the AMD physical die ID into cpuinfo_x86.cpu_die_id and
convert all code to use it (Yazen Ghannam)
- Remove a dead and unused TSEG region remapping workaround on AMD
(Arvind Sankar)"
* tag 'x86_cpu_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu/amd: Remove dead code for TSEG region remapping
x86/topology: Set cpu_die_id only if DIE_TYPE found
EDAC/mce_amd: Use struct cpuinfo_x86.cpu_die_id for AMD NodeId
x86/CPU/AMD: Remove amd_get_nb_id()
x86/CPU/AMD: Save AMD NodeId as cpu_die_id
The mv64x60 EDAC driver depends on CONFIG_MV64X60. But that symbol is
not user-selectable, and the last code that selected it was removed
with the C2K board support in 2018, see:
92c8c16f34 ("powerpc/embedded6xx: Remove C2K board support")
That means the driver is now dead code, so remove it.
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201207040253.628528-1-mpe@ellerman.id.au
In order to setup its PCI component, the driver needs any node private
instance in order to get a reference to the PCI device and hand that
into edac_pci_create_generic_ctl(). For convenience, it uses the 0th
memory controller descriptor under the assumption that if any, the 0th
will be always present.
However, this assumption goes wrong when the 0th node doesn't have
memory and the driver doesn't initialize an instance for it:
EDAC amd64: F17h detected (node 0).
...
EDAC amd64: Node 0: No DIMMs detected.
But looking up node instances is not really needed - all one needs is
the pointer to the proper device which gets discovered during instance
init.
So stash that pointer into a variable and use it when setting up the
EDAC PCI component.
Clear that variable when the driver needs to unwind due to some
instances failing init to avoid any registration imbalance.
Cc: <stable@vger.kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201122150815.13808-1-bp@alien8.de
The Sapphire Rapids CPU model shares the same memory controller
architecture with Ice Lake server. There are some configurations
different from Ice Lake server as below:
- The device ID for configuration agent.
- The size for per channel memory-mapped I/O.
- The DDR5 memory support.
So add the above configurations and the Sapphire Rapids CPU model
ID for EDAC support.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Add a new entry to 'enum mem_type' and a new string to
'edac_mem_types[]' for DDR5 new memory type.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Instead of raw access, use readl() to access MMIO registers of
memory controller to avoid possible compiler re-ordering.
Fixes: d4dc89d069 ("EDAC, i10nm: Add a driver for Intel 10nm server processors")
Cc: <stable@vger.kernel.org>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Add debugfs support to fake memory correctable errors to test the
error reporting path and the error address decoding logic in the
igen6_edac driver.
Please note that the fake errors are also reported to EDAC core and
then the CE counter in EDAC sysfs is also increased.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
This driver supports Intel client SoC with integrated memory controller
using In-Band ECC(IBECC). The memory correctable and uncorrectable errors
are reported via NMIs. The driver handles the NMIs and decodes the memory
error address to platform specific address. The first IBECC-supported SoC
is Elkhart Lake.
[Tony: s/#include <linux/nmi.h>/#include <asm/nmi.h>/ to fix randconfig build]
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
The edac_mce_amd module calls decode_dram_ecc() on AMD Family17h and
later systems. This function is used in amd64_edac_mod to do
system-specific decoding for DRAM ECC errors. The function takes a
"NodeId" as a parameter.
In AMD documentation, NodeId is used to identify a physical die in a
system. This can be used to identify a node in the AMD_NB code and also
it is used with umc_normaddr_to_sysaddr().
However, the input used for decode_dram_ecc() is currently the NUMA node
of a logical CPU. In the default configuration, the NUMA node and
physical die will be equivalent, so this doesn't have an impact.
But the NUMA node configuration can be adjusted with optional memory
interleaving modes. This will cause the NUMA node enumeration to not
match the physical die enumeration. The mismatch will cause the address
translation function to fail or report incorrect results.
Use struct cpuinfo_x86.cpu_die_id for the node_id parameter to ensure the
physical ID is used.
Fixes: fbe63acf62 ("EDAC, mce_amd: Use cpu_to_node() to find the node ID")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201109210659.754018-4-Yazen.Ghannam@amd.com
The Last Level Cache ID is returned by amd_get_nb_id(). In practice,
this value is the same as the AMD NodeId for callers of this function.
The NodeId is saved in struct cpuinfo_x86.cpu_die_id.
Replace calls to amd_get_nb_id() with the logical CPU's cpu_die_id and
remove the function.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201109210659.754018-3-Yazen.Ghannam@amd.com
Return the error value if the inject sysfs file creation fails, rather
than returning 0, to signal to the upper layer that the ->probe function
failed.
[ bp: Massage. ]
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Michal Simek <michal.simek@xilinx.com>
Link: https://lkml.kernel.org/r/20201116135810.3130845-1-zhangxiaoxu5@huawei.com
There are {Low-Power DDR3/4, WIO2} types of memory.
Add new entries to 'enum mem_type' and new strings to
'edac_mem_types[]' for the new types.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
They have been spreading around the subsystem by example so remove them
all.
Reported-by: Raymond Bennett <raymond.bennett@gmail.com>
Suggested-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
- Preliminary RISC-V enablement - the bulk of it will arrive via the RISCV tree.
- Relax decompressed image placement rules for 32-bit ARM
- Add support for passing MOK certificate table contents via a config table
rather than a EFI variable.
- Add support for 18 bit DIMM row IDs in the CPER records.
- Work around broken Dell firmware that passes the entire Boot#### variable
contents as the command line
- Add definition of the EFI_MEMORY_CPU_CRYPTO memory attribute so we can
identify it in the memory map listings.
- Don't abort the boot on arm64 if the EFI RNG protocol is available but
returns with an error
- Replace slashes with exclamation marks in efivarfs file names
- Split efi-pstore from the deprecated efivars sysfs code, so we can
disable the latter on !x86.
- Misc fixes, cleanups and updates.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl+Ec9QRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1inTQ//TYj3kJq/7sWfUAxmAsWnUEC005YCNf0T
x3kJQv3zYX4Rl4eEwkff8S1PrqqvwUP5yUZYApp8HD9s9CYvzz5iG5xtf/jX+QaV
06JnTMnkoycx2NaOlbr1cmcIn4/cAhQVYbVCeVrlf7QL8enNTBr5IIQmo4mgP8Lc
mauSsO1XU8ZuMQM+JcZSxAkAPxlhz3dbR5GteP4o2K4ShQKpiTCOfOG1J3FvUYba
s1HGnhHFlkQr6m3pC+iG8dnAG0YtwHMH1eJVP7mbeKUsMXz944U8OVXDWxtn81pH
/Xt/aFZXnoqwlSXythAr6vFTuEEn40n+qoOK6jhtcGPUeiAFPJgiaeAXw3gO0YBe
Y8nEgdGfdNOMih94McRd4M6gB/N3vdqAGt+vjiZSCtzE+nTWRyIXSGCXuDVpkvL4
VpEXpPINnt1FZZ3T/7dPro4X7pXALhODE+pl36RCbfHVBZKRfLV1Mc1prAUGXPxW
E0MfaM9TxDnVhs3VPWlHmRgavee2MT1Tl/ES4CrRHEoz8ZCcu4MfROQyao8+Gobr
VR+jVk+xbyDrykEc6jdAK4sDFXpTambuV624LiKkh6Mc4yfHRhPGrmP5c5l7SnCd
aLp+scQ4T7sqkLuYlXpausXE3h4sm5uur5hNIRpdlvnwZBXpDEpkzI8x0C9OYr0Q
kvFrreQWPLQ=
=ZNI8
-----END PGP SIGNATURE-----
Merge tag 'efi-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI changes from Ingo Molnar:
- Preliminary RISC-V enablement - the bulk of it will arrive via the
RISCV tree.
- Relax decompressed image placement rules for 32-bit ARM
- Add support for passing MOK certificate table contents via a config
table rather than a EFI variable.
- Add support for 18 bit DIMM row IDs in the CPER records.
- Work around broken Dell firmware that passes the entire Boot####
variable contents as the command line
- Add definition of the EFI_MEMORY_CPU_CRYPTO memory attribute so we
can identify it in the memory map listings.
- Don't abort the boot on arm64 if the EFI RNG protocol is available
but returns with an error
- Replace slashes with exclamation marks in efivarfs file names
- Split efi-pstore from the deprecated efivars sysfs code, so we can
disable the latter on !x86.
- Misc fixes, cleanups and updates.
* tag 'efi-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
efi: mokvar: add missing include of asm/early_ioremap.h
efi: efivars: limit availability to X86 builds
efi: remove some false dependencies on CONFIG_EFI_VARS
efi: gsmi: fix false dependency on CONFIG_EFI_VARS
efi: efivars: un-export efivars_sysfs_init()
efi: pstore: move workqueue handling out of efivars
efi: pstore: disentangle from deprecated efivars module
efi: mokvar-table: fix some issues in new code
efi/arm64: libstub: Deal gracefully with EFI_RNG_PROTOCOL failure
efivarfs: Replace invalid slashes with exclamation marks in dentries.
efi: Delete deprecated parameter comments
efi/libstub: Fix missing-prototypes in string.c
efi: Add definition of EFI_MEMORY_CPU_CRYPTO and ability to report it
cper,edac,efi: Memory Error Record: bank group/address and chip id
edac,ghes,cper: Add Row Extension to Memory Error Record
efi/x86: Add a quirk to support command line arguments on Dell EFI firmware
efi/libstub: Add efi_warn and *_once logging helpers
integrity: Load certs from the EFI MOK config table
integrity: Move import of MokListRT certs to a separate routine
efi: Support for MOK variable config table
...
encounter an MCE in kernel space but while copying from user memory by
sending them a SIGBUS on return to user space and umapping the faulty
memory, by Tony Luck and Youquan Song.
* memcpy_mcsafe() rework by splitting the functionality into
copy_mc_to_user() and copy_mc_to_kernel(). This, as a result, enables
support for new hardware which can recover from a machine check
encountered during a fast string copy and makes that the default and
lets the older hardware which does not support that advance recovery,
opt in to use the old, fragile, slow variant, by Dan Williams.
* New AMD hw enablement, by Yazen Ghannam and Akshay Gupta.
* Do not use MSR-tracing accessors in #MC context and flag any fault
while accessing MCA architectural MSRs as an architectural violation
with the hope that such hw/fw misdesigns are caught early during the hw
eval phase and they don't make it into production.
* Misc fixes, improvements and cleanups, as always.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl+EIpUACgkQEsHwGGHe
VUouoBAAgwb+NkWZtIqGImV4f+LOyFjhTR/r/7ZyiijXdbhOIuAdc/jQM31mQxug
sX2jxaRYnf1n6SLA0ggX99gwr2deRQ/hsNf5Abw55GC+Z1dOxpGL0k59A3ELl1IR
H9KYmCAFQIHvzfk38qcdND73XHcgthQoXFBOG9wAPAdgDWnaiWt6lcLAq8OiJTmp
D8pInAYhcnL8YXwMGyQQ1KkFn9HwydoWDsK5Ff2shaw2/+dMQqd1zetenbVtjhLb
iNYGvV7Bi/RQ8PyMbzmtTWa4kwQJAHC2gptkGxty//2ADGVBbqUQdqF9TjIWCNy5
V6Ldv5zo0/1s7DOzji3htzqkSs/K1Ea6d2LtZjejkJipHKV5x068UC6Fu+PlfS2D
VZfcICeapU4G2F3Zvks2DlZ7dVTbHCvoI78Qi7bBgczPUVmk6iqah4xuQaiHyBJc
kTFDA4Nnf/026GpoWRiFry9vqdnHBZyLet5A6Y+SoWF0FbhYnCVPpq4MnussYoav
lUIi9ZZav6X2RZp9DDM1f9d5xubtKq0DKt93wvzqAhjK0T2DikckJ+riOYkI6N8t
fHCBNUkdfgyMzJUTBPAzYQ7RmjbjKWJi7xWP0oz6+GqOJkQfSTVC5/2yEffbb3ya
whYRS6iklbl7yshzaOeecXsZcAeK2oGPfoHg34WkHFgXdF5mNgA=
=u1Wg
-----END PGP SIGNATURE-----
Merge tag 'ras_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS updates from Borislav Petkov:
- Extend the recovery from MCE in kernel space also to processes which
encounter an MCE in kernel space but while copying from user memory
by sending them a SIGBUS on return to user space and umapping the
faulty memory, by Tony Luck and Youquan Song.
- memcpy_mcsafe() rework by splitting the functionality into
copy_mc_to_user() and copy_mc_to_kernel(). This, as a result, enables
support for new hardware which can recover from a machine check
encountered during a fast string copy and makes that the default and
lets the older hardware which does not support that advance recovery,
opt in to use the old, fragile, slow variant, by Dan Williams.
- New AMD hw enablement, by Yazen Ghannam and Akshay Gupta.
- Do not use MSR-tracing accessors in #MC context and flag any fault
while accessing MCA architectural MSRs as an architectural violation
with the hope that such hw/fw misdesigns are caught early during the
hw eval phase and they don't make it into production.
- Misc fixes, improvements and cleanups, as always.
* tag 'ras_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Allow for copy_mc_fragile symbol checksum to be generated
x86/mce: Decode a kernel instruction to determine if it is copying from user
x86/mce: Recover from poison found while copying from user space
x86/mce: Avoid tail copy when machine check terminated a copy from user
x86/mce: Add _ASM_EXTABLE_CPY for copy user access
x86/mce: Provide method to find out the type of an exception handler
x86/mce: Pass pointer to saved pt_regs to severity calculation routines
x86/copy_mc: Introduce copy_mc_enhanced_fast_string()
x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()
x86/mce: Drop AMD-specific "DEFERRED" case from Intel severity rule list
x86/mce: Add Skylake quirk for patrol scrub reported errors
RAS/CEC: Convert to DEFINE_SHOW_ATTRIBUTE()
x86/mce: Annotate mce_rd/wrmsrl() with noinstr
x86/mce/dev-mcelog: Do not update kflags on AMD systems
x86/mce: Stop mce_reign() from re-computing severity for every CPU
x86/mce: Make mce_rdmsrl() panic on an inaccessible MSR
x86/mce: Increase maximum number of banks to 64
x86/mce: Delay clearing IA32_MCG_STATUS to the end of do_machine_check()
x86/MCE/AMD, EDAC/mce_amd: Remove struct smca_hwid.xec_bitmap
RAS/CEC: Fix cec_init() prototype
Shenhar.
* New AMD CPUs support, by Yazen Ghannam.
* The usual misc fixes and cleanups all over the subsystem.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl+EHJoACgkQEsHwGGHe
VUoZwQ//XrPtvR/ADQpFnh++WwcmbbbqA6yruBy8zyCVHveSG9dukl4pITO21CGQ
QGvOGdV3aEwDNHyrErqzWEK2o8RF28LYeTpRtaRDdH7ozk7Ua8Jqxoj7JSijz2Mv
iMh5Kn0v3+EcOMckbFBBjwf+yIPUw7/HwHDqsQ9K+F7drplnkRxU6mDyurYTTKJJ
Y00iOTp/PY7rZeHXmQAgbxvmDfv6/Xoo15ZKtzkHGtG7xW6y3n2NU2tegZizI7kA
LDx0g4lmZQZOsv2DR3ZpeLclsq9pYWCYga3P4+DJkVAAdJJ4qtk+XyKFMzyMhVOa
AyDJArSRXZsUFzbaMe1hlGknrSeXDTJHlZ7FjlpJHAudAohZlNCOEIS0VQzXwYjN
0FeR9rjinhaSOkMW1S3bBKssciCEs93a0oMASvHbzy4NGAM97YyRDZof1Iu3/jWX
20YRkSXyFhqSSRznJUh58kKk1f85B2ikySB4b3mZatPEiDJvF3I2KS/j3BFKoYE6
9JT+eP1+4FyEcobG6e8VE8BZw7+sw0A6eYisAbCLCV5oaQMdR946BEXZQKaFHDib
zKaveB4I/xbiLghYl61mMNSPtGD4czs+JR7/QEP5Q40VvwNJKjh1oHkt7hZQAgQQ
yv7JkstTpHQK/B8V6exEfBbCXQekItsVSrryPPW4gfmqncgyEVU=
=ROcp
-----END PGP SIGNATURE-----
Merge tag 'edac_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC updates from Borislav Petkov:
- Add Amazon's Annapurna Labs memory controller EDAC driver (Talel
Shenhar)
- New AMD CPUs support (Yazen Ghannam)
- The usual misc fixes and cleanups all over the subsystem
* tag 'edac_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/amd64: Set proper family type for Family 19h Models 20h-2Fh
EDAC/mc_sysfs: Add missing newlines when printing {max,dimm}_location
EDAC/aspeed: Use module_platform_driver() to simplify
EDAC, sb_edac: Simplify switch statement
EDAC/ti: Fix handling of platform_get_irq() error
EDAC/aspeed: Fix handling of platform_get_irq() error
EDAC/i5100: Fix error handling order in i5100_init_one()
EDAC/highbank: Handover Calxeda Highbank maintenance to Andre Przywara
EDAC/socfpga: Transfer SoCFPGA EDAC maintainership
EDAC/thunderx: Make symbol lmc_dfs_ents static
EDAC/al-mc-edac: Add Amazon's Annapurna Labs Memory Controller driver
dt-bindings: EDAC: Add Amazon's Annapurna Labs Memory Controller binding
EDAC/mce_amd: Add new error descriptions for existing types
EDAC: Replace HTTP links with HTTPS ones
AMD Family 19h Models 20h-2Fh use the same PCI IDs as Family 17h Models
70h-7Fh. The same family ops and number of channels also apply.
Use the Family17h Model 70h family_type and ops for Family 19h Models
20h-2Fh. Update the controller name to match the system.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201009171803.3214354-1-Yazen.Ghannam@amd.com
Reading those sysfs entries gives:
[root@localhost /]# cat /sys/devices/system/edac/mc/mc0/max_location
memory 3 [root@localhost /]# cat /sys/devices/system/edac/mc/mc0/dimm0/dimm_location
memory 0 [root@localhost /]#
Add newlines after the value it prints for better readability.
[ bp: Make len a signed int and change the check to catch wraparound.
Increment the pointer p only when the length check passes. Use
scnprintf(). ]
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/1600051734-8993-1-git-send-email-wangxiongfeng2@huawei.com
Use module_platform_driver() which makes the code simpler by eliminating
boilerplate code.
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Link: https://lkml.kernel.org/r/20200914065358.3726216-1-liushixin2@huawei.com
Updates to the UEFI 2.8 Memory Error Record allow splitting the bank field
into bank address and bank group, and using the last 3 bits of the extended
field as a chip identifier.
When needed, print correct version of bank field, bank group, and chip
identification.
Based on UEFI 2.8 Table 299. Memory Error Record.
Signed-off-by: Alex Kluver <alex.kluver@hpe.com>
Reviewed-by: Russ Anderson <russ.anderson@hpe.com>
Reviewed-by: Kyle Meyer <kyle.meyer@hpe.com>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Acked-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20200819143544.155096-3-alex.kluver@hpe.com
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Memory errors could be printed with incorrect row values since the DIMM
size has outgrown the 16 bit row field in the CPER structure. UEFI
Specification Version 2.8 has increased the size of row by allowing it to
use the first 2 bits from a previously reserved space within the structure.
When needed, add the extension bits to the row value printed.
Based on UEFI 2.8 Table 299. Memory Error Record
Signed-off-by: Alex Kluver <alex.kluver@hpe.com>
Tested-by: Russ Anderson <russ.anderson@hpe.com>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Reviewed-by: Kyle Meyer <kyle.meyer@hpe.com>
Acked-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20200819143544.155096-2-alex.kluver@hpe.com
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Commit
b972fdba86 ("EDAC/ghes: Fix NULL pointer dereference in ghes_edac_register()")
didn't clear all the information from the scanned system and, more
specifically, left ghes_hw.num_dimms to its previous value. On a
second load (CONFIG_DEBUG_TEST_DRIVER_REMOVE=y), the driver would use
the leftover num_dimms value which is not 0 and thus the 0 check in
enumerate_dimms() will get bypassed and it would go directly to the
pointer deref:
d = &hw->dimms[hw->num_dimms];
which is, of course, NULL:
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: 0002 [#1] PREEMPT SMP
CPU: 7 PID: 1 Comm: swapper/0 Not tainted 5.9.0-rc4+ #7
Hardware name: GIGABYTE MZ01-CE1-00/MZ01-CE1-00, BIOS F02 08/29/2018
RIP: 0010:enumerate_dimms.cold+0x7b/0x375
Reset the whole ghes_hw on driver unregister so that no stale values are
used on a second system scan.
Fixes: b972fdba86 ("EDAC/ghes: Fix NULL pointer dereference in ghes_edac_register()")
Cc: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200911164817.GA19320@zn.tnic
clang static analyzer reports this problem
sb_edac.c:959:2: warning: Undefined or garbage value
returned to caller
return type;
^~~~~~~~~~~
This is a false positive.
However by initializing the type to DEV_UNKNOWN the 3 case can be
removed from the switch, saving a comparison and jump.
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20200907153225.7294-1-trix@redhat.com
platform_get_irq() returns a negative error number on error. In such a
case, comparison to 0 would pass the check therefore check the return
value properly, whether it is negative.
[ bp: Massage commit message. ]
Fixes: 86a18ee21e ("EDAC, ti: Add support for TI keystone and DRA7xx EDAC")
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tero Kristo <t-kristo@ti.com>
Link: https://lkml.kernel.org/r/20200827070743.26628-2-krzk@kernel.org
platform_get_irq() returns a negative error number on error. In such a
case, comparison to 0 would pass the check therefore check the return
value properly, whether it is negative.
[ bp: Massage commit message. ]
Fixes: 9b7e6242ee ("EDAC, aspeed: Add an Aspeed AST2500 EDAC driver")
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Stefan Schaeckeler <schaecsn@gmx.net>
Link: https://lkml.kernel.org/r/20200827070743.26628-1-krzk@kernel.org
When pci_get_device_func() fails, the driver doesn't need to execute
pci_dev_put(). mci should still be freed, though, to prevent a memory
leak. When pci_enable_device() fails, the error injection PCI device
"einj" doesn't need to be disabled either.
[ bp: Massage commit message, rename label to "bail_mc_free". ]
Fixes: 52608ba205 ("i5100_edac: probe for device 19 function 0")
Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200826121437.31606-1-dinghao.liu@zju.edu.cn
a subsequent load can probe the system properly; by Shiju Jose.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl9LTFoACgkQEsHwGGHe
VUpAxQ//YXxtNRrGxTQa6TGcHE0wRif/EkA20O54zWGPt+kolW9mXmsUFweDi0PH
U6EW1To9H7vukfynCxc+TFGNG8oh3B6XXhaMILsNi691m418Z52a+wOHGaf/qLpX
fqJe3WIIvo97cFtJJJvdHW90NaswtIo0TUspYqUiyThjwI3a9XXmY6Vh7odtVQaP
M/tYt90By60KAR57zROifo6binSOO45zCOJiEO6Qm8X9UphXhShv6oL4TQvlEEwf
ydWSz5NZwPnnVPk4MATMIQoptrixNaDXAfmh2kLBuNNmeNLCHKGous9C7ErI+Pqo
ZH5oECXGOvJELRL3CJ/6fgb9lKaUg9z0IILsRhmoSqj6AYNcDAreWyi4wjJTaRTs
rGNzYGHgB1tlWIpFxHBoxl3u/0EaRatkfvHYnY/i9LlmbIcNQgS+vBxV8LNyCRnk
uxzJDIc8adPx+4RwmzQrnjwN6A52d50JfhVXakQykTWxTaKFg2qkEDUgpEJ4bMCr
i77H1ytr+nGuR4BL9mferTfZqSWZULiQkeDV+skKAx8sSryOrKloucDXJYUzWFDU
6MfKi4iqfNH+w0LQxDE4cRGfOID1tEz7WWI1ROflI17sD2zW+/WkgYoHj7nD2Vwy
xm5DD91Jst4rdDd53PqJhiB3sz/V9VGlDL7o+BsYzYagFP9pyfc=
=WMYq
-----END PGP SIGNATURE-----
Merge tag 'edac_urgent_for_v5.9_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC fix from Borislav Petkov:
"A fix to properly clear ghes_edac driver state on driver remove so
that a subsequent load can probe the system properly (Shiju Jose)"
* tag 'edac_urgent_for_v5.9_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/ghes: Fix NULL pointer dereference in ghes_edac_register()
After
b9cae27728 ("EDAC/ghes: Scan the system once on driver init")
and with CONFIG_DEBUG_TEST_DRIVER_REMOVE enabled, ghes_hw.dimms becomes
a NULL pointer after the second ->probe() (aka ghes_edac_register())
which the config option causes to be called.
This happens because the static variable which holds down whether
the system has been scanned already, doesn't get reset in
ghes_edac_unregister(). Then, on the second probe, ghes_scan_system()
doesn't get to enumerate the DIMMs, leading to ghes_hw.dimms remaining
NULL.
Clear the variable and rename it to something more descriptive so that a
second probe succeeds.
[ bp: Rewrite commit message. ]
Fixes: b9cae27728 ("EDAC/ghes: Scan the system once on driver init")
Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200827140450.1620-1-shiju.jose@huawei.com
The Extended Error Code Bitmap (xec_bitmap) for a Scalable MCA bank type
was intended to be used by the kernel to filter out invalid error codes
on a system. However, this is unnecessary after a few product releases
because the hardware will only report valid error codes. Thus, there's
no need for it with future systems.
Remove the xec_bitmap field and all references to it.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200720145353.43924-1-Yazen.Ghannam@amd.com
IA32_MCG_STATUS.RIPV indicates whether the return RIP value pushed onto
the stack as part of machine check delivery is valid or not.
Various drivers copied a code fragment that uses the RIPV bit to
determine the severity of the error as either HW_EVENT_ERR_UNCORRECTED
or HW_EVENT_ERR_FATAL, but this check is reversed (marking errors where
RIPV is set as "FATAL").
Reverse the tests so that the error is marked fatal when RIPV is not set.
Reported-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200707194324.14884-1-tony.luck@intel.com
Symbol 'lmc_dfs_ents' is not used outside of thunderx_edac.c, so
make it static:
drivers/edac/thunderx_edac.c:457:22: warning:
symbol 'lmc_dfs_ents' was not declared. Should it be static?
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Robert Richter <rrichter@marvell.com>
Link: https://lkml.kernel.org/r/20200714142308.46612-1-weiyongjun1@huawei.com
The Amazon's Annapurna Labs Memory Controller EDAC supports ECC capability
for error detection and correction (Single bit error correction, Double
detection). This driver introduces EDAC driver for that capability.
[ bp: Remove "EDAC" string from Kconfig tristate as it is redundant. ]
Signed-off-by: Talel Shenhar <talel@amazon.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: James Morse <james.morse@arm.com>
Link: https://lkml.kernel.org/r/20200816185551.19108-3-talel@amazon.com
A few existing MCA bank types will have new error types in future SMCA
systems.
Add the descriptions for the new error types.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200708153515.1911642-1-Yazen.Ghannam@amd.com
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.
Deterministic algorithm:
For each file:
If not .svg:
For each line:
If doesn't contain `\bxmlns\b`:
For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`:
If both the HTTP and HTTPS versions
return 200 OK and serve the same content:
Replace HTTP with HTTPS.
[ bp: Merge all EDAC patches into a single one. ]
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Tero Kristo <t-kristo@ti.com> # ti_edac
Link: https://lkml.kernel.org/r/20200708113546.14135-1-grandmaster@al2klimov.de
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQQW3WBGcnu5yJnSXn0kTJLX0iGMLAUCXzcrQRQcdG9ueS5sdWNr
QGludGVsLmNvbQAKCRAkTJLX0iGMLHeUAP9mEdU4W+7STNaXuzJ/5PyWbDIHPB3m
X/KXBlPRyQ5lXgD6A+IRTY3Ob7fBjs9i+sTuH5yMY6oi9YneC4lt1uqmmQc=
=fg/o
-----END PGP SIGNATURE-----
Merge tag 'edac_updates_for_5.9_pt2' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull edac fix from Tony Luck:
"Fix for the ie31200 driver that missed the first pull"
* tag 'edac_updates_for_5.9_pt2' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/ie31200: Fallback if host bridge device is already initialized
The Intel uncore driver may claim some of the pci ids from ie31200 which
means that the ie31200 edac driver will not initialize them as part of
pci_register_driver().
Let's add a fallback for this case to 'pci_get_device()' to get a
reference on the device such that it can still be configured. This is
similar in approach to other edac drivers.
Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: linux-edac <linux-edac@vger.kernel.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/1594923911-10885-1-git-send-email-jbaron@akamai.com
e370f886fe ("EDAC: Remove edac_get_dimm_by_index()")
b9cae27728 ("EDAC/ghes: Scan the system once on driver init")
b001694d60 ("EDAC/ghes: Remove unused members of struct ghes_edac_pvt, rename it to ghes_pvt")
cb51a371d0 ("EDAC/ghes: Setup DIMM label from DMI and use it in error reports")
8807e15597 ("EDAC, {skx,i10nm}: Use CPU stepping macro to pass configurations")
e9ff6636d3 ("EDAC/mc: Call edac_inc_ue_error() before panic")
30bf38e434 ("EDAC, pnd2: Set MCE_PRIO_EDAC priority for pnd2_mce_dec notifier")
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQQW3WBGcnu5yJnSXn0kTJLX0iGMLAUCXyNSJxQcdG9ueS5sdWNr
QGludGVsLmNvbQAKCRAkTJLX0iGMLLX+AP0RHxHv0kYI8EScLpdHFnoGAKdKJ8Zb
UmTTWqz4GwdoFgD/VIQ/RueIfsV+Hlqj+U+84+NJwPEcbJxY6WLo9ZHUkgc=
=aFwG
-----END PGP SIGNATURE-----
Merge tag 'edac_updates_for_5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC updates from Tony Luck:
"Boris is on vacation and aske me to send you the EDAC changes"
* tag 'edac_updates_for_5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC: Fix reference count leaks
EDAC: Remove edac_get_dimm_by_index()
EDAC/ghes: Scan the system once on driver init
EDAC/ghes: Remove unused members of struct ghes_edac_pvt, rename it to ghes_pvt
EDAC/ghes: Setup DIMM label from DMI and use it in error reports
EDAC, {skx,i10nm}: Use CPU stepping macro to pass configurations
EDAC/mc: Call edac_inc_ue_error() before panic
EDAC, pnd2: Set MCE_PRIO_EDAC priority for pnd2_mce_dec notifier
- Print the PPIN field on CPUs that fill them out
- Fix an MCE injection bug
- Simplify a kzalloc in dev_mcelog_init_device()
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl8oZ+ERHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iIVQ//XLP3qky/X/DOcOhdqw7JEDUpElK+sLdb
nFLcSP0vgdoip0jjz0voEf2tb0nRbZRqjnCIuHj5Dq8fTWbM/qcD0zM35N5lk5bJ
N+y9cE+FQg2oayU/0O3IezoT0lAW3YzdE2HdzhZ2ZEiDKx3fAj9QoJbKlc4TxshC
Tg93IpMq5dmnj6Ge3BfNE56O4WCLIQxs2MCIWHWjBEU8qb4Nry7vEL9hijmFQNRd
85i9pDrOkdnLDKhTagHUR4HbhkAacWeJKe0UoraJFr8/1nCm8Hcii5hI1t5N9KXg
/+asjkhLX75uYy2aMmwF7qelSLNbOieMNji7rt42vevBsqrAtX0zuSvbEBb8bNdx
f4WS5GcmUTZVHTuppPzAX3YNmYdrGqmecerAY+1j+uHtWFGs07ZCnKx+kQbiK0m4
LqaQmi16KsV/Jj9BYZu8wLd/uIAsRqnY5nJlbe7of0/kFPzlJkRkT8Z3DPhSp/ee
qHc3eR1EFMQvNr8F1e17TPCTcq5YjLM9BXFX4JJppaX30YOb6ZH0z5jYMbouJeKD
8qtQ1BtawWKqkoi/0PtbH6khRFXH6oUNWZ+2aAKQqWwfJDdP8Q7fFyIvgN487s79
u7fN4h8qj7w1AeIRJQH5v1G/Es14gdBcotgkM2AflQLF+c7QormnHdCJZad4gTAj
/X8Ks5DE51w=
=z0MD
-----END PGP SIGNATURE-----
Merge tag 'ras-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RAS updates from Ingo Molnar:
"Boris is on vacation and he asked us to send you the pending RAS bits:
- Print the PPIN field on CPUs that fill them out
- Fix an MCE injection bug
- Simplify a kzalloc in dev_mcelog_init_device()"
* tag 'ras-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce, EDAC/mce_amd: Print PPIN in machine check records
x86/mce/dev-mcelog: Use struct_size() helper in kzalloc()
x86/mce/inject: Fix a wrong assignment of i_mce.status
Commit:
da92110dfd ("EDAC, amd64_edac: Extend scrub rate support to F15hM60h")
added support for F15h, model 0x60 CPUs but in doing so, missed to read
back SCRCTRL PCI config register on F15h CPUs which are *not* model
0x60. Add that read so that doing
$ cat /sys/devices/system/edac/mc/mc0/sdram_scrub_rate
can show the previously set DRAM scrub rate.
Fixes: da92110dfd ("EDAC, amd64_edac: Extend scrub rate support to F15hM60h")
Reported-by: Anders Andersson <pipatron@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org> #v4.4..
Link: https://lkml.kernel.org/r/CAKkunMbNWppx_i6xSdDHLseA2QQmGJqj_crY=NF-GZML5np4Vw@mail.gmail.com
When kobject_init_and_add() returns an error, it should be handled
because kobject_init_and_add() takes a reference even when it fails. If
this function returns an error, kobject_put() must be called to properly
clean up the memory associated with the object.
Therefore, replace calling kfree() and call kobject_put() and add a
missing kobject_put() in the edac_device_register_sysfs_main_kobj()
error path.
[ bp: Massage and merge into a single patch. ]
Fixes: b2ed215a33 ("Kobject: change drivers/edac to use kobject_init_and_add")
Signed-off-by: Qiushi Wu <wu000273@umn.edu>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200528202238.18078-1-wu000273@umn.edu
Link: https://lkml.kernel.org/r/20200528203526.20908-1-wu000273@umn.edu
Change the hardware scanning and figuring out how many DIMMs a machine
has to a single, one-time thing which happens once on driver init. After
that scanning completes, struct ghes_hw_desc contains a representation
of the hardware which the driver can then use for later initialization.
Then, copy the DIMM information into the respective EDAC core
representation of those.
Get rid of ghes_edac_dimm_fill and use a struct dimm_info array
directly.
This way, hw detection and further driver initialization is nicely
and logically split. Further additions should all be added to
ghes_scan_system() and the hw representation extended as needed.
There should be no functionality change resulting from this patch.
Signed-off-by: Borislav Petkov <bp@suse.de>
The struct members list and ghes of struct ghes_edac_pvt are unused,
remove them. On that occasion, rename it to the shorter name struct
ghes_pvt.
Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200519104443.15673-2-rrichter@marvell.com
The ghes driver reports errors with 'unknown label' even if the actual
DIMM label is known, e.g.:
EDAC MC0: 1 CE Single-bit ECC on unknown label (node:0 card:0
module:0 rank:1 bank:0 col:13 bit_pos:16 DIMM location:N0 DIMM_A0
page:0x966a9b3 offset:0x0 grain:1 syndrome:0x0 - APEI location:
node:0 card:0 module:0 rank:1 bank:0 col:13 bit_pos:16 DIMM
location:N0 DIMM_A0 status(0x0000000000000400): Storage error in
DRAM memory)
Fix this by using struct dimm_info's label string in error reports:
EDAC MC0: 1 CE Single-bit ECC on N0 DIMM_A0 (node:0 card:0 module:0
rank:1 bank:515 col:14 bit_pos:16 DIMM location:N0 DIMM_A0
page:0x99223d8 offset:0x0 grain:1 syndrome:0x0 - APEI location:
node:0 card:0 module:0 rank:1 bank:515 col:14 bit_pos:16 DIMM
location:N0 DIMM_A0 status(0x0000000000000400): Storage error in
DRAM memory)
The labels are initialized by reading the bank and device strings
from DMI. Now, the label information can also read from sysfs. E.g. a
ThunderX2 system will show the following:
/sys/devices/system/edac/mc/mc0/dimm0/dimm_label:N0 DIMM_A0
/sys/devices/system/edac/mc/mc0/dimm1/dimm_label:N0 DIMM_B0
/sys/devices/system/edac/mc/mc0/dimm2/dimm_label:N0 DIMM_C0
/sys/devices/system/edac/mc/mc0/dimm3/dimm_label:N0 DIMM_D0
/sys/devices/system/edac/mc/mc0/dimm4/dimm_label:N0 DIMM_E0
/sys/devices/system/edac/mc/mc0/dimm5/dimm_label:N0 DIMM_F0
/sys/devices/system/edac/mc/mc0/dimm6/dimm_label:N0 DIMM_G0
/sys/devices/system/edac/mc/mc0/dimm7/dimm_label:N0 DIMM_H0
/sys/devices/system/edac/mc/mc0/dimm8/dimm_label:N1 DIMM_I0
/sys/devices/system/edac/mc/mc0/dimm9/dimm_label:N1 DIMM_J0
/sys/devices/system/edac/mc/mc0/dimm10/dimm_label:N1 DIMM_K0
/sys/devices/system/edac/mc/mc0/dimm11/dimm_label:N1 DIMM_L0
/sys/devices/system/edac/mc/mc0/dimm12/dimm_label:N1 DIMM_M0
/sys/devices/system/edac/mc/mc0/dimm13/dimm_label:N1 DIMM_N0
/sys/devices/system/edac/mc/mc0/dimm14/dimm_label:N1 DIMM_O0
/sys/devices/system/edac/mc/mc0/dimm15/dimm_label:N1 DIMM_P0
Since dimm_labels can be rewritten, that label will be used in a later
error report:
# echo foobar >/sys/devices/system/edac/mc/mc0/dimm0/dimm_label
# # some error injection here
# dmesg | grep foobar
[ 751.383533] EDAC MC0: 1 CE Single-bit ECC on foobar (node:0 card:0
module:0 rank:1 bank:259 col:3 bit_pos:16 DIMM location:N0 DIMM_A0
page:0x8c8dc74 offset:0x0 grain:1 syndrome:0x0 - APEI location:
node:0 card:0 module:0 rank:1 bank:259 col:3 bit_pos:16 DIMM
location:N0 DIMM_A0 status(0x0000000000000400): Storage error in DRAM
memory)
[ bp: Remove curly brackets around a single if-statement in dimm_setup_label(). ]
Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200528101307.23245-1-rrichter@marvell.com
Use the X86_MATCH_INTEL_FAM6_MODEL_STEPPINGS() macro to pass CPU
stepping specific configurations to {skx,i10nm}_init(), so can delete
the CPU stepping check from 10nm_init().
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20200509010822.76331-1-qiuxu.zhuo@intel.com
- fix build rules in binderfs sample
- fix build errors when Kbuild recurses to the top Makefile
- covert '---help---' in Kconfig to 'help'
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAl7lBuYVHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsGHvIP/3iErjPshpg/phwH8NTCS4SFkiti
BZRM+2lupSn7Qs53BTpVzIkXoHBJQZlJxlQ5HY8ScO+fiz28rKZr+b40us+je1Q+
SkvSPfwZzxjEg7lAZutznG4KgItJLWJKmDyh9T8Y8TAuG4f8WO0hKnXoAp3YorS2
zppEIxso8O5spZPjp+fF/fPbxPjIsabGK7Jp2LpSVFR5pVDHI/ycTlKQS+MFpMEx
6JIpdFRw7TkvKew1dr5uAWT5btWHatEqjSR3JeyVHv3EICTGQwHmcHK67cJzGInK
T51+DT7/CpKtmRgGMiTEu/INfMzzoQAKl6Fcu+vMaShTN97Hk9DpdtQyvA6P/h3L
8GA4UBct05J7fjjIB7iUD+GYQ0EZbaFujzRXLYk+dQqEJRbhcCwvdzggGp0WvGRs
1f8/AIpgnQv8JSL/bOMgGMS5uL2dSLsgbzTdr6RzWf1jlYdI1i4u7AZ/nBrwWP+Z
iOBkKsVceEoJrTbaynl3eoYqFLtWyDau+//oBc2gUvmhn8ioM5dfqBRiJjxJnPG9
/giRj6xRIqMMEw8Gg8PCG7WebfWxWyaIQwlWBbPok7DwISURK5mvOyakZL+Q25/y
6MBr2H8NEJsf35q0GTINpfZnot7NX4JXrrndJH8NIRC7HEhwd29S041xlQJdP0rs
E76xsOr3hrAmBu4P
=1NIT
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull more Kbuild updates from Masahiro Yamada:
- fix build rules in binderfs sample
- fix build errors when Kbuild recurses to the top Makefile
- covert '---help---' in Kconfig to 'help'
* tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
treewide: replace '---help---' in Kconfig files with 'help'
kbuild: fix broken builds because of GZIP,BZIP2,LZOP variables
samples: binderfs: really compile this sample and fix build issues
Since commit 84af7a6194 ("checkpatch: kconfig: prefer 'help' over
'---help---'"), the number of '---help---' has been gradually
decreasing, but there are still more than 2400 instances.
This commit finishes the conversion. While I touched the lines,
I also fixed the indentation.
There are a variety of indentation styles found.
a) 4 spaces + '---help---'
b) 7 spaces + '---help---'
c) 8 spaces + '---help---'
d) 1 space + 1 tab + '---help---'
e) 1 tab + '---help---' (correct indentation)
f) 1 tab + 1 space + '---help---'
g) 1 tab + 2 spaces + '---help---'
In order to convert all of them to 1 tab + 'help', I ran the
following commend:
$ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
to fixup conflicts in arch/x86/kernel/cpu/mce/core.c so MCE specific follow
up patches can be applied without creating a horrible merge conflict
afterwards.
The variable ret is being assigned with a value that is never read
and it is being updated later with a new value. The initialization is
redundant so remove it.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200429154847.287001-1-colin.king@canonical.com
The skx_edac driver wrongly uses the mtr register to retrieve two fields
close_pg and bank_xor_enable. Fix it by using the correct mcmtr register
to get the two fields.
Cc: <stable@vger.kernel.org>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Reported-by: Matthew Riley <mattdr@google.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20200515210146.1337-1-tony.luck@intel.com
The i10nm_edac driver failed to load on Ice Lake and Tremont/Jacobsville
servers if their CPU stepping >= 4 and failed on Ice Lake-D servers from
stepping 0. The root cause was that for Ice Lake and Tremont/Jacobsville
servers with CPU stepping >=4, the offset for bus number configuration
register was updated from 0xcc to 0xd0. For Ice Lake-D servers, all the
steppings use the updated 0xd0 offset.
Fix the issue by using the appropriate offset for bus number
configuration register according to the CPU model number and stepping.
Reported-by: Jerry Chen <jerry.t.chen@intel.com>
Reported-and-tested-by: Jin Wen <wen.jin@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/linux-edac/20200427084022.GC11036@zn.tnic
The device ID for configuration agent PCI device and the offset for
bus number configuration register can be CPU model specific. So add
a new structure res_config to make them configurable and pass res_config
to {skx,i10nm}_init() and skx_get_all_bus_mappings() for use.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20200427083246.GB11036@zn.tnic
Fix the following gcc warning:
drivers/edac/amd8131_edac.c:47:21: warning: ‘bridge_str’ defined but not
used [-Wunused-const-variable=]
static char * const bridge_str[] = {
^~~~~~~~~~
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Robert Richter <rrichter@marvell.com>
Link: https://lkml.kernel.org/r/20200415085006.6732-1-yanaijie@huawei.com
When acpi_extlog was added, we were worried that the same error would
be reported more than once by different subsystems. But in the ensuing
years I've seen complaints that people could not find an error log
(because this mechanism suppressed the log they were looking for).
Rip it all out. People are smart enough to notice the same address from
different reporting mechanisms.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200214222720.13168-8-tony.luck@intel.com
If the handler took any action to log or deal with the error, set a bit
in mce->kflags so that the default handler on the end of the machine
check chain can see what has been done.
Get rid of NOTIFY_STOP returns. Make the EDAC and dev-mcelog handlers
skip over errors already processed by CEC.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200214222720.13168-5-tony.luck@intel.com
... because no one should be interested in spurious MCEs anyway. Make
the filtering unconditional and move it to amd_filter_mce().
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200407163414.18058-2-bp@alien8.de
Fix the following gcc warning:
drivers/edac/xgene_edac.c:1486:7: warning: variable ‘address’ set but
not used [-Wunused-but-set-variable]
u32 address;
^~~~~~~
Remove the unused macro RBERRADDR_RD while at it.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200409093259.20069-1-yanaijie@huawei.com
Fix spelling (s/Aramda/Armada/) in a log message and in a comment. While
at it, add a trailing '\n' in messages.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jan Luebbe <jlu@pengutronix.de>
Link: https://lkml.kernel.org/r/20200413041556.3514-1-christophe.jaillet@wanadoo.fr
Pull perf updates from Ingo Molnar:
"The main changes in this cycle were:
Kernel side changes:
- A couple of x86/cpu cleanups and changes were grandfathered in due
to patch dependencies. These clean up the set of CPU model/family
matching macros with a consistent namespace and C99 initializer
style.
- A bunch of updates to various low level PMU drivers:
* AMD Family 19h L3 uncore PMU
* Intel Tiger Lake uncore support
* misc fixes to LBR TOS sampling
- optprobe fixes
- perf/cgroup: optimize cgroup event sched-in processing
- misc cleanups and fixes
Tooling side changes are to:
- perf {annotate,expr,record,report,stat,test}
- perl scripting
- libapi, libperf and libtraceevent
- vendor events on Intel and S390, ARM cs-etm
- Intel PT updates
- Documentation changes and updates to core facilities
- misc cleanups, fixes and other enhancements"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (89 commits)
cpufreq/intel_pstate: Fix wrong macro conversion
x86/cpu: Cleanup the now unused CPU match macros
hwrng: via_rng: Convert to new X86 CPU match macros
crypto: Convert to new CPU match macros
ASoC: Intel: Convert to new X86 CPU match macros
powercap/intel_rapl: Convert to new X86 CPU match macros
PCI: intel-mid: Convert to new X86 CPU match macros
mmc: sdhci-acpi: Convert to new X86 CPU match macros
intel_idle: Convert to new X86 CPU match macros
extcon: axp288: Convert to new X86 CPU match macros
thermal: Convert to new X86 CPU match macros
hwmon: Convert to new X86 CPU match macros
platform/x86: Convert to new CPU match macros
EDAC: Convert to new X86 CPU match macros
cpufreq: Convert to new X86 CPU match macros
ACPI: Convert to new X86 CPU match macros
x86/platform: Convert to new CPU match macros
x86/kernel: Convert to new CPU match macros
x86/kvm: Convert to new CPU match macros
x86/perf/events: Convert to new CPU match macros
...
The new macro set has a consistent namespace and uses C99 initializers
instead of the grufty C89 ones.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200320131509.673579000@linutronix.de
Since snprintf() returns the would-be-output size instead of the actual
output size, the succeeding calls may go beyond the given buffer limit.
Fix it by replacing with scnprintf().
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jan Luebbe <jlu@pengutronix.de>
Link: https://lkml.kernel.org/r/20200311071728.4541-1-tiwai@suse.de
On the ZynqMP platform, zynqmp_get_error_info() is used to read out
error information. In this function, the pinf->col parameter is not
used (it is only used by the Zynq platform's zynq_get_error_info()). So
there's no need to print pinf->col on ZynqMP.
In order to differentiate on which platform handle_error() is executed,
use DDR_ECC_INTR_SUPPORT as the check condition to distinguish between
Zynq and ZynqMP platforms.
[ bp: Massage. ]
Fixes: b500b4a029 ("EDAC, synopsys: Add ECC support for ZynqMP DDR controller")
Signed-off-by: Sherry Sun <sherry.sun@nxp.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Manish Narani <manish.narani@xilinx.com>
Link: https://lkml.kernel.org/r/1584365679-27443-1-git-send-email-sherry.sun@nxp.com
handle_error() currently calls snprintf() a couple of times in
succession to output the message for a CE/UE, therefore overwriting each
part of the message which was formatted with the previous snprintf()
call. As a result, only the part of the message from the last snprintf()
call will be printed.
The simplest and most effective way to fix this problem is to combine
the whole string into one which to supply to a single snprintf() call.
[ bp: Massage. ]
Fixes: b500b4a029 ("EDAC, synopsys: Add ECC support for ZynqMP DDR controller")
Signed-off-by: Sherry Sun <sherry.sun@nxp.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: James Morse <james.morse@arm.com>
Cc: Manish Narani <manish.narani@xilinx.com>
Link: https://lkml.kernel.org/r/1582792452-32575-1-git-send-email-sherry.sun@nxp.com
The driver supports error detection and correction on devices with an
ARM DMC-520 memory controller.
Signed-off-by: Lei Wang <leiwang_git@outlook.com>
Signed-off-by: Shiping Ji <shiping.linux@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: James Morse <james.morse@arm.com>
Link: https://lkml.kernel.org/r/83b48c70-dc06-d0d4-cae9-a2187fca628b@gmail.com
This warning is output for every virtual CPU in a guest on an EPYC 2
system because kvm doesn't enable SMCA. Once is enough too.
[ bp: Massage. ]
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200217134627.19765-1-prarit@redhat.com
Looking at how mci->{ue,ce}_per_layer[EDAC_MAX_LAYERS] is used, it
turns out that only the leaves in the memory hierarchy are consumed
(in sysfs), but not the intermediate layers, e.g.:
count = dimm->mci->ce_per_layer[dimm->mci->n_layers-1][dimm->idx];
These unused counters only add complexity, remove them. The error
counter values are directly stored in struct dimm_info now.
Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Link: https://lkml.kernel.org/r/20200123090210.26933-11-rrichter@marvell.com
The error descriptor is passed to the error reporting functions, so
the error details can be directly generated there. Move string
generation from edac_raw_mc_handle_error() to edac_ce_error() and
edac_ue_error(). The intermediate detail[] string can be removed then.
Also, cleanup the string generation by switching to a single variant
only using the ternary operator.
[ bp: put ternary operators on a separate line for better readability
and use the short-form "inline if" in edac_mc_handle_error(). ]
Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Link: https://lkml.kernel.org/r/20200123090210.26933-10-rrichter@marvell.com
Most arguments of error reporting functions are already stored in the
struct edac_raw_error_desc error descriptor. Pass the error descriptor
to the functions and reduce the functions' argument list.
[ bp: Sort function args in reverse fir tree order. ]
Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Link: https://lkml.kernel.org/r/20200123090210.26933-9-rrichter@marvell.com
Many functions carry the enable_per_layer_report argument. This is a
bool value indicating the error information contains some location
data where the error occurred. This can easily being determined by
checking the pos[] array for values. Negative values indicate there is
no location available. So if the top layer is negative, the error
location is unknown.
Just check if the top layer is negative and remove
enable_per_layer_report as function argument and also from struct
edac_raw_error_desc.
[ bp: Reflow comments to 80 columns, while at it. ]
Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Link: https://lkml.kernel.org/r/20200123090210.26933-8-rrichter@marvell.com
There is a limitation to report only EDAC_MAX_LABELS in e->label of
the error descriptor. This is to prevent a potential string overflow.
The current implementation falls back to "any memory" in this case and
also stops all further processing to find a unique row and channel of
the possible error location.
Reporting "any memory" is wrong as the memory controller reported an
error location for one of the layers. Instead, report "unknown memory"
and also do not break early in the loop to further check row and channel
for uniqueness.
[ bp: Massage commit message. ]
Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Link: https://lkml.kernel.org/r/20200123090210.26933-7-rrichter@marvell.com
Carve out the error_count increment into a separate function
edac_inc_csrow(). This better separates code and reduces the indentation
level.
Implementation note: The function edac_inc_csrow() counts the same
as before, ->ce_count is only incremented if row >= 0. This is esp.
true for the case of (!e->enable_per_layer_report). Here, a DIMM was
not found, variable row still has a value of -1 and ->ce_count is not
incremented.
[ bp: Massage commit message. ]
Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Link: https://lkml.kernel.org/r/20200214141757.8976-1-rrichter@marvell.com
Each struct mci has its own error descriptor. Create a function
error_desc_to_mci() to determine the corresponding mci from an
error descriptor. This removes @mci from the parameter list of
edac_raw_mc_handle_error() as the mci pointer does not need to be passed
any longer.
[ bp: Massage commit message. ]
Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Link: https://lkml.kernel.org/r/20200123090210.26933-5-rrichter@marvell.com
Store the error type in struct edac_raw_error_desc. This makes the
type parameter of edac_raw_mc_handle_error() obsolete.
[ kernel-doc typo ]
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Link: https://lkml.kernel.org/r/20200123090210.26933-4-rrichter@marvell.com
Reorder the new created functions edac_mc_alloc_csrows() and
edac_mc_alloc_dimms() and move them before edac_mc_alloc(). No further
code changes.
Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Link: https://lkml.kernel.org/r/20200123090210.26933-3-rrichter@marvell.com
edac_mc_alloc() is huge. Factor out code by moving it to the two new
functions edac_mc_alloc_csrows() and edac_mc_alloc_dimms(). Do not
move code yet for better review.
[ bp: sort local args in reversed fir tree order. ]
Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Link: https://lkml.kernel.org/r/20200123090210.26933-2-rrichter@marvell.com
There are dimm and csrow devices linked to the mci device esp. to show
up in sysfs. It must be granted that children devices are removed before
its mci parent. Thus, the release functions must be called in the
correct order and may not miss any child before releasing its parent. In
the current implementation this is only granted by the correct order of
release functions.
A much better approach is to use put_device() that releases the device
only after all users are gone. It is the recommended way to release a
device and free its memory. The function uses the device's refcount and
only frees it if there are no users of it anymore such as children.
So implement a mci_release() function to remove mci devices, use
put_device() to free them and early initialize the mci device right
after its struct has been allocated.
Change the release function so that it can be universally used no
matter if the device is registered or not. Since subsequent dimm
and csrow sysfs links are implemented as children devices, their
refcounts will keep the parent mci device from being removed as long
as sysfs entries exist and until all users have been unregistered in
edac_remove_sysfs_mci_device().
Remove edac_unregister_sysfs() and merge mci sysfs removal into
edac_remove_sysfs_mci_device(). There is only a single instance now that
removes the sysfs entries. The function can now be used in the error
paths for cleanup.
Also, create device release functions for all involved devices
(dev->release), remove device_type release functions (dev_type->
release) and also use dev->init_name instead of dev_set_name().
[ bp: Massage commit message and comments. ]
Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Link: https://lkml.kernel.org/r/20200212120340.4764-5-rrichter@marvell.com
All created csrow objects must be removed in the error path of
edac_create_csrow_objects(). The objects have been added as devices.
They need to be removed by doing a device_del() *and* put_device() call
to also free their memory. The missing put_device() leaves a memory
leak. Use device_unregister() instead of device_del() which properly
unregisters the device doing both.
Fixes: 7adc05d2dc ("EDAC/sysfs: Drop device references properly")
Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: John Garry <john.garry@huawei.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200212120340.4764-4-rrichter@marvell.com
A test kernel with the options DEBUG_TEST_DRIVER_REMOVE, KASAN and
DEBUG_KMEMLEAK set, revealed several issues when removing an mci device:
1) Use-after-free:
On 27.11.19 17:07:33, John Garry wrote:
> [ 22.104498] BUG: KASAN: use-after-free in
> edac_remove_sysfs_mci_device+0x148/0x180
The use-after-free is caused by the mci_for_each_dimm() macro called in
edac_remove_sysfs_mci_device(). The iterator was introduced with
c498afaf7d ("EDAC: Introduce an mci_for_each_dimm() iterator").
The iterator loop calls device_unregister(&dimm->dev), which removes
the sysfs entry of the device, but also frees the dimm struct in
dimm_attr_release(). When incrementing the loop in mci_for_each_dimm(),
the dimm struct is accessed again, after having been freed already.
The fix is to free all the mci device's subsequent dimm and csrow
objects at a later point, in _edac_mc_free(), when the mci device itself
is being freed.
This keeps the data structures intact and the mci device can be
fully used until its removal. The change allows the safe usage of
mci_for_each_dimm() to release dimm devices from sysfs.
2) Memory leaks:
Following memory leaks have been detected:
# grep edac /sys/kernel/debug/kmemleak | sort | uniq -c
1 [<000000003c0f58f9>] edac_mc_alloc+0x3bc/0x9d0 # mci->csrows
16 [<00000000bb932dc0>] edac_mc_alloc+0x49c/0x9d0 # csr->channels
16 [<00000000e2734dba>] edac_mc_alloc+0x518/0x9d0 # csr->channels[chn]
1 [<00000000eb040168>] edac_mc_alloc+0x5c8/0x9d0 # mci->dimms
34 [<00000000ef737c29>] ghes_edac_register+0x1c8/0x3f8 # see edac_mc_alloc()
All leaks are from memory allocated by edac_mc_alloc().
Note: The test above shows that edac_mc_alloc() was called here from
ghes_edac_register(), thus both functions show up in the stack trace
but the module causing the leaks is edac_mc. The comments with the data
structures involved were made manually by analyzing the objdump.
The data structures listed above and created by edac_mc_alloc() are
not properly removed during device removal, which is done in
edac_mc_free().
There are two paths implemented to remove the device depending on device
registration, _edac_mc_free() is called if the device is not registered
and edac_unregister_sysfs() otherwise.
The implemenations differ. For the sysfs case, the mci device removal
lacks the removal of subsequent data structures (csrows, channels,
dimms). This causes the memory leaks (see mci_attr_release()).
[ bp: Massage commit message. ]
Fixes: c498afaf7d ("EDAC: Introduce an mci_for_each_dimm() iterator")
Fixes: faa2ad09c0 ("edac_mc: edac_mc_free() cannot assume mem_ctl_info is registered in sysfs.")
Fixes: 7a623c0390 ("edac: rewrite the sysfs code to use struct device")
Reported-by: John Garry <john.garry@huawei.com>
Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: John Garry <john.garry@huawei.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200212120340.4764-3-rrichter@marvell.com
- remove ioremap_nocache given that is is equivalent to
ioremap everywhere
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl4vKHwLHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYMPGBAAuVNUZaZfWYHpiVP2oRcUQUguFiD3NTbknsyzV2oH
J9P0GfeENSKwE9OOhZ7XIjnCZAJwQgTK/ppQY5yiQ/KAtYyyXjXEJ6jqqjiTDInr
+3+I3t/LhkgrK7tMrb7ylTGa/d7KhaciljnOXC8+b75iddvM9I1z2pbHDbppZMS9
wT4RXL/cFtRb85AfOyPLybcka3f5P2gGvQz38qyimhJYEzHDXZu9VO1Bd20f8+Xf
eLBKX0o6yWMhcaPLma8tm0M0zaXHEfLHUKLSOkiOk+eHTWBZ3b/w5nsOQZYZ7uQp
25yaClbameAn7k5dHajduLGEJv//ZjLRWcN3HJWJ5vzO111aHhswpE7JgTZJSVWI
ggCVkytD3ESXapvswmACSeCIDMmiJMzvn6JvwuSMVB7a6e5mcqTuGo/FN+DrBF/R
IP+/gY/T7zIIOaljhQVkiEIIwiD/akYo0V9fheHTBnqcKEDTHV4WjKbeF6aCwcO+
b8inHyXZSKSMG//UlDuN84/KH/o1l62oKaB1uDIYrrL8JVyjAxctWt3GOt5KgSFq
wVz1lMw4kIvWtC/Sy2H4oB+RtODLp6yJDqmvmPkeJwKDUcd/1JKf0KsZ8j3FpGei
/rEkBEss0KBKyFAgBSRO2jIpdj2epgcBcsdB/r5mlhcn8L77AS6mHbA173kY4pQ/
Kdg=
=TUCJ
-----END PGP SIGNATURE-----
Merge tag 'ioremap-5.6' of git://git.infradead.org/users/hch/ioremap
Pull ioremap updates from Christoph Hellwig:
"Remove the ioremap_nocache API (plus wrappers) that are always
identical to ioremap"
* tag 'ioremap-5.6' of git://git.infradead.org/users/hch/ioremap:
remove ioremap_nocache and devm_ioremap_nocache
MIPS: define ioremap_nocache to ioremap
Pull RAS updates from Borislav Petkov:
- Misc fixes to the MCE code all over the place, by Jan H. Schönherr.
- Initial support for AMD F19h and other cleanups to amd64_edac, by
Yazen Ghannam.
- Other small cleanups.
* 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
EDAC/mce_amd: Make fam_ops static global
EDAC/amd64: Drop some family checks for newer systems
EDAC/amd64: Add family ops for Family 19h Models 00h-0Fh
x86/amd_nb: Add Family 19h PCI IDs
EDAC/mce_amd: Always load on SMCA systems
x86/MCE/AMD, EDAC/mce_amd: Add new Load Store unit McaType
x86/mce: Fix use of uninitialized MCE message string
x86/mce: Fix mce=nobootlog
x86/mce: Take action on UCNA/Deferred errors again
x86/mce: Remove mce_inject_log() in favor of mce_log()
x86/mce: Pass MCE message to mce_panic() on failed kernel recovery
x86/mce/therm_throt: Mark throttle_active_work() as __maybe_unused
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl4uvKUACgkQEsHwGGHe
VUoG5RAAoSdq9LvBPWTtJUMckxZ124hAqvtDOqvAbCnde3V2yICZLfOuGZSuWBy0
+ED2Ebwnf6mxzB2vEEI1OQ0+LmmRXvc+UbI2z7SnC/4sLZnS7h9wDrxN41nR0dyB
cMx0JB0ClyMW71WJo4Jbp9B+fiB4rBu/gmrXc0kC3L6ShmYvWzGdrPIV31qXPPYn
zjqOn+tBENi7J+8omVsFg/KtAK9BTEk7FLuAz2mevLJzZGwycQBk/sigDHTuXJC2
AACPmaQAqZgOYEOWID+F1USgD7x3yfYEZUBQWokqmDJDkN6agvcScQWoYh2PGobR
wxJhGZR+Uq3NU0BsdoKlI4OCI/Hf+WYBAhpSdlu1/Wle3Wq4wGc91wisFtnpXU6s
2VfiUtpf9WTROUeyHvhXcKgnjTwmRxSOBmaIhSYSL/aQ34AJaMmaeS8Vj1vpamrp
1dotRIQKmjI8yDRs4JBfsJX3KtMjJ+u1qiz7SfwqBv1dqypSamTkvjpbPhItUrAz
Hv9QHWQAOmCLBqWw3Jx2W9gwd7Z1gYmpvo9mqyUy81eF/BDNeJTi1zRkNL2jqElU
gu6gaOheYr7UumA9LDsZUqh3cGgGii73pp2Fc1A2QvL9HeD++UVoXOlwjo8VEs1b
+u+DRDKNy3OwgCLMGfK0zuWfvTRqgV+OnlPnBH2/2PnYpE9tawU=
=fR3M
-----END PGP SIGNATURE-----
Merge tag 'edac_for_5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC updates from Borislav Petkov:
"A totally boring branch this time around: a garden variety of small
fixes all over the place"
* tag 'edac_for_5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/amd64: Do not warn when removing instances
EDAC/sifive: Fix return value check in ecc_register()
EDAC/aspeed: Remove unneeded semicolon
EDAC: remove set but not used variable 'ecc_loc'
EDAC: skx_common: downgrade message importance on missing PCI device
EDAC/Kconfig: Fix Kconfig indentation
On machines which do not populate all nodes with DIMMs, the driver
doesn't initialize an instance there. However, the instance removal
remove_one_instance() path will warn unconditionally, which is wrong.
Remove the WARN_ON() even if the warning is innocent because it causes a
splat in dmesg.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200117115939.5524-1-bp@alien8.de
In case of error, the function edac_device_alloc_ctl_info() returns a
NULL pointer, not ERR_PTR(). Replace the IS_ERR() test in the return
value check with a NULL test.
Fixes: 91abaeaaff ("EDAC/sifive: Add EDAC platform driver for SiFive SoCs")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200115150303.112627-1-weiyongjun1@huawei.com