amd_cache_northbridges() is exported by amd_nb.c and is called by
amd64-agp.c and amd64_edac.c modules at module_init() time so that NB
descriptors are properly cached before those drivers can use them.
However, the init_amd_nbs() initcall already does call
amd_cache_northbridges() unconditionally and thus makes sure the NB
descriptors are enumerated.
That initcall is a fs_initcall type which is on the 5th group (starting
from 0) of initcalls that gets run in increasing numerical order by the
init code.
The module_init() call is turned into an __initcall() in the MODULE=n
case and those are device-level initcalls, i.e., group 6.
Therefore, the northbridges caching is already finished by the time
module initialization starts and thus the correct initialization order
is retained.
Unexport amd_cache_northbridges(), update dependent modules to
call amd_nb_num() instead. While at it, simplify the checks in
amd_cache_northbridges().
[ bp: Heavily massage and *actually* explain why the change is ok. ]
Signed-off-by: Muralidhara M K <muralimk@amd.com>
Signed-off-by: Naveen Krishna Chatradhi <nchatrad@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220324122729.221765-1-nchatrad@amd.com
Introduce a "family flags" bitmask that can be used to indicate any
special behavior needed on a per-family basis.
Add a flag to indicate a system uses the new register offsets introduced
with Family 19h Model 10h.
Use this flag to account for register offset changes, a new bitfield
indicating DDR5 use on a memory controller, and to set the proper number
of chip select masks.
Rework f17_addr_mask_to_cs_size() to properly handle the change in chip
select masks. And update code comments to reflect the updated Chip
Select, DIMM, and Mask relationships.
[uninitialized variable warning]
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: William Roche <william.roche@oracle.com>
Link: https://lore.kernel.org/r/20220202144307.2678405-3-yazen.ghannam@amd.com
Current AMD systems allow mixing of DIMM types within a system. However,
DIMMs within a channel, i.e. managed by a single Unified Memory
Controller (UMC), must be of the same type.
Handle this possible configuration by checking and setting the memory
type for each individual "UMC" structure.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: William Roche <william.roche@oracle.com>
Link: https://lore.kernel.org/r/20220202144307.2678405-2-yazen.ghannam@amd.com
- Add support for DRR5 and new models 0x10-0x1f and 0x50-0x5f of AMD
family 0x19 CPUs to amd64_edac
- The usual set of fixes and cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmHb/PwACgkQEsHwGGHe
VUqhbhAAo0mRNnBF3CJn1zlXRgmqrvV1IPnJQNp+z5iaXY1vr0qRMgO4OcgsJrxF
nxrx/fdAYlQQO4vz1iq4t4j+eazOQyM/JZ0DKi4e+Dw2mC0axdCx8a0pyl1g2de6
oQ5GkplRKUFn+3bTJpHIE5QnCOD7S85Mrp1F3Soa6jD9i+HwQIqAoltNMcCP7Yei
ibhWUBX2H/oYcHARecIkP/YEyzSEHhcX6LRjNILW5haZQ6GziQUFzKUUwpUS3hsz
9i6hXnHXEPhOq8JyoyWWhvVDywFK9z8lh57G7DFfZIhAk1FjuLDP2iI270D/LkYF
shq6+M8ST9yqwOMV3Iaoa8VZFf/fjTyV0E0L2p2+faxaJ66rqdzbagLIZQv6hDNe
N1/LD72/Io4et1kEbbaHm5jpxzSJ0jQwu1o+rY1/NmKsWhzE6V4X0GnDTZzZwP9b
CbFJAWdCD+fi3WQjzv8HLVepjIsV+R3VVTOLq2oodn/mtoK0DRU/vTeCCwRS9ntF
IyF55L/jSqy9CtP119KBnItGo4b84UrJDozXizGtc6Zt3chz7ljSpa2gJrKF+fCR
Yhyr9Pt+vYQBpnIMDu1BPcoE58pwZYKoOSO00COUHHsLn+u8qhetGmQzYwWHCq7J
hz8HHZnlTdFseZTp5tavk3B4md5HiPu+A7GevK3YH3lBv/vEmDM=
=85ZE
-----END PGP SIGNATURE-----
Merge tag 'edac_updates_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC updates from Borislav Petkov:
- Add support for version 3 of the Synopsys DDR controller to
synopsys_edac
- Add support for DRR5 and new models 0x10-0x1f and 0x50-0x5f of AMD
family 0x19 CPUs to amd64_edac
- The usual set of fixes and cleanups
* tag 'edac_updates_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/amd64: Add support for family 19h, models 50h-5fh
EDAC/sb_edac: Remove redundant initialization of variable rc
RAS/CEC: Remove a repeated 'an' in a comment
EDAC/amd64: Add support for AMD Family 19h Models 10h-1Fh and A0h-AFh
EDAC: Add RDDR5 and LRDDR5 memory types
EDAC/sifive: Fix non-kernel-doc comment
dt-bindings: memory: Add entry for version 3.80a
EDAC/synopsys: Enable the driver on Intel's N5X platform
EDAC/synopsys: Add support for version 3 of the Synopsys EDAC DDR
EDAC/synopsys: Use the quirk for version instead of ddr version
Add the new family 19h models 50h-5fh PCI IDs (device 18h functions 0
and 6) to support Ryzen 5000 APUs ("Cezanne").
Signed-off-by: Marc Bevand <m@zorinaq.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20211221233112.556927-1-m@zorinaq.com
Add a new family type for AMD Family 19h Models 10h to 1Fh. Use this new
family type for Models A0h to AFh also.
Increase the maximum number of controllers from 8 to 12.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208174356.1997855-3-yazen.ghannam@amd.com
Define an address translation context struct. This will hold values that
will be passed between multiple functions.
Save return address, Node ID, and the Instance ID number to start.
Currently, the UMC number is used as the Instance ID, but future DF
versions may use another value.
Also include a "tmp" field to use when reading registers. This is to
avoid having to define a temporary variable in multiple functions.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211028175728.121452-5-yazen.ghannam@amd.com
The DF Indirect Access method allows for "Broadcast" accesses in which
case no specific instance is targeted. Add support using a reserved
instance ID of 0xFF to indicate a broadcast access. Set the FICAA
register appropriately.
Define helpers functions for instance and broadcast reads and use them
where appropriate.
Drop the "amd_" prefix since these functions are all static.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211028175728.121452-4-yazen.ghannam@amd.com
AMD Rome systems and later support interleaving between three identical
ranks within a channel.
Check for this mode by counting the number of enabled chip selects and
comparing their masks. If there are exactly three enabled chip selects
and their masks are identical, then three rank interleaving is enabled.
The size of a rank is determined from its mask value. However, three
rank interleaving doesn't follow the method of swapping an interleave
bit with the most significant bit. Rather, the interleave bit is flipped
and the most significant bit remains the same. There is only a single
interleave bit in this case.
Account for this when determining the chip select size by keeping the
most significant bit at its original value and ignoring any zero bits.
This will return a full bitmask in [MSB:1].
Fixes: e53a3b267f ("EDAC/amd64: Find Chip Select memory size using Address Mask")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211005154419.2060504-1-yazen.ghannam@amd.com
Instead of "open coding" DEVICE_ATTR, use the corresponding
helper macros DEVICE_ATTR_{RW,RO,WO} in amd64_edac.c
Some function names needed to be changed to match the device
conventions <foo>_show and <foo>_store, but the functionality
itself is unchanged.
The devices using EDAC_DCT_ATTR_SHOW() are left unchanged.
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Dwaipayan Ray <dwaipayanray1@gmail.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210713065130.2151-1-dwaipayanray1@gmail.com
The SYSCFG MSR continued being updated beyond the K8 family; drop the K8
name from it.
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Joerg Roedel <jroedel@suse.de>
Link: https://lkml.kernel.org/r/20210427111636.1207-4-brijesh.singh@amd.com
amd64_edac was converted to CPU family autoprobing (from PCI device
IDs) to not have to add a new PCI device ID each time a new platform is
shipped but to support the whole family out-of-the-box.
However, this caused a lot of noise in dmesg even when the machine
doesn't have ECC DIMMs or ECC has been disabled in the BIOS:
EDAC MC: Ver: 3.0.0
EDAC amd64: F17h detected (node 0).
EDAC amd64: Node 0: DRAM ECC disabled.
EDAC amd64: F17h detected (node 1).
EDAC amd64: Node 1: DRAM ECC disabled.
EDAC amd64: F17h detected (node 2).
EDAC amd64: Node 2: DRAM ECC disabled.
EDAC amd64: F17h detected (node 3).
EDAC amd64: Node 3: DRAM ECC disabled.
EDAC amd64: F17h detected (node 4).
EDAC amd64: Node 4: DRAM ECC disabled.
EDAC amd64: F17h detected (node 5).
EDAC amd64: Node 5: DRAM ECC disabled.
EDAC amd64: F17h detected (node 6).
EDAC amd64: Node 6: DRAM ECC disabled.
EDAC amd64: F17h detected (node 7).
EDAC amd64: Node 7: DRAM ECC disabled.
or even
$ grep EDAC dmesg.log | sed 's/\[.*\] //' | sort | uniq -c
128 EDAC amd64: F17h detected (node 0).
128 EDAC amd64: Node 0: DRAM ECC disabled.
1 EDAC MC: Ver: 3.0.0
on a big machine. Yap, that's once per CPU for 128 of them.
So move the init messages after all probing has succeeded to avoid
unnecessary spew in dmesg.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210119164141.17417-1-bp@alien8.de
Families up to and including 0x16 allow access to the injection
hardware. Starting with family 0x17, access to those registers is
blocked by security policy.
Limit that only on the families which support it.
Suggested-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201222180013.GD13463@zn.tnic
Merge them into the main driver and put them inside an EDAC_DEBUG
ifdeffery to simplify the driver and have all debugging/injection stuff
behind a debug build-time switch.
No functional changes.
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lkml.kernel.org/r/20201215110517.5215-2-bp@alien8.de
There's no need for them to be in a separate file so merge them into the
main driver compilation unit like the other EDAC drivers do.
Drop now-unneeded function export, make the function static and shorten
static function names.
No functional changes.
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lkml.kernel.org/r/20201215110517.5215-1-bp@alien8.de
Give these messages a debug severity as they are really only useful to
the module developers.
Also, drop the "(broken BIOS?)" phrase, since this can cause churn for
BIOS folks. The PCI IDs needed by the module, at least on modern systems,
are fixed in hardware.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201215170131.8496-1-Yazen.Ghannam@amd.com
Those were only laptops and are very very unlikely to have ECC memory.
Currently, when the driver attempts to load, it issues:
EDAC amd64: Error: F1 not found: device 0x1601 (broken BIOS?)
because the PCI device is the wrong one (it uses the F15h default one).
So do not load the driver on them as that is pointless.
Reported-by: Don Curtis <bugrprt21882@online.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Don Curtis <bugrprt21882@online.de>
Link: http://bugzilla.opensuse.org/show_bug.cgi?id=1179763
Link: https://lkml.kernel.org/r/20201218160622.20146-1-bp@alien8.de
code to use it (Yazen Ghannam)
- Remove a dead and unused TSEG region remapping workaround on AMD (Arvind Sankar)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XVlYACgkQEsHwGGHe
VUpxTA/9F0KsgSyTh66uX+aX5qkQ3WTBVgxbXGFrn5qPvwcALXabU8qObDWTSdwS
1YbiWDjKNBJX+dggWe/fcQgUZxu5DFkM4IKEW1V7MLJEcdfylcqCyc1YNpEI4ySn
ebw2Sy4/5iXGAvhz802/WoUU/o3A2uZwe0RFyodHGxof5027HkZhRHeYB27Htw+l
z0IsmiYOoPl/4mNuVgr/qieIFSw1SUE9kwjU8RvM6xVWmXWXpM68JHa9s+/51pFt
6BaOz485OyzWUCtSx3/++GEkU2d53bWYOuQ1zTLEiuaBfYC5n5T/kAcT4WJNK6Tf
tX7yrzmWm9ecykIxfkgMrhG57G38y2GMJcEg+dFQHeXC062fdHDg+oY6Ql2EkAm5
t5RIQ/cyOmQCLns31rHI/kwQ3RMKc/lfnL/z8lrlfWsC5o755yFJKttbfLJugbTo
3BO1fbs4xgQcgi0KoqXOUETrQtsOLtr9FJwvcArB94XXqcIPClE8Ir7n8T7FCuLr
9litSXIdn46EHwD6hD5QIk7y+Rxwk/jxZFys3eh90jcWDDZTaG2lz3if33RbZ1go
XBrS5X3HsMODGZlaMeUjrbFIz3e0Zyoo+RO/TX48w8nzivC6xSNxSNFgIZ1XTF5E
SLMGa6lEQ9mLiqRfgFjynNwSYOSlGv3euMkZaVPS3hnNmn+vZbI=
=RsCs
-----END PGP SIGNATURE-----
Merge tag 'x86_cpu_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpuid updates from Borislav Petkov:
"Only AMD-specific changes this time:
- Save the AMD physical die ID into cpuinfo_x86.cpu_die_id and
convert all code to use it (Yazen Ghannam)
- Remove a dead and unused TSEG region remapping workaround on AMD
(Arvind Sankar)"
* tag 'x86_cpu_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu/amd: Remove dead code for TSEG region remapping
x86/topology: Set cpu_die_id only if DIE_TYPE found
EDAC/mce_amd: Use struct cpuinfo_x86.cpu_die_id for AMD NodeId
x86/CPU/AMD: Remove amd_get_nb_id()
x86/CPU/AMD: Save AMD NodeId as cpu_die_id
In order to setup its PCI component, the driver needs any node private
instance in order to get a reference to the PCI device and hand that
into edac_pci_create_generic_ctl(). For convenience, it uses the 0th
memory controller descriptor under the assumption that if any, the 0th
will be always present.
However, this assumption goes wrong when the 0th node doesn't have
memory and the driver doesn't initialize an instance for it:
EDAC amd64: F17h detected (node 0).
...
EDAC amd64: Node 0: No DIMMs detected.
But looking up node instances is not really needed - all one needs is
the pointer to the proper device which gets discovered during instance
init.
So stash that pointer into a variable and use it when setting up the
EDAC PCI component.
Clear that variable when the driver needs to unwind due to some
instances failing init to avoid any registration imbalance.
Cc: <stable@vger.kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201122150815.13808-1-bp@alien8.de
The Last Level Cache ID is returned by amd_get_nb_id(). In practice,
this value is the same as the AMD NodeId for callers of this function.
The NodeId is saved in struct cpuinfo_x86.cpu_die_id.
Replace calls to amd_get_nb_id() with the logical CPU's cpu_die_id and
remove the function.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201109210659.754018-3-Yazen.Ghannam@amd.com
Shenhar.
* New AMD CPUs support, by Yazen Ghannam.
* The usual misc fixes and cleanups all over the subsystem.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl+EHJoACgkQEsHwGGHe
VUoZwQ//XrPtvR/ADQpFnh++WwcmbbbqA6yruBy8zyCVHveSG9dukl4pITO21CGQ
QGvOGdV3aEwDNHyrErqzWEK2o8RF28LYeTpRtaRDdH7ozk7Ua8Jqxoj7JSijz2Mv
iMh5Kn0v3+EcOMckbFBBjwf+yIPUw7/HwHDqsQ9K+F7drplnkRxU6mDyurYTTKJJ
Y00iOTp/PY7rZeHXmQAgbxvmDfv6/Xoo15ZKtzkHGtG7xW6y3n2NU2tegZizI7kA
LDx0g4lmZQZOsv2DR3ZpeLclsq9pYWCYga3P4+DJkVAAdJJ4qtk+XyKFMzyMhVOa
AyDJArSRXZsUFzbaMe1hlGknrSeXDTJHlZ7FjlpJHAudAohZlNCOEIS0VQzXwYjN
0FeR9rjinhaSOkMW1S3bBKssciCEs93a0oMASvHbzy4NGAM97YyRDZof1Iu3/jWX
20YRkSXyFhqSSRznJUh58kKk1f85B2ikySB4b3mZatPEiDJvF3I2KS/j3BFKoYE6
9JT+eP1+4FyEcobG6e8VE8BZw7+sw0A6eYisAbCLCV5oaQMdR946BEXZQKaFHDib
zKaveB4I/xbiLghYl61mMNSPtGD4czs+JR7/QEP5Q40VvwNJKjh1oHkt7hZQAgQQ
yv7JkstTpHQK/B8V6exEfBbCXQekItsVSrryPPW4gfmqncgyEVU=
=ROcp
-----END PGP SIGNATURE-----
Merge tag 'edac_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC updates from Borislav Petkov:
- Add Amazon's Annapurna Labs memory controller EDAC driver (Talel
Shenhar)
- New AMD CPUs support (Yazen Ghannam)
- The usual misc fixes and cleanups all over the subsystem
* tag 'edac_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/amd64: Set proper family type for Family 19h Models 20h-2Fh
EDAC/mc_sysfs: Add missing newlines when printing {max,dimm}_location
EDAC/aspeed: Use module_platform_driver() to simplify
EDAC, sb_edac: Simplify switch statement
EDAC/ti: Fix handling of platform_get_irq() error
EDAC/aspeed: Fix handling of platform_get_irq() error
EDAC/i5100: Fix error handling order in i5100_init_one()
EDAC/highbank: Handover Calxeda Highbank maintenance to Andre Przywara
EDAC/socfpga: Transfer SoCFPGA EDAC maintainership
EDAC/thunderx: Make symbol lmc_dfs_ents static
EDAC/al-mc-edac: Add Amazon's Annapurna Labs Memory Controller driver
dt-bindings: EDAC: Add Amazon's Annapurna Labs Memory Controller binding
EDAC/mce_amd: Add new error descriptions for existing types
EDAC: Replace HTTP links with HTTPS ones
AMD Family 19h Models 20h-2Fh use the same PCI IDs as Family 17h Models
70h-7Fh. The same family ops and number of channels also apply.
Use the Family17h Model 70h family_type and ops for Family 19h Models
20h-2Fh. Update the controller name to match the system.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201009171803.3214354-1-Yazen.Ghannam@amd.com
Commit:
da92110dfd ("EDAC, amd64_edac: Extend scrub rate support to F15hM60h")
added support for F15h, model 0x60 CPUs but in doing so, missed to read
back SCRCTRL PCI config register on F15h CPUs which are *not* model
0x60. Add that read so that doing
$ cat /sys/devices/system/edac/mc/mc0/sdram_scrub_rate
can show the previously set DRAM scrub rate.
Fixes: da92110dfd ("EDAC, amd64_edac: Extend scrub rate support to F15hM60h")
Reported-by: Anders Andersson <pipatron@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org> #v4.4..
Link: https://lkml.kernel.org/r/CAKkunMbNWppx_i6xSdDHLseA2QQmGJqj_crY=NF-GZML5np4Vw@mail.gmail.com
to fixup conflicts in arch/x86/kernel/cpu/mce/core.c so MCE specific follow
up patches can be applied without creating a horrible merge conflict
afterwards.
The variable ret is being assigned with a value that is never read
and it is being updated later with a new value. The initialization is
redundant so remove it.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200429154847.287001-1-colin.king@canonical.com
... because no one should be interested in spurious MCEs anyway. Make
the filtering unconditional and move it to amd_filter_mce().
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200407163414.18058-2-bp@alien8.de
The new macro set has a consistent namespace and uses C99 initializers
instead of the grufty C89 ones.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200320131509.673579000@linutronix.de
Pull RAS updates from Borislav Petkov:
- Misc fixes to the MCE code all over the place, by Jan H. Schönherr.
- Initial support for AMD F19h and other cleanups to amd64_edac, by
Yazen Ghannam.
- Other small cleanups.
* 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
EDAC/mce_amd: Make fam_ops static global
EDAC/amd64: Drop some family checks for newer systems
EDAC/amd64: Add family ops for Family 19h Models 00h-0Fh
x86/amd_nb: Add Family 19h PCI IDs
EDAC/mce_amd: Always load on SMCA systems
x86/MCE/AMD, EDAC/mce_amd: Add new Load Store unit McaType
x86/mce: Fix use of uninitialized MCE message string
x86/mce: Fix mce=nobootlog
x86/mce: Take action on UCNA/Deferred errors again
x86/mce: Remove mce_inject_log() in favor of mce_log()
x86/mce: Pass MCE message to mce_panic() on failed kernel recovery
x86/mce/therm_throt: Mark throttle_active_work() as __maybe_unused
On machines which do not populate all nodes with DIMMs, the driver
doesn't initialize an instance there. However, the instance removal
remove_one_instance() path will warn unconditionally, which is wrong.
Remove the WARN_ON() even if the warning is innocent because it causes a
splat in dmesg.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200117115939.5524-1-bp@alien8.de
In general, "pvt->umc != NULL" is used to check if the system is Family
17h+. However, there are a few places that are using direct family
checks.
Replace the remaining family checks with a check for "pvt->umc != NULL".
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200110015651.14887-6-Yazen.Ghannam@amd.com
Add family ops to support AMD Family 19h systems. Existing Family 17h
functions can be used. Also, add Family 19h to the list of families to
automatically load the module.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200110015651.14887-5-Yazen.Ghannam@amd.com
This message keeps flooding dmesg on boxes where ECC is disabled or the
DIMMs do not support ECC but the module gets auto-probed. What's even
worse is that autoprobing happens on every CPU due to the CPU-family
matching the driver does and uevent being generated for each CPU device.
What is more, this message is becoming even more useless on newer
systems where forcing ECC is not recommended and it should be done in
the BIOS so the BIOS can do all the necessary work, i.e., just setting a
bit in an MSR is not enough anymore.
So get rid of it.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Yazen Ghannam <yazen.ghannam@amd.com>
Cc: linux-edac@vger.kernel.org
Link: https://lkml.kernel.org/r/20191106160607.GC28380@zn.tnic
Return early before checking for ECC if the node does not have any
populated memory.
Free any cached hardware data before returning. Also, return 0 in this
case since this is not a failure. Other nodes may have memory and the
module should attempt to load an instance for them.
Move printing of hardware information to after the instance is
initialized, so that the information is only printed for nodes with
memory.
Return an error code when ECC is disabled. This check happens after
checking for memory. The module should explicitly fail to load if memory
is populated on a node and ECC is disabled.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Robert Richter <rrichter@marvell.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20191106012448.243970-6-Yazen.Ghannam@amd.com
...now that the data is available earlier.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Robert Richter <rrichter@marvell.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20191106012448.243970-5-Yazen.Ghannam@amd.com
The maximum number of memory controllers is fixed within a family/model
group. In most cases, this has been fixed at 2, but some systems may
have up to 8.
The struct amd64_family_type already contains family/model-specific
information, and this can be used rather than adding model checks to
various functions.
Create a new field in struct amd64_family_type for max_mcs.
Set this when setting other family type information, and use this when
needing the maximum number of memory controllers possible for a system.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Robert Richter <rrichter@marvell.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20191106012448.243970-4-Yazen.Ghannam@amd.com
Split out gathering hardware information from init_one_instance()
into a separate function hw_info_get(). This is necessary so that
the information can be cached earlier and used to check if memory is
populated and if ECC is enabled on a node.
Also, define a function hw_info_put() to back out changes made in
hw_info_get().
Check for an allocated PCI device (Function 0 for Family 17h or Function
1 for pre-Family 17h) before freeing, since hw_info_put() may be called
before PCI siblings are reserved.
Drop the family check when freeing pvt->umc. This will be NULL on
pre-Family 17h systems. However, kfree() is safe and will check for a
NULL pointer before freeing.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Robert Richter <rrichter@marvell.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20191106012448.243970-3-Yazen.Ghannam@amd.com
The struct amd64_family_type doesn't change between multiple nodes and
instances of the module, so make it global.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Robert Richter <rrichter@marvell.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20191106012448.243970-2-Yazen.Ghannam@amd.com
The following commit introduced a warning on error reports without a
non-zero grain value.
3724ace582 ("EDAC/mc: Fix grain_bits calculation")
The amd64_edac_mod module does not provide a value, so the warning will
be given on the first reported memory error.
Set the grain per DIMM to cacheline size (64 bytes). This is the current
recommendation.
Fixes: 3724ace582 ("EDAC/mc: Fix grain_bits calculation")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Robert Richter <rrichter@marvell.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20191022203448.13962-7-Yazen.Ghannam@amd.com
Add the new Family 17h Model 70h PCI IDs (device 18h functions 0 and 6)
to the AMD64 EDAC module.
[ bp: s/f17_base_addr_to_cs_size/f17_addr_mask_to_cs_size/g ]
Signed-off-by: Isaac Vaughn <isaac.vaughn@knights.ucf.edu>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: James Morse <james.morse@arm.com>
Cc: linux-edac@vger.kernel.org
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Robert Richter <rrichter@marvell.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20190906192131.8ced0ca112146f32d82b6cae@knights.ucf.edu
Future AMD systems will support asymmetric dual-rank DIMMs. These are
DIMMs where the ranks are of different sizes.
The even rank will use the Primary Even Chip Select registers and the
odd rank will use the Secondary Odd Chip Select registers.
Recognize if a Secondary Odd Chip Select is being used. Use the
Secondary Odd Address Mask when calculating the chip select size.
[ bp: move csrow_sec_enabled() to the header, fix CS_ODD define and
tone-down the capitalized words spelling. ]
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20190821235938.118710-8-Yazen.Ghannam@amd.com
AMD Family 17h systems have a set of secondary Chip Select Base
Addresses and Address Masks. These do not represent unique Chip
Selects, rather they are used in conjunction with the primary
Chip Select registers in certain cases.
Cache these secondary Chip Select registers for future use.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20190821235938.118710-7-Yazen.Ghannam@amd.com
AMD Family 17h systems currently require address translation in order to
report the system address of a DRAM ECC error. This is currently done
before decoding the syndrome information. The syndrome information does
not depend on the address translation, so the proper EDAC csrow/channel
reporting can function without the address. However, the syndrome
information will not be decoded if the address translation fails.
Decode the syndrome information before doing the address translation.
The syndrome information is architecturally defined in MCA_SYND and can
be considered robust. The address translation is system-specific and may
fail on newer systems without proper updates to the translation
algorithm.
Fixes: 713ad54675 ("EDAC, amd64: Define and register UMC error decode function")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20190821235938.118710-6-Yazen.Ghannam@amd.com
Chip Select memory size reporting on AMD Family 17h was recently fixed
in order to account for interleaving. However, the current method is not
robust.
The Chip Select Address Mask can be used to find the memory size. There
are a couple of cases.
1) For single-rank and dual-rank non-interleaved, use the address mask
plus 1 as the size.
2) For dual-rank interleaved, do #1 but "de-interleave" the address mask
first.
Always "de-interleave" the address mask in order to simplify the code
flow. Bit mask manipulation is necessary to check for interleaving, so
just go ahead and do the de-interleaving. In the non-interleaved case,
the original and de-interleaved address masks will be the same.
To de-interleave the mask, count the number of zero bits in the middle
of the mask and swap them with the most significant bits.
For example,
Original=0xFFFF9FE, De-interleaved=0x3FFFFFE
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20190821235938.118710-5-Yazen.Ghannam@amd.com
Currently, the DIMM info for AMD Family 17h systems is initialized in
init_csrows(). This function is shared with legacy systems, and it has a
limit of two channel support.
This prevents initialization of the DIMM info for a number of ranks, so
there will be missing ranks in the EDAC sysfs.
Create a new init_csrows_df() for Family17h+ and revert init_csrows()
back to pre-Family17h support.
Loop over all channels in the new function in order to support systems
with more than two channels.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20190821235938.118710-4-Yazen.Ghannam@amd.com
AMD Family 17h systems support x4 and x16 DRAM devices. However, the
device type is not checked when setting mci.edac_ctl_cap.
Set the appropriate capability flag based on the device type.
Default to x8 DRAM device when neither the x4 or x16 bits are set.
[ bp: reverse cpk_en check to save an indentation level. ]
Fixes: 2d09d8f301 ("EDAC, amd64: Determine EDAC MC capabilities on Fam17h")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20190821235938.118710-3-Yazen.Ghannam@amd.com