1908 Commits

Author SHA1 Message Date
Linus Torvalds bb511d4b25 Intel EDAC fixes:
- Old igen6 driver could lose pending events during initialization
 - Sapphire Rapids workstations have fewer memory controllers than their
   bigger siblings. This confused the driver.

Merge tag 'edac_updates_for_v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras

Pull intel EDAC fixes from Tony Luck:

 - Old igen6 driver could lose pending events during initialization

 - Sapphire Rapids workstations have fewer memory controllers than their
   bigger siblings. This confused the driver.

* tag 'edac_updates_for_v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
  EDAC/igen6: Fix the issue of no error events
  EDAC/i10nm: Skip the absent memory controllers
2023-08-30 19:23:00 -07:00
Linus Torvalds ef2a0b7cdb Devicetree include cleanups for v6.6:
These are the remaining few clean-ups of DT related includes which
 didn't get applied to subsystem trees.

Merge tag 'devicetree-header-cleanups-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux

Pull devicetree include cleanups from Rob Herring:
 "These are the remaining few clean-ups of DT related includes which
  didn't get applied to subsystem trees"

* tag 'devicetree-header-cleanups-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
  ipmi: Explicitly include correct DT includes
  tpm: Explicitly include correct DT includes
  lib/genalloc: Explicitly include correct DT includes
  parport: Explicitly include correct DT includes
  sbus: Explicitly include correct DT includes
  mux: Explicitly include correct DT includes
  macintosh: Explicitly include correct DT includes
  hte: Explicitly include correct DT includes
  EDAC: Explicitly include correct DT includes
  clocksource: Explicitly include correct DT includes
  sparc: Explicitly include correct DT includes
  riscv: Explicitly include correct DT includes
2023-08-30 17:04:28 -07:00
Linus Torvalds 1a7c611546 Perf events changes for v6.6:
- AMD IBS improvements
 - Intel PMU driver updates
 - Extend core perf facilities & the ARM PMU driver to better handle ARM big.LITTLE events
 - Micro-optimize software events and the ring-buffer code
 - Misc cleanups & fixes
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>

Merge tag 'perf-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf event updates from Ingo Molnar:

 - AMD IBS improvements

 - Intel PMU driver updates

 - Extend core perf facilities & the ARM PMU driver to better handle ARM big.LITTLE events

 - Micro-optimize software events and the ring-buffer code

 - Misc cleanups & fixes

* tag 'perf-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/uncore: Remove unnecessary ?: operator around pcibios_err_to_errno() call
  perf/x86/intel: Add Crestmont PMU
  x86/cpu: Update Hybrids
  x86/cpu: Fix Crestmont uarch
  x86/cpu: Fix Gracemont uarch
  perf: Remove unused extern declaration arch_perf_get_page_size()
  perf: Remove unused PERF_PMU_CAP_HETEROGENEOUS_CPUS capability
  arm_pmu: Remove unused PERF_PMU_CAP_HETEROGENEOUS_CPUS capability
  perf/x86: Remove unused PERF_PMU_CAP_HETEROGENEOUS_CPUS capability
  arm_pmu: Add PERF_PMU_CAP_EXTENDED_HW_TYPE capability
  perf/x86/ibs: Set mem_lvl_num, mem_remote and mem_hops for data_src
  perf/mem: Add PERF_MEM_LVLNUM_NA to PERF_MEM_NA
  perf/mem: Introduce PERF_MEM_LVLNUM_UNC
  perf/ring_buffer: Use local_try_cmpxchg in __perf_output_begin
  locking/arch: Avoid variable shadowing in local_try_cmpxchg()
  perf/core: Use local64_try_cmpxchg in perf_swevent_set_period
  perf/x86: Use local64_try_cmpxchg
  perf/amd: Prevent grouping of IBS events
2023-08-28 16:35:01 -07:00
Rob Herring 408d808893 EDAC: Explicitly include correct DT includes
The DT of_device.h and of_platform.h date back to the separate
of_platform_bus_type before it was merged into the regular platform bus.
As part of that merge prepping Arm DT support 13 years ago, they
"temporarily" include each other. They also include platform_device.h
and of.h. As a result, there's a pretty much random mix of those include
files used throughout the tree. In order to detangle these headers and
replace the implicit includes with struct declarations, users need to
explicitly include the correct includes.

Link: https://lore.kernel.org/r/20230714174434.4054728-1-robh@kernel.org
Signed-off-by: Rob Herring <robh@kernel.org>
2023-08-28 13:31:01 -05:00
Avadhut Naik c4d07c3712 EDAC/amd64: Add support for AMD family 1Ah models 00h-1Fh and 40h-4Fh
Add support for family 1Ah-based models 00h-1Fh and 40h-4Fh.

  [ bp: Simplify. ]

Signed-off-by: Avadhut Naik <Avadhut.Naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230809035244.2722455-4-avadhut.naik@amd.com
2023-08-10 14:25:21 +02:00
Peter Zijlstra 0cfd8fbadd x86/cpu: Fix Crestmont uarch
Sierra Forest and Grand Ridge are both E-core-only parts using the Crestmont
micro-architecture. They fit the pre-existing naming scheme perfectly
fine, so adhere to it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Hans de Goede <hdegoede@redhat.com>
Link: https://lore.kernel.org/r/20230807150405.757666627@infradead.org
2023-08-09 21:51:06 +02:00
Qiuxu Zhuo ce53ad81ed EDAC/igen6: Fix the issue of no error events
The current igen6_edac driver checks for pending errors before the
registration of the error handler. However, an error can occur during
the registration process, leading to unhandled pending errors and no
future error events. This issue can be reproduced by repeatedly
injecting errors while the igen6_edac module is loading.

Fix this issue by moving the pending error handler after the registration
of the error handler, ensuring that no pending errors are left unhandled.
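
The ordering problem described above can be illustrated with a small user-space model. This is only a sketch, not the igen6_edac code: struct fake_dev, inject_error() and probe() are hypothetical names, and the model assumes that a latched error suppresses further events until it is cleared, which is one plausible reading of "no future error events".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device model; none of these names come from igen6_edac. */
struct fake_dev {
    bool pending;                        /* error latched by the "hardware" */
    void (*handler)(struct fake_dev *);  /* registered handler, or NULL */
    int handled;                         /* errors processed so far */
};

/* Process and clear a latched error, if there is one. */
static void handle_error(struct fake_dev *d)
{
    if (d->pending) {
        d->pending = false;
        d->handled++;
    }
}

/* Assumption: a latched error suppresses further events until cleared. */
static void inject_error(struct fake_dev *d)
{
    if (d->pending)
        return;
    d->pending = true;
    if (d->handler)
        d->handler(d);
}

/* In this model an error always fires inside the registration window. */
static void probe(struct fake_dev *d, bool check_after_register)
{
    if (!check_after_register)
        handle_error(d);        /* old order: check first */
    inject_error(d);            /* error arrives before the handler exists */
    d->handler = handle_error;  /* register the handler */
    if (check_after_register)
        handle_error(d);        /* fixed order: check last */
}
```

With the old order the error stays latched and later injections are never handled; with the check moved after registration the latched error is drained and later errors reach the handler.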

Fixes: 10590a9d4f ("EDAC/igen6: Add EDAC driver for Intel client SoCs using IBECC")
Reported-by: Ee Wey Lim <ee.wey.lim@intel.com>
Tested-by: Ee Wey Lim <ee.wey.lim@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20230725080427.23883-1-qiuxu.zhuo@intel.com
2023-08-02 13:09:56 -07:00
Qiuxu Zhuo c545f5e412 EDAC/i10nm: Skip the absent memory controllers
Some Sapphire Rapids workstations' absent memory controllers
still appear as PCIe devices that fool the i10nm_edac driver
and result in "shift exponent -66 is negative" call traces
from skx_get_dimm_info().

Skip the absent memory controllers to avoid the call traces.
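
A minimal sketch of the "skip absent controllers" idea follows. The commit message does not say how absence is detected; the sketch assumes that every per-channel register of an absent controller reads back as all ones, and struct fake_imc, imc_absent() and count_usable() are hypothetical names rather than driver symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_CHANNELS 2
#define REG_ALL_ONES 0xFFFFFFFFu  /* assumed read-back from an absent device */

/* Hypothetical controller model: one sampled register per channel. */
struct fake_imc {
    uint32_t ch_reg[NUM_CHANNELS];
};

/* Treat a controller as absent only if every channel reads all ones. */
static bool imc_absent(const struct fake_imc *imc)
{
    for (size_t i = 0; i < NUM_CHANNELS; i++)
        if (imc->ch_reg[i] != REG_ALL_ONES)
            return false;
    return true;
}

/* Count the controllers that would be handed on to DIMM enumeration. */
static size_t count_usable(const struct fake_imc *imcs, size_t n)
{
    size_t usable = 0;

    for (size_t i = 0; i < n; i++) {
        if (imc_absent(&imcs[i]))
            continue;  /* skip: decoding its registers would yield garbage */
        usable++;
    }
    return usable;
}
```

Filtering before any field is decoded keeps bogus register contents away from later shift computations, which is where the reported call traces originated.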

Reported-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Closes: https://lore.kernel.org/linux-edac/CAAd53p41Ku1m1rapeqb1xtD+kKuk+BaUW=dumuoF0ZO3GhFjFA@mail.gmail.com/T/#m5de16dce60a8c836ec235868c7c16e3fefad0cc2
Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Reported-by: Koba Ko <koba.ko@canonical.com>
Closes: https://lore.kernel.org/linux-edac/SA1PR11MB71305B71CCCC3D9305835202892AA@SA1PR11MB7130.namprd11.prod.outlook.com/T/#t
Tested-by: Koba Ko <koba.ko@canonical.com>
Fixes: d4dc89d069 ("EDAC, i10nm: Add a driver for Intel 10nm server processors")
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20230710013232.59712-1-qiuxu.zhuo@intel.com
2023-07-24 08:57:26 -07:00
Linus Torvalds aa35a4835e - Add initial support for RAS hardware found on AMD server GPUs (MI200).
Those GPUs and CPUs are connected together through the coherent fabric
   and the GPU memory controllers report errors through x86's MCA so EDAC
   needs to support them. The amd64_edac driver now supports HBM (High
   Bandwidth Memory) and thus such heterogeneous memory controller
   systems
 
 - Other small cleanups and improvements

Merge tag 'ras_core_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull RAS updates from Borislav Petkov:

 - Add initial support for RAS hardware found on AMD server GPUs (MI200).

   Those GPUs and CPUs are connected together through the coherent
   fabric and the GPU memory controllers report errors through x86's MCA
   so EDAC needs to support them. The amd64_edac driver now supports HBM
   (High Bandwidth Memory) and thus such heterogeneous memory controller
   systems

 - Other small cleanups and improvements

* tag 'ras_core_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  EDAC/amd64: Cache and use GPU node map
  EDAC/amd64: Add support for AMD heterogeneous Family 19h Model 30h-3Fh
  EDAC/amd64: Document heterogeneous system enumeration
  x86/MCE/AMD, EDAC/mce_amd: Decode UMC_V2 ECC errors
  x86/amd_nb: Re-sort and re-indent PCI defines
  x86/amd_nb: Add MI200 PCI IDs
  ras/debugfs: Fix error checking for debugfs_create_dir()
  x86/MCE: Check a hw error's address to determine proper recovery action
2023-06-26 15:09:18 -07:00
Linus Torvalds e5ce2f196f - amd64_edac: Add support for Zen4 client hardware
- amd64_edac: Remove the version string as it is useless and actively
   confusing when looking at backported versions of the driver
 
 - Add a driver for the Nuvoton NPCM memory controller
 
 - A debugfs error checking cleanup

Merge tag 'edac_updates_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras

Pull EDAC updates from Borislav Petkov:

 - amd64_edac: Add support for Zen4 client hardware

 - amd64_edac: Remove the version string as it is useless and actively
   confusing when looking at backported versions of the driver

 - Add a driver for the Nuvoton NPCM memory controller

 - A debugfs error checking cleanup

* tag 'edac_updates_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
  EDAC/npcm: Add NPCM memory controller driver
  dt-bindings: memory-controllers: nuvoton: Add NPCM memory controller
  EDAC/thunderx: Check debugfs file creation retval properly
  EDAC/amd64: Add support for ECC on family 19h model 60h-7Fh
  EDAC/amd64: Remove module version string
2023-06-26 15:06:42 -07:00
Yazen Ghannam 4251566ebc EDAC/amd64: Cache and use GPU node map
AMD systems have historically provided an "AMD Node ID" that is a unique
identifier for each die in a multi-die package. This was associated with
a unique instance of the AMD Northbridge on a legacy system. And now it
is associated with a unique instance of the AMD Data Fabric on modern
systems. Each instance is referred to as a "Node"; this is an
AMD-specific term not to be confused with NUMA nodes.

The data fabric provides a number of interfaces accessible through a set
of functions in a single PCI device. There is one PCI device per Data
Fabric (AMD Node), and multi-die systems will see multiple such PCI
devices. The AMD Node ID matches a Node's position in the PCI hierarchy.
For example, Node 0 is accessed using the first PCI device, Node 1
is accessed using the second, and so on. A logical CPU can find its AMD
Node ID using CPUID. Furthermore, the AMD Node ID is used within the
hardware fabric, so it is not purely a logical value.

Heterogeneous AMD systems, with a CPU Data Fabric connected to GPU data
fabrics, follow a similar convention. Each CPU and GPU die has a unique
AMD Node ID value, and each Node ID corresponds to PCI devices in
sequential order.

However, there are two caveats:
1) GPUs are not x86, and they don't have CPUID to read their AMD Node ID
like on CPUs. This means the value is more implicit and based on PCI
enumeration and hardware-specifics.
2) There is a gap in the hardware values for AMD Node IDs. Values 0-7
are for CPUs and values 8-15 are for GPUs.

For example, a system with one CPU die and two GPU dies will have the
following values:
  CPU0 -> AMD Node 0
  GPU0 -> AMD Node 8
  GPU1 -> AMD Node 9

EDAC is the only subsystem where this has a practical effect. Memory
errors on AMD systems are commonly reported through MCA to a CPU on the
local AMD Node. The error information is passed along to EDAC where the
AMD EDAC modules use the AMD Node ID of the reporting logical CPU to access
AMD Node information.

However, memory errors from a GPU die will be reported to the CPU die.
Therefore, the logical CPU's AMD Node ID can't be used since it won't
match the AMD Node ID of the GPU die. The AMD Node ID of the GPU die is
provided as part of the MCA information, and the value will match the
hardware enumeration (e.g. 8-15).

Handle this situation by discovering GPU dies the same way as CPU dies
in the AMD NB code. But do a "node id" fixup in AMD64 EDAC where it's
needed.

The GPU data fabrics provide a register with the base AMD Node ID for
their local "type", i.e. GPU data fabric. This value is the same for all
fabrics of the same type in a system.

Read and cache the base AMD Node ID from one of the GPU devices during
module initialization. Use this to fix up the "node id" when reporting
memory errors at runtime.
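
As a rough illustration of the fixup, the sketch below maps a hardware AMD Node ID from the GPU range onto a dense logical index. It is not the driver's implementation: struct gpu_node_map_sketch and fixup_node_id_sketch() are hypothetical names, and placing the GPU nodes directly after the CPU nodes in the logical numbering is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cache of the values read once at module init. */
struct gpu_node_map_sketch {
    uint8_t cpu_node_count;  /* assumed: CPU nodes are numbered first */
    uint8_t base_node_id;    /* first hardware Node ID used by GPU dies */
};

/*
 * Return the logical node index for a memory error.
 * cpu_node_id: Node ID of the logical CPU that reported the error.
 * hw_node_id:  Node ID carried in the error information.
 * from_gpu:    whether the error came from a GPU memory controller.
 */
static int fixup_node_id_sketch(const struct gpu_node_map_sketch *map,
                                int cpu_node_id, int hw_node_id, bool from_gpu)
{
    /* CPU errors, and IDs below the GPU base, need no fixup. */
    if (!from_gpu || hw_node_id < map->base_node_id)
        return cpu_node_id;

    /* Close the gap between the CPU range and the GPU range. */
    return hw_node_id - map->base_node_id + map->cpu_node_count;
}
```

For the example system above (one CPU die, GPU dies at AMD Node 8 and 9), this sketch yields logical indices 1 and 2 for the two GPU dies under the stated assumption.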

  [ bp: Squash a fix making gpu_node_map static as reported by
        Tom Rix <trix@redhat.com>.
    Link: https://lore.kernel.org/r/20230610210930.174074-1-trix@redhat.com ]

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Co-developed-by: Muralidhara M K <muralidhara.mk@amd.com>
Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230515113537.1052146-6-muralimk@amd.com
2023-06-19 13:01:44 +02:00
Borislav Petkov (AMD) 852667c317 Merge ras/edac-drivers into for-next
* ras/edac-drivers:
  EDAC/npcm: Add NPCM memory controller driver
  dt-bindings: memory-controllers: nuvoton: Add NPCM memory controller

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2023-06-12 15:15:36 +02:00
Marvin Lin d244c610f1 EDAC/npcm: Add NPCM memory controller driver
Add driver for memory controller present on Nuvoton NPCM SoCs. The
memory controller supports single bit error correction and double bit
error detection.

Signed-off-by: Marvin Lin <milkfafa@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230111093245.318745-4-milkfafa@gmail.com
2023-06-12 15:14:10 +02:00
Borislav Petkov (AMD) 0a81fa5d74 Merge ras/edac-misc into for-next
* ras/edac-misc:
  EDAC/thunderx: Check debugfs file creation retval properly

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2023-06-07 10:50:07 +02:00
Yeqi Fu bf5c04ddd3 EDAC/thunderx: Check debugfs file creation retval properly
edac_debugfs_create_file() returns ERR_PTR by way of the respective
debugfs function it calls, if an error occurs.

The appropriate way to check for errors is to use IS_ERR(). Do so.
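
The difference between a NULL check and an error-pointer check can be shown in user space. The helpers below only mimic the kernel's ERR_PTR()/IS_ERR() convention (error codes encoded in the top 4095 pointer values); the sketch_ names and create_file() are hypothetical and not part of any kernel API.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static void *sketch_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

/* True if the pointer lies in the range reserved for error codes. */
static bool sketch_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Hypothetical creator that reports failure as an error pointer. */
static void *create_file(bool fail)
{
    static int file;

    return fail ? sketch_err_ptr(-ENOMEM) : (void *)&file;
}

/* Wrong check: an error pointer is not NULL, so failure goes unnoticed. */
static bool created_ok_null_check(const void *p)
{
    return p != NULL;
}

/* Right check for this convention. */
static bool created_ok_is_err(const void *p)
{
    return !sketch_is_err(p);
}
```

A failed create_file(true) passes the NULL check but is caught by the error-pointer check, which is the class of mistake the commit addresses.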

  [ bp: Rewrite all text. ]

Signed-off-by: Yeqi Fu <asuk4.q@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230517173111.365787-1-asuk4.q@gmail.com
2023-06-06 23:04:56 +02:00
Muralidhara M K 9c42edd571 EDAC/amd64: Add support for AMD heterogeneous Family 19h Model 30h-3Fh
AMD Family 19h Model 30h-3Fh systems can be connected to AMD MI200
accelerator/GPU devices such that the CPU and GPU data fabrics are
connected together. In this configuration, the CPU manages error logging
and reporting for MCA banks located on the GPUs. This includes HBM memory
errors reported from Unified Memory Controllers (UMCs) on the GPUs.
The GPU memory errors are handled like CPU memory errors.

AMD CPU UMC support in EDAC can be re-used for GPU UMC support. However,
keeping them separate means drastic changes in one path (e.g. to support
newer products) should have less impact on the other path.

Also, simplify the "gpu_" helper functions where possible. GPU product
configuration, like memory type and channel count, is fixed compared to
CPU products.

GPU UMCs each have four physical connections (phys) connected to eight
channels. There is a single "chip select". This differs from CPUs where
each UMC has one physical connection connected to one channel, and each
channel has up to four "chip selects".

Enumerate each UMC "phy" as an EDAC CSROW, since there is only a single
chip select for each physical connection. This is similar to how a CPU
UMC "phy" is enumerated as an EDAC CHANNEL, since there is only a single
channel for each physical connection.

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230515113537.1052146-5-muralimk@amd.com
2023-06-05 12:27:18 +02:00
Yazen Ghannam c35977b00f x86/MCE/AMD, EDAC/mce_amd: Decode UMC_V2 ECC errors
The MI200 (Aldebaran) series of devices introduced a new SMCA bank type
for Unified Memory Controllers. The MCE subsystem already has support
for this new type. The MCE decoder module will decode the common MCA
error information for the new bank type, but it will not pass the
information to the AMD64 EDAC module for detailed memory error decoding.

Have the MCE decoder module recognize the new bank type as an SMCA UMC
memory error and pass the MCA information to AMD64 EDAC.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Co-developed-by: Muralidhara M K <muralidhara.mk@amd.com>
Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230515113537.1052146-3-muralimk@amd.com
2023-06-05 12:27:11 +02:00
Manivannan Sadhasivam cbd77119b6 EDAC/qcom: Get rid of hardcoded register offsets
The LLCC EDAC register offsets vary between SoCs. Hardcoding the
register offsets won't work and will often result in a crash due to
accessing the wrong locations.

Hence, get the register offsets from the LLCC driver matching the
individual SoCs.
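
One common way to structure this is a per-version offset table that the provider selects and the consumer merely dereferences. The sketch below is illustrative only: the struct, field names, version numbers and offset values are all made up and do not correspond to real LLCC registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register layout; field names and values are made up. */
struct edac_reg_offsets {
    uint32_t err_status;
    uint32_t err_clear;
};

static const struct edac_reg_offsets offsets_v1 = {
    .err_status = 0x100, .err_clear = 0x104,
};
static const struct edac_reg_offsets offsets_v2 = {
    .err_status = 0x200, .err_clear = 0x208,
};

/* The provider picks the table once, based on the hardware version. */
static const struct edac_reg_offsets *select_offsets(int hw_version)
{
    switch (hw_version) {
    case 1:
        return &offsets_v1;
    case 2:
        return &offsets_v2;
    default:
        return NULL;  /* unknown version: refuse rather than guess */
    }
}

/* The consumer computes addresses from the table, never from constants. */
static uint32_t status_reg_addr(uint32_t base, const struct edac_reg_offsets *o)
{
    return base + o->err_status;
}
```

Keeping the table with the code that knows the hardware version means a new SoC only needs a new table entry, and the consumer cannot silently touch a wrong address.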

Cc: <stable@vger.kernel.org> # 6.0: 5365cea199 ("soc: qcom: llcc: Rename reg_offset structs to reflect LLCC version")
Cc: <stable@vger.kernel.org> # 6.0: c13d7d261e ("soc: qcom: llcc: Pass LLCC version based register offsets to EDAC driver")
Cc: <stable@vger.kernel.org> # 6.0
Fixes: a6e9d7ef25 ("soc: qcom: llcc: Add configuration data for SM8450 SoC")
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
Link: https://lore.kernel.org/r/20230517114635.76358-3-manivannan.sadhasivam@linaro.org
2023-05-26 20:56:55 -07:00
Manivannan Sadhasivam 3d49f7406b EDAC/qcom: Remove superfluous return variable assignment in qcom_llcc_core_setup()
The "ret" variable is assigned in both the success and failure cases, so
there is no need to initialize it at the start of qcom_llcc_core_setup().

Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
Link: https://lore.kernel.org/r/20230517114635.76358-2-manivannan.sadhasivam@linaro.org
2023-05-26 20:56:54 -07:00
Hristo Venev 6c79e42169 EDAC/amd64: Add support for ECC on family 19h model 60h-7Fh
Ryzen 9 7950X uses model 61h. Treat it as Epyc 9004, but with 2 channels
instead of 12.

With two 32GB dual-rank DIMMs the sizes appear to be reported correctly:

  EDAC MC0: Giving out device to module amd64_edac controller F19h_M60h: DEV 0000:00:18.3 (INTERRUPT)
  EDAC amd64: F19h_M60h detected (node 0).
  EDAC MC: UMC0 chip selects:
  EDAC amd64: MC: 0:     0MB 1:     0MB
  EDAC amd64: MC: 2: 16384MB 3: 16384MB
  EDAC MC: UMC1 chip selects:
  EDAC amd64: MC: 0:     0MB 1:     0MB
  EDAC amd64: MC: 2: 16384MB 3: 16384MB
  AMD64 EDAC driver v3.5.0

ECC errors can also be detected:

  mce: [Hardware Error]: Machine check events logged
  [Hardware Error]: Corrected error, no action required.
  [Hardware Error]: CPU:0 (19:61:2) MC21_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000400011b
  [Hardware Error]: Error Addr: 0x00000007ff7e93c0
  [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x000100010a801203
  [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
  EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:64 syndrome:0x1)
  [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

According to Mario Limonciello, the same code should also work for
models 70h-7Fh (follow thread in Link).

  [ bp: Massage, the translation logic updates are pending. ]

Signed-off-by: Hristo Venev <hristo@venev.name>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://lore.kernel.org/r/20230425201239.324476-1-hristo@venev.name
Link: https://lore.kernel.org/r/20230511174506.875153-2-hristo@venev.name
2023-05-15 16:32:47 +02:00
Yazen Ghannam b34348a0d7 EDAC/amd64: Remove module version string
The AMD64 EDAC module version information is not exposed through ABI
like MODULE_VERSION(). Instead it is printed during module init.

Version numbers can be confusing in cases where module updates are
partly backported resulting in a difference between upstream and
backported module versions.

Remove the AMD64 EDAC module version information to avoid user
confusion.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230410190959.3367528-1-yazen.ghannam@amd.com
2023-05-10 15:49:52 +02:00
Linus Torvalds 556eb8b791 Driver core changes for 6.4-rc1
Here is the large set of driver core changes for 6.4-rc1.
 
 Once again, a busy development cycle, with lots of changes happening in
 the driver core in the quest to be able to move "struct bus" and "struct
 class" into read-only memory, a task now complete with these changes.
 
 This will make the future rust interactions with the driver core more
 "provably correct" as well as providing more obvious lifetime rules for
 all busses and classes in the kernel.
 
 The changes required for this did touch many individual classes and
 busses as many callbacks were changed to take const * parameters
 instead.  All of these changes have been submitted to the various
 subsystem maintainers, giving them plenty of time to review, and most of
 them actually did so.
 
 Other than those changes, included in here are a small set of other
 things:
   - kobject logging improvements
   - cacheinfo improvements and updates
   - obligatory fw_devlink updates and fixes
   - documentation updates
   - device property cleanups and const * changes
   - firmware loader dependency fixes.
 
 All of these have been in linux-next for a while with no reported
 problems.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Merge tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the large set of driver core changes for 6.4-rc1.

  Once again, a busy development cycle, with lots of changes happening
  in the driver core in the quest to be able to move "struct bus" and
  "struct class" into read-only memory, a task now complete with these
  changes.

  This will make the future rust interactions with the driver core more
  "provably correct" as well as providing more obvious lifetime rules
  for all busses and classes in the kernel.

  The changes required for this did touch many individual classes and
  busses as many callbacks were changed to take const * parameters
  instead. All of these changes have been submitted to the various
  subsystem maintainers, giving them plenty of time to review, and most
  of them actually did so.

  Other than those changes, included in here are a small set of other
  things:

   - kobject logging improvements

   - cacheinfo improvements and updates

   - obligatory fw_devlink updates and fixes

   - documentation updates

   - device property cleanups and const * changes

   - firmware loader dependency fixes.

  All of these have been in linux-next for a while with no reported
  problems"

* tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (120 commits)
  device property: make device_property functions take const device *
  driver core: update comments in device_rename()
  driver core: Don't require dynamic_debug for initcall_debug probe timing
  firmware_loader: rework crypto dependencies
  firmware_loader: Strip off \n from customized path
  zram: fix up permission for the hot_add sysfs file
  cacheinfo: Add use_arch[|_cache]_info field/function
  arch_topology: Remove early cacheinfo error message if -ENOENT
  cacheinfo: Check cache properties are present in DT
  cacheinfo: Check sib_leaf in cache_leaves_are_shared()
  cacheinfo: Allow early level detection when DT/ACPI info is missing/broken
  cacheinfo: Add arm64 early level initializer implementation
  cacheinfo: Add arch specific early level initializer
  tty: make tty_class a static const structure
  driver core: class: remove struct class_interface * from callbacks
  driver core: class: mark the struct class in struct class_interface constant
  driver core: class: make class_register() take a const *
  driver core: class: mark class_release() as taking a const *
  driver core: remove incorrect comment for device_create*
  MIPS: vpe-cmp: remove module owner pointer from struct class usage.
  ...
2023-04-27 11:53:57 -07:00
Linus Torvalds a907047732 ARM: SoC drivers for v6.4
The most notable updates this time are for Qualcomm Snapdragon platforms.
 The Inline-Crypto-Engine gets a new DT binding and driver. A number of
 drivers now support additional Snapdragon variants, in particular the
 rsc, scm, geni, bwm, glink and socinfo, while the llcc (edac) and rpm
 drivers get notable functionality updates.
 
 Updates on other platforms include:
 
  - Various updates to the Mediatek mutex and mmsys drivers, including
    support for the Helio X10 SoC
 
  - Support for unidirectional mailbox channels in Arm SCMI firmware
 
  - Support for per cpu asynchronous notification in OP-TEE firmware
 
  - Minor updates for memory controller drivers.
 
  - Minor updates for Renesas, TI, Amlogic, Apple, Broadcom, Tegra,
    Allwinner, Versatile Express, Canaan, Microchip, Mediatek and i.MX
    SoC drivers, mainly updating the use of MODULE_LICENSE() macros and
    obsolete DT driver interfaces.

Merge tag 'soc-drivers-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc

Pull ARM SoC driver updates from Arnd Bergmann:
 "The most notable updates this time are for Qualcomm Snapdragon
  platforms. The Inline-Crypto-Engine gets a new DT binding and driver,
  and a number of drivers now support additional Snapdragon variants, in
  particular the rsc, scm, geni, bwm, glink and socinfo, while the llcc
  (edac) and rpm drivers get notable functionality updates.

  Updates on other platforms include:

   - Various updates to the Mediatek mutex and mmsys drivers, including
     support for the Helio X10 SoC

   - Support for unidirectional mailbox channels in Arm SCMI firmware

   - Support for per-CPU asynchronous notification in OP-TEE firmware

   - Minor updates for memory controller drivers.

   - Minor updates for Renesas, TI, Amlogic, Apple, Broadcom, Tegra,
     Allwinner, Versatile Express, Canaan, Microchip, Mediatek and i.MX
     SoC drivers, mainly updating the use of MODULE_LICENSE() macros and
     obsolete DT driver interfaces"

* tag 'soc-drivers-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (165 commits)
  soc: ti: smartreflex: Simplify getting the opam_sr pointer
  bus: vexpress-config: Add explicit of_platform.h include
  soc: mediatek: Kconfig: Add MTK_CMDQ dependency to MTK_MMSYS
  memory: mtk-smi: mt8365: Add SMI Support
  dt-bindings: memory-controllers: mediatek,smi-larb: add mt8365
  dt-bindings: memory-controllers: mediatek,smi-common: add mt8365
  memory: tegra: read values from correct device
  dt-bindings: crypto: Add Qualcomm Inline Crypto Engine
  soc: qcom: Make the Qualcomm UFS/SDCC ICE a dedicated driver
  dt-bindings: firmware: document Qualcomm QCM2290 SCM
  soc: qcom: rpmh-rsc: Support RSC v3 minor versions
  soc: qcom: smd-rpm: Use GFP_ATOMIC in write path
  soc/tegra: fuse: Remove nvmem root only access
  soc/tegra: cbb: tegra194: Use of_address_count() helper
  soc/tegra: cbb: Remove MODULE_LICENSE in non-modules
  ARM: tegra: Remove MODULE_LICENSE in non-modules
  soc/tegra: flowctrl: Use devm_platform_get_and_ioremap_resource()
  soc: tegra: cbb: Drop empty platform remove function
  firmware: arm_scmi: Add support for unidirectional mailbox channels
  dt-bindings: firmware: arm,scmi: Support mailboxes unidirectional channels
  ...
2023-04-25 12:02:16 -07:00
Borislav Petkov (AMD) ce8ac91130 Merge branches 'edac-drivers', 'edac-amd64' and 'edac-misc' into edac-updates
Combine all queued EDAC changes for submission into v6.4:

* ras/edac-drivers:
  EDAC/i10nm: Add Intel Sierra Forest server support
  EDAC/skx: Fix overflows on the DRAM row address mapping arrays

* ras/edac-amd64: (27 commits)
  EDAC/amd64: Fix indentation in umc_determine_edac_cap()
  EDAC/amd64: Add get_err_info() to pvt->ops
  EDAC/amd64: Split dump_misc_regs() into dct/umc functions
  EDAC/amd64: Split init_csrows() into dct/umc functions
  EDAC/amd64: Split determine_edac_cap() into dct/umc functions
  EDAC/amd64: Rename f17h_determine_edac_ctl_cap()
  EDAC/amd64: Split setup_mci_misc_attrs() into dct/umc functions
  EDAC/amd64: Split ecc_enabled() into dct/umc functions
  EDAC/amd64: Split read_mc_regs() into dct/umc functions
  EDAC/amd64: Split determine_memory_type() into dct/umc functions
  EDAC/amd64: Split read_base_mask() into dct/umc functions
  EDAC/amd64: Split prep_chip_selects() into dct/umc functions
  EDAC/amd64: Rework hw_info_{get,put}
  EDAC/amd64: Merge struct amd64_family_type into struct amd64_pvt
  EDAC/amd64: Do not discover ECC symbol size for Family 17h and later
  EDAC/amd64: Drop dbam_to_cs() for Family 17h and later
  EDAC/amd64: Split get_csrow_nr_pages() into dct/umc functions
  EDAC/amd64: Rename debug_display_dimm_sizes()

* ras/edac-misc:
  EDAC/altera: Remove MODULE_LICENSE in non-module
  EDAC: Sanitize MODULE_AUTHOR strings
  EDAC/amd81[13]1: Remove trailing newline from MODULE_AUTHOR
  EDAC/i5100: Fix typo in comment
  EDAC/altera: Remove redundant error logging

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2023-04-24 09:14:30 +02:00
Qiuxu Zhuo 96ae3995c6 EDAC/i10nm: Add Intel Sierra Forest server support
The Sierra Forest CPU model uses memory controller registers similar to
those of the Granite Rapids server. Add the Sierra Forest CPU model ID for
EDAC support.

Tested-by: Li Zhang <li4.zhang@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20230410131531.11914-1-qiuxu.zhuo@intel.com
2023-04-10 09:33:51 -07:00
Yang Li 49aba1c589 EDAC/amd64: Fix indentation in umc_determine_edac_cap()
Use consistent indentation to improve the readability and fix:

  drivers/edac/amd64_edac.c:1279 umc_determine_edac_cap() warn: inconsistent indenting

Fixes: f6a4b4a1aa ("EDAC/amd64: Split determine_edac_cap() into dct/umc functions")
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230404022557.46409-1-yang.lee@linux.alibaba.com
2023-04-04 17:22:55 +02:00
Nick Alcock e088d80e2a EDAC/altera: Remove MODULE_LICENSE in non-module
Since

  8b41fc4454 ("kbuild: create modules.builtin without Makefile.modbuiltin or tristate.conf"),

MODULE_LICENSE declarations are used to identify modules. As
a consequence, uses of the macro in non-modules will cause modprobe to
misidentify their containing object file as a module when it is not
(false positives), and modprobe might succeed rather than failing with
a suitable error message.

altera_edac has not been a module for a while now, so remove the macro call.

Suggested-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230217141059.392471-24-nick.alcock@oracle.com
2023-04-01 13:18:50 +02:00
Borislav Petkov (AMD) 371b27f2f3 EDAC: Sanitize MODULE_AUTHOR strings
Fixup the remaining MODULE_AUTHOR strings to not contain newlines.
Shorten and unbreak others.

No functional changes.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230328134309.23159-1-bp@alien8.de
2023-03-28 15:43:30 +02:00
Jonathan Neuschäfer 01db1030f1 EDAC/amd81[13]1: Remove trailing newline from MODULE_AUTHOR
MODULE_AUTHOR strings don't usually include a newline character.

Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230129165054.1675554-1-j.neuschaefer@gmx.net
2023-03-28 15:26:52 +02:00
Muralidhara M K b3ece3a6a2 EDAC/amd64: Add get_err_info() to pvt->ops
GPU Nodes will use a different method to determine the chip select
and channel of an error. A function pointer should be used rather than
introduce another branching condition.

Prepare for this by adding get_err_info() to pvt->ops. This function is
only called from the modern code path, so a legacy function is not
defined.

Make sure to call this after MCA_STATUS[SyndV] is checked, since the
csrow value is found in MCA_SYND.

  [ Yazen: rebased/reworked patch and reworded commit message. ]

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230127170419.1824692-23-yazen.ghannam@amd.com
2023-03-24 13:03:21 +01:00
Muralidhara M K f6f36382d6 EDAC/amd64: Split dump_misc_regs() into dct/umc functions
Add a function pointer to pvt->ops.

No functional change is intended.

  [ Yazen: Rebased/reworked patch and reworded commit message. ]

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230127170419.1824692-22-yazen.ghannam@amd.com
2023-03-24 13:03:21 +01:00
Muralidhara M K 6fb8b5fb9e EDAC/amd64: Split init_csrows() into dct/umc functions
Call them from their respective setup_mci_misc_attrs() paths.

Also, drop the check for an "empty" device, i.e. one without memory.
This is redundant and already done in instance_has_memory() earlier in
the init path.

No functional change is intended.

  [ Yazen: rebased/reworked patch and reworded commit message. ]

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230127170419.1824692-21-yazen.ghannam@amd.com
2023-03-24 13:03:21 +01:00
Muralidhara M K f6a4b4a1aa EDAC/amd64: Split determine_edac_cap() into dct/umc functions
Call them from their respective setup_mci_misc_attrs() paths.

No functional change is intended.

  [ Yazen: rebased/reworked patch and reworded commit message. ]

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230127170419.1824692-20-yazen.ghannam@amd.com
2023-03-24 13:03:21 +01:00
Yazen Ghannam 9369239e8d EDAC/amd64: Rename f17h_determine_edac_ctl_cap()
...to match the "umc_" prefix convention.

No functional change is intended.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230127170419.1824692-19-yazen.ghannam@amd.com
2023-03-24 13:03:20 +01:00
Muralidhara M K 0a42a37f65 EDAC/amd64: Split setup_mci_misc_attrs() into dct/umc functions
The init_one_instance() path is shared between legacy and modern
systems. So add the new functions to a function pointer in pvt->ops.

No functional change is intended.

  [ Yazen: Rebased/reworked patch and reworded commit message. ]

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230127170419.1824692-18-yazen.ghannam@amd.com
2023-03-24 13:03:20 +01:00
Muralidhara M K eb2bcdfc37 EDAC/amd64: Split ecc_enabled() into dct/umc functions
Call them using a function pointer in pvt->ops. The "ECC enabled"
check is done outside of the hardware information gathering done in
hw_info_get(). So a high-level function pointer is needed to separate
the legacy and modern paths.

No functional change is intended.

  [ Yazen: rebased/reworked patch and reworded commit message. ]

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230127170419.1824692-17-yazen.ghannam@amd.com
2023-03-24 13:03:20 +01:00
Muralidhara M K 32ecdf8688 EDAC/amd64: Split read_mc_regs() into dct/umc functions
Call them from their respective hw_info_get() paths.

ECC symbol size is not needed on UMC systems, so determine_ecc_sym_sz()
is left out of the UMC path. Do not save TOP_MEM* values on modern
controllers because they're not needed there (read: they were used only
for debugging, if anything).

  [ Yazen: rebased/reworked patch and reworded commit message. ]

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230127170419.1824692-16-yazen.ghannam@amd.com
2023-03-24 13:03:20 +01:00
Muralidhara M K 78ec161a91 EDAC/amd64: Split determine_memory_type() into dct/umc functions
Call them from their respective hw_info_get() paths.

Call them after all other hardware registers have been saved, since the
memory type for a device will be determined based on the saved
information.

  [ Yazen: rebased/reworked patch and reworded commit message. ]

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230127170419.1824692-15-yazen.ghannam@amd.com
2023-03-24 13:03:20 +01:00
Muralidhara M K b29dad9bf3 EDAC/amd64: Split read_base_mask() into dct/umc functions
Call them from their respective hw_info_get() paths.

Call the new functions after setting the chip select base and mask
counts, since those are needed to read the correct number of chip select
base and mask registers. And call the new functions before the remaining
set up, because the base and mask register values will be needed later.

  [ Yazen: rebased/reworked patch and reworded commit message. ]

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230127170419.1824692-14-yazen.ghannam@amd.com
2023-03-24 13:03:20 +01:00
Muralidhara M K 637f60ef2c EDAC/amd64: Split prep_chip_selects() into dct/umc functions
Call them from their respective hw_info_get() function. Avoid the
need for family/model-based function pointers.

Add the calls before reading hardware registers from the memory
controllers, since the number of chip select bases and masks needs to be
known first.

  [ Yazen: rebased/reworked patch and reworded commit message. ]

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230127170419.1824692-13-yazen.ghannam@amd.com
2023-03-24 13:03:20 +01:00
Yazen Ghannam 9a97a7f4d7 EDAC/amd64: Rework hw_info_{get,put}
The bulk of system-specific information is gathered at init time with
hw_info_get(). This function calls a number of helper functions, and
many of these helper functions are split between a modern UMC/DF path
and a legacy DCT path.

Split hw_info_get() into legacy and modern versions. This creates two
separate code paths early on, and legacy and modern helper functions can
be called directly in the appropriate code path.

Also, simplify hw_info_put() and share it between legacy and modern
systems. NULL pointer checks are done in pci_dev_put() and kfree(), so
they can be called unconditionally.
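
As a rough userspace illustration of why the NULL checks can go away (not the actual patch): the sketch below uses C's free(), which is defined to do nothing for NULL, in place of kfree(), and a hypothetical fake_dev_put() in place of pci_dev_put(). The struct layout and all names are assumptions made for the example.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; not the kernel's types. */
struct fake_dev {
	int refcount;
};

/*
 * NULL-tolerant release helper, standing in for the NULL check that the
 * commit message says pci_dev_put() performs internally.
 */
void fake_dev_put(struct fake_dev *dev)
{
	if (dev)
		dev->refcount--;
}

struct pvt {
	struct fake_dev *f0;	/* may be NULL if init failed early */
	struct fake_dev *f3;	/* may be NULL if init failed early */
	void *umc;		/* may be NULL on systems without it */
};

/* One teardown routine for every system type: no per-member NULL tests. */
void hw_info_put(struct pvt *pvt)
{
	fake_dev_put(pvt->f0);
	fake_dev_put(pvt->f3);
	free(pvt->umc);		/* free(NULL) is a no-op */
	pvt->umc = NULL;
}
```

Because every release call tolerates NULL, the same routine works whether or not a given member was ever populated, which is what allows one version to be shared.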

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230127170419.1824692-12-yazen.ghannam@amd.com
2023-03-24 13:03:20 +01:00
Muralidhara M K ed623d55ee EDAC/amd64: Merge struct amd64_family_type into struct amd64_pvt
Future AMD systems will support heterogeneous "AMD Node" types, e.g.
CPU and GPU types. Therefore, a global family type shared across all
AMD nodes is no longer appropriate.

Move struct low_ops routines and members of struct amd64_family_type
to struct amd64_pvt.

Currently, there are many code branches that split between "modern" and
"legacy" systems. Another code branch will be needed in order to cover
GPU cases. However, rather than introduce another branching case in
multiple functions, the current branching code should be switched to a
set of function pointers. This change makes the code more readable and
simplifies adding support for new families/models.

In order to reuse code, define two sets of function pointers. Use one
for modern systems (Family 17h and later). This will not change between
current CPU families. Use another set of function pointers for legacy
systems (before Family 17h). Use the Family 16h versions as default
for the legacy ops since these are the latest, and adjust the function
pointers as needed for older families.
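
A minimal userspace sketch of the pattern described above, under stated assumptions: all type, function and field names are simplified, hypothetical stand-ins rather than the definitions in amd64_edac.c, and the per-instance copy of the ops table is a choice made for this example only.

```c
#include <stddef.h>

/* Illustrative only: none of these are the real amd64_edac definitions. */
enum hw_path { PATH_NONE, PATH_DCT, PATH_UMC };

struct pvt;

struct low_ops {
	int (*hw_info_get)(struct pvt *pvt);
	int (*dbam_to_cs)(unsigned int cs_mode);	/* legacy only */
};

struct pvt {
	unsigned int fam;
	enum hw_path path;	/* records which path ran */
	struct low_ops ops;	/* per-instance copy of an ops table */
};

static int umc_hw_info_get(struct pvt *pvt) { pvt->path = PATH_UMC; return 0; }
static int dct_hw_info_get(struct pvt *pvt) { pvt->path = PATH_DCT; return 0; }

/* Placeholder bodies so the two legacy variants are distinguishable. */
static int f16_dbam_to_cs(unsigned int cs_mode) { return (int)cs_mode + 16; }
static int k8_dbam_to_cs(unsigned int cs_mode) { return (int)cs_mode + 8; }

/* One table for modern systems, one (Family 16h defaults) for legacy. */
static const struct low_ops umc_ops = {
	.hw_info_get = umc_hw_info_get,
};

static const struct low_ops dct_ops = {
	.hw_info_get = dct_hw_info_get,
	.dbam_to_cs = f16_dbam_to_cs,
};

void per_family_init(struct pvt *pvt, unsigned int fam)
{
	pvt->fam = fam;
	pvt->path = PATH_NONE;

	if (fam >= 0x17) {		/* modern: same table for all */
		pvt->ops = umc_ops;
		return;
	}

	pvt->ops = dct_ops;		/* legacy defaults */
	if (fam == 0xf)			/* adjust for an older family */
		pvt->ops.dbam_to_cs = k8_dbam_to_cs;
}
```

Callers then invoke pvt->ops.hw_info_get(pvt) without testing the family again, so supporting a further node type would mean adding a table instead of another branch in each function.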

  [ Yazen: rebased/reworked patch and reworded commit message. ]
  [ bp: Fix rev8 or later check. ]

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230127170419.1824692-11-yazen.ghannam@amd.com
2023-03-24 13:03:19 +01:00
Yazen Ghannam 5a1adb375d EDAC/amd64: Do not discover ECC symbol size for Family 17h and later
The ECC symbol size was needed on legacy systems to look up the ECC syndrome.
This is not needed on modern systems because the ECC syndrome is explicitly
provided in the MCA information.

Remove the ECC symbol size discovery code for modern UMC-based systems.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230127170419.1824692-10-yazen.ghannam@amd.com
2023-03-24 13:03:19 +01:00
Yazen Ghannam a2e59ab8e9 EDAC/amd64: Drop dbam_to_cs() for Family 17h and later
The same function is used to calculate chip select size for all Zen-based
family/models. Therefore, a family/model function pointer is not necessary.

Drop the dbam_to_cs() function pointer for Family 17h and later systems.
Also, move the Family 17h function to avoid a forward declaration. Rename
it to indicate that the UMC Address Mask is used rather than the legacy
DBAM value.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230127170419.1824692-9-yazen.ghannam@amd.com
2023-03-24 13:03:19 +01:00
Yazen Ghannam c0984666fd EDAC/amd64: Split get_csrow_nr_pages() into dct/umc functions
Split get_csrow_nr_pages() into legacy and modern versions in preparation
for further legacy/modern refactoring.

Also, rename f17_get_cs_mode() to match the new convention.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230127170419.1824692-8-yazen.ghannam@amd.com
2023-03-24 13:03:19 +01:00
Yazen Ghannam 00e4feb8c0 EDAC/amd64: Rename debug_display_dimm_sizes()
Use the "dct" and "umc" prefixes for legacy and modern versions
respectively.

Also, move the "dct" version to avoid a forward declaration, and fixup
some checkpatch warnings in the process.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230127170419.1824692-7-yazen.ghannam@amd.com
2023-03-24 12:54:47 +01:00
Jongwoo Han 5b6cb45072 EDAC/i5100: Fix typo in comment
Correct typo from 'preform' to 'perform' in comment.

Signed-off-by: Jongwoo Han <jongwooo.han@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230302021120.56794-1-jongwooo.han@gmail.com
2023-03-23 12:04:04 +01:00
Greg Kroah-Hartman cb4a0bec0b EDAC/sysfs: move to use bus_get_dev_root()
Direct access to the struct bus_type dev_root pointer is going away soon
so replace that with a call to bus_get_dev_root() instead, which is what
it is there for.

Cc: Borislav Petkov <bp@alien8.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: James Morse <james.morse@arm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Robert Richter <rric@kernel.org>
Cc: linux-edac@vger.kernel.org
Link: https://lore.kernel.org/r/20230313182918.1312597-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-22 09:25:49 +01:00
Deepak R Varma 4e89780a4c EDAC/altera: Remove redundant error logging
A call to platform_get_irq() already prints an error on failure within
its own implementation. So printing another error based on its return
value in the caller is redundant and should be removed. The cleanup
also makes the braces around the if block unnecessary. Remove them as well.
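
The resulting caller shape can be sketched in userspace as follows. This is an assumption-laden illustration, not the altera_edac code: fake_get_irq() is a hypothetical stand-in for platform_get_irq() that, like the real helper according to the message above, reports its own failure, and the IRQ number 42 is arbitrary.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for a helper that logs its own failures. */
int fake_get_irq(int index)
{
	if (index != 0) {
		fprintf(stderr, "IRQ index %d not found\n", index);
		return -ENXIO;
	}
	return 42;	/* arbitrary IRQ number for the sketch */
}

/*
 * Caller after the cleanup: no second error message, and the
 * single-statement if body needs no braces.
 */
int fake_probe(int index, int *irq_out)
{
	int irq = fake_get_irq(index);

	if (irq < 0)
		return irq;

	*irq_out = irq;
	return 0;
}
```

The error is still logged exactly once, by the helper, and the negative return value is passed up unchanged.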

Issue identified using platform_get_irq.cocci coccinelle semantic patch.

Signed-off-by: Deepak R Varma <drv@mailo.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/Y/+j27kqdhflPtaj@ubun2204.myguest.virtualbox.org
2023-03-21 22:50:46 +01:00
Manivannan Sadhasivam 721d3e91bf qcom: llcc/edac: Support polling mode for ECC handling
Not all Qcom platforms support IRQ mode for ECC handling. For those
platforms, the current EDAC driver will not be probed due to missing ECC
IRQ in devicetree.

So add support for polling mode so that the EDAC driver can be used on all
Qcom platforms supporting LLCC.

The polling delay of 5000 ms is chosen based on the Qcom downstream/vendor
driver.
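
A minimal sketch of the fallback decision, assuming hypothetical names throughout (struct ecc_ctl, setup_ecc_reporting() and the ECC_OP_* values are invented for illustration and are not the qcom_edac definitions). Only the 5000 ms interval comes from the message above.

```c
/* Polling interval quoted in the commit message, in milliseconds. */
#define ECC_POLL_MSEC 5000

/* Hypothetical names for this sketch; not the driver's own. */
enum ecc_op_state { ECC_OP_INT, ECC_OP_POLL };

struct ecc_ctl {
	enum ecc_op_state state;
	int irq;		/* valid only in interrupt mode */
	unsigned int poll_msec;	/* non-zero only in polling mode */
};

/*
 * Use the ECC interrupt when the platform describes one; otherwise
 * fall back to periodic polling instead of refusing to probe.
 */
void setup_ecc_reporting(struct ecc_ctl *ctl, int ecc_irq)
{
	if (ecc_irq > 0) {
		ctl->state = ECC_OP_INT;
		ctl->irq = ecc_irq;
		ctl->poll_msec = 0;
	} else {
		ctl->state = ECC_OP_POLL;
		ctl->irq = -1;
		ctl->poll_msec = ECC_POLL_MSEC;
	}
}
```

With this shape a missing IRQ no longer aborts setup; it only selects the slower reporting path.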

Reported-by: Luca Weiss <luca.weiss@fairphone.com>
Tested-by: Luca Weiss <luca.weiss@fairphone.com>
Tested-by: Steev Klimaszewski <steev@kali.org> # Thinkpad X13s
Tested-by: Andrew Halaney <ahalaney@redhat.com> # sa8540p-ride
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
Link: https://lore.kernel.org/r/20230314080443.64635-14-manivannan.sadhasivam@linaro.org
2023-03-15 15:17:08 -07:00