Bank 4 is reserved on family 0x17 and shouldn't generate any MCE
records. However, broken hardware and software is not something unheard
of so warn about bank 4 errors. They shouldn't be coming from bank 4
naturally but users can still use mce_amd_inj to simulate errors from it
for testing purposed.
Also, avoid special handling in the injector mce_amd_inj like it is
being done on the older families.
[ bp: Rewrite commit message and merge into one patch. Use boot_cpu_data. ]
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com>
Link: http://lkml.kernel.org/r/1473384591-5323-1-git-send-email-Yazen.Ghannam@amd.com
Link: http://lkml.kernel.org/r/1473384591-5323-2-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The MCA_SYND and MCA_IPID registers contain valuable information and
should be included in MCE output. The MCA_SYND register contains
syndrome and other error information, and the MCA_IPID register will
uniquely identify the MCA bank's type without having to rely on system
software.
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1472680624-34221-2-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Scalable MCA defines a number of IP types. An MCA bank on an SMCA
system is defined as one of these IP types. A bank's type is uniquely
identified by the combination of the HWID and MCATYPE values read from
its MCA_IPID register.
Add the required tables in order to be able to lookup error descriptions
based on a bank's type and the error's extended error code.
[ bp: Align comments, simplify a bit. ]
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1472741832-1690-1-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The error descriptions defined for Fam17h can be reused for other SMCA
systems, so their names should reflect this.
Change f17h prefix to smca for error descriptions.
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1472673994-12235-4-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Print SyndV bit status and print the raw value of the MCA_SYND register.
Further decoding of the syndrome from struct mce.synd can be done in
other places where appropriate, e.g. DRAM ECC.
Boris: make the error stanza more compact by putting the error address
and syndrome on the same line:
[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:2 (17:0:0) MC4_STATUS[-|CE|-|PCC|AddrV|-|-|SyndV|CECC]: 0x96204100001e0117
[Hardware Error]: Error Addr: 0x000000007f4c52e3, Syndrome: 0x0000000000000000
[Hardware Error]: Invalid IP block specified.
[Hardware Error]: cache level: L3/GEN, tx: DATA, mem-tx: RD
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1467633035-32080-2-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use X86_FEATURE_SMCA when detecting if SMCA is available instead of
directly using CPUID 0x80000007_EBX.
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1462971509-3856-7-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For Scalable MCA enabled processors, errors are listed per IP block. And
since it is not required for an IP to map to a particular bank, we need
to use HWID and McaType values from the MCx_IPID register to figure out
which IP a given bank represents.
We also have a new bit (TCC) in the MCx_STATUS register to indicate Task
context is corrupt.
Add logic here to decode errors from all known IP blocks for Fam17h
Model 00-0fh and to print TCC errors.
[ Minor fixups. ]
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1457021458-2522-3-git-send-email-Aravind.Gopalakrishnan@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, when decoding an MCE, we display 'CE' for a Deferred error, like
this:
[Hardware Error]: CPU:0 (15:2:0) MC4_STATUS[Over|CE|MiscV|-|AddrV|Deferred|-|UECC]: 0xdc04b00095080813
When the 'UC' bit in the MCx_STATUS register is clear, the error status
is either a Corrected error or Deferred error as determined by the
'Deferred' bit. So do not print 'CE' on a deferred error.
Refer to AMD Error Scope Hierarchy table in a newer BKDG (example:
49125_15h_Models_30h-3Fh_BKDG.pdf, section "RAS Features").
Signed-off-by: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1436788382-6463-1-git-send-email-aravind.gopalakrishnan@amd.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Write out MCx_ADDR into the more humanly readable "MCx Error Address"
and remove double colon in the output.
Cc: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
295d8cda26 ("EDAC, MCE, AMD: Drop local coreid reporting") removed the
code snippet which used that mask but forgot to drop the mask itself. Do
that now.
Signed-off-by: Borislav Petkov <bp@suse.de>
We want to still be able to issue some error information on systems for
which there is no decoding support (think older distro kernels here,
for example). Therefore, we allow module registration but skip the
per-family bank-specific decoders and issue the general information
only, i.e.:
[ 46.822828] [Hardware Error]: Error Status: Uncorrected, software containable error.
[ 46.822846] [Hardware Error]: CPU:0 (15:30:0) MC0_STATUS[-|UE|-|-|-|-|-]: 0xa000000000010f0f
[ 46.822858] [Hardware Error]: cache level: L3/GEN, mem/io: GEN, mem-tx: GEN, part-proc: GEN (timed out)
with the hope that it still contains helpful useful bits.
Suggested-by: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com>
Tested-by: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com>
Link: http://lkml.kernel.org/r/1392659391-2411-1-git-send-email-Aravind.Gopalakrishnan@amd.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Add a new error signature for Family 15h, models 30h-3fh. Patch has been
tested on Fam15h using mce_amd_inj facility and has been verified to
work correctly.
Signed-off-by: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com>
[ cleanup commit message and error string ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Initially, those strings describing different parts of an MCE message
were shared with amd64_edac and were therefore exported to modules.
However, all except pp_msgs are used only in one place right now so hide
them and make them static.
No functionality change.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Borislav Petkov <bp@alien8.de>
Add MCE decoding logic for AMD Family 16h processors.
Boris:
- drop unneeded uu_msgs export
- exit early in cat_mc1_mce and save us an indentation level
Signed-off-by: Jacob Shin <jacob.shin@amd.com>
Signed-off-by: Borislav Petkov <bp@alien8.de>
Currently only AMD Family 15h processors have special handling for MC2
errors. Since upcoming Family 16h will also need unique handling, let's
make MC2 handling part of amd_decoder_ops.
Signed-off-by: Jacob Shin <jacob.shin@amd.com>
Signed-off-by: Borislav Petkov <bp@alien8.de>
It is very useful to have the family/model/stepping with the reported
error so dump it. This saves us asking the bug reporter about it.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Having the functional unit names in each bank decode is only misleading
as this code supports multiple families and there's no guarantee the
mapping between FUs and MCE banks will stay the same.
And also, knowing the functional unit name doesn't help much since you
end up looking at the respective BKDG anyway.
So drop all FU references and use the MC bank numbers instead.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
MCA details seldom change inbetween the models of a family so don't
be too conservative and enable decoding on everything starting from
K8 onwards. Minor adjustments can come in later but most importantly,
we have some decoding infrastructure in place for upcoming models by
default.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
... so that checkpatch can chill out.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Reviewed-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Correct their formulation, replace per-family functions with a single,
unified lookup table.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Reviewed-by: Andreas Herrmann <andreas.herrmann3@amd.com>
This MC1 error signature is called differently now, fix it.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Reviewed-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Use "System Read Data Error" as a more general name for MC0 bus errors
on F15h and update some error definitions.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Reviewed-by: Andreas Herrmann <andreas.herrmann3@amd.com>
No functionality change, this is done so that in a follow-on patch all
queued-up MCEs can be decoded after registering on the chain.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Drop third nbcfg argument which is old remains and not required anymore.
No functionality change.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
MCE decoding code is reporting the core which encountered the error
unconditionally now so drop this piece. Besides, it reported the
coreid in the local processor package which is not that valuable as a
datapoint.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
The MCi_STATUS bank has a AddrV bit which, when set, denotes that the
corresponding MCi_ADDR MSR contains a valid address belonging to the
MCE currently being reported. Dump it since it is definitely relevant
information.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Currently, correctable ECCs go through mcelog and do not print the scary
MCE banner. In that case, however, reporting the core where the CECC
happened is important information so dump it along with the decoded
string albeit at risk of having a minor redundancy.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Add the PCI device ids required for driver registration. Remove
pvt->ctl_name and use the family descriptor directly, instead. Then,
bump driver version and fixup its format. Finally, enable DRAM ECC
decoding on F15h.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Remove reporting of errors with UC bit set - this is done by the MCE
decoding code anyway and this driver deals with DRAM ECC errors only. UC
(NB uncorrectable error) doesn't necessarily mean it is a DRAM error.
Remove unused macros while at it.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Minor formatting fixup since the information which core was associated
with the MCE is not always valid.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Building for X86_32 produces shift count warnings, so use BIT_64() to
eliminate the warnings.
drivers/edac/mce_amd.c:778: warning: left shift count >= width of type
drivers/edac/mce_amd.c:778: warning: left shift count >= width of type
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Doug Thompson <dougthompson@xmission.com>
Cc: bluesmoke-devel@lists.sourceforge.net
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Now that everything is inplace, enable MCE decoding on F15h. Make
initcall routine a bit more readable.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Shorten up MCi_STATUS flags and add BD's new deferred and poison types.
Also, simplify formatting.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
MCE bank 2 is redefined from a BU to a CU (Combined Unit) bank on F15h.
Add a decoder function for CU MCEs.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Add a decoder for F15h DC MCEs to support the new types of DC MCEs
introduced by the BD microarchitecture.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
F15h enlarges the extended error code of an MCE to a 5-bit field
(MCi_STATUS[20:16]). Add a mask variable which default 0xf is overridden
on F15h.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Fix
drivers/edac/mce_amd.c:262: warning: left shift count >= width of type
on 32-bit builds.
Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>