powerpc/eeh: Set channel state after notifying the drivers
When a PCI error is encountered 6th time in an hour we set the channel state to perm_failure and notify the driver about the permanent failure. However, after upstream commit38ddc01147
("powerpc/eeh: Make permanently failed devices non-actionable"), EEH handler stops calling any routine once the device is marked as permanent failure. This issue can lead to fatal consequences like kernel hang with certain PCI devices. Following log is observed with lpfc driver, with and without this change, Without this change kernel hangs, If PCI error is encountered 6 times for a device in an hour. Without the change EEH: Beginning: 'error_detected(permanent failure)' PCI 0132:60:00.0#600000: EEH: not actionable (1,1,1) PCI 0132:60:00.1#600000: EEH: not actionable (1,1,1) EEH: Finished:'error_detected(permanent failure)' With the change EEH: Beginning: 'error_detected(permanent failure)' EEH: Invoking lpfc->error_detected(permanent failure) EEH: lpfc driver reports: 'disconnect' EEH: Invoking lpfc->error_detected(permanent failure) EEH: lpfc driver reports: 'disconnect' EEH: Finished:'error_detected(permanent failure)' To fix the issue, set channel state to permanent failure after notifying the drivers. Fixes:38ddc01147
("powerpc/eeh: Make permanently failed devices non-actionable") Suggested-by: Mahesh Salgaonkar <mahesh@linux.ibm.com> Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20230209105649.127707-1-ganeshgr@linux.ibm.com
This commit is contained in:
parent
4f11410bf6
commit
9efcdaac36
|
@ -1065,10 +1065,10 @@ recover_failed:
|
|||
eeh_slot_error_detail(pe, EEH_LOG_PERM);
|
||||
|
||||
/* Notify all devices that they're about to go down. */
|
||||
eeh_set_channel_state(pe, pci_channel_io_perm_failure);
|
||||
eeh_set_irq_state(pe, false);
|
||||
eeh_pe_report("error_detected(permanent failure)", pe,
|
||||
eeh_report_failure, NULL);
|
||||
eeh_set_channel_state(pe, pci_channel_io_perm_failure);
|
||||
|
||||
/* Mark the PE to be removed permanently */
|
||||
eeh_pe_state_mark(pe, EEH_PE_REMOVED);
|
||||
|
@ -1185,10 +1185,10 @@ void eeh_handle_special_event(void)
|
|||
|
||||
/* Notify all devices to be down */
|
||||
eeh_pe_state_clear(pe, EEH_PE_PRI_BUS, true);
|
||||
eeh_set_channel_state(pe, pci_channel_io_perm_failure);
|
||||
eeh_pe_report(
|
||||
"error_detected(permanent failure)", pe,
|
||||
eeh_report_failure, NULL);
|
||||
eeh_set_channel_state(pe, pci_channel_io_perm_failure);
|
||||
|
||||
pci_lock_rescan_remove();
|
||||
list_for_each_entry(hose, &hose_list, list_node) {
|
||||
|
|
Loading…
Reference in New Issue