OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Mukul Joshi	d4300362a6	drm/amdkfd: Update interrupt handling for GFX 9.4.3 For GFX 9.4.3, interrupt handling needs to be updated for: - Interrupt cookie will have a NodeId field. Each KFD node needs to check the NodeId before processing the interrupt. - For CPX mode, there are additional checks of client ID needed to process the interrupt. - Add NodeId to the process drain interrupt. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2023-06-30 13:11:35 -04:00
Jonathan Kim	12fb1ad70d	drm/amdkfd: update process interrupt handling for debug events The debugger must be notified by any debugger subscribed exception that comes from hardware interrupts. If a debugger session exits, any exceptions it subscribed to may still have interrupts in the interrupt ring buffer or KGD/KFD pipeline. To prevent a new session from inheriting stale interrupts, when a new queue is created, open an interrupt drain and allow the IH ring to drain from a timestamped checkpoint. Then inject a custom IV so that once the custom IV is picked up by the KFD, it's safe to close the drain and proceed with queue creation. The drain must also be on debug disable as SW interrupts may still be processed. Drain at this time and clear all the exception status. The debugger may also not be attached nor subscibed to certain exceptions so forward them directly to the runtime. GFX10 also requires its own IV processing, hence the creation of kfd_int_process_v10.c. This is because the IV from SQ interrupts are packed into a new continguous format unlike GFX9. To make this clear, a separate interrupting handling code file was created. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2023-06-09 12:36:17 -04:00
Jonathan Kim	c2d2588c70	drm/amdkfd: add send exception operation Add a debug operation that allows the debugger to send an exception directly to runtime through a payload address. For memory violations, normal vmfault signals will be applied to notify runtime instead after passing in the saved exception data when a memory violation was raised to the debugger. For runtime exceptions, this will unblock the runtime enable function which will be explained and implemented in a follow up patch. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2023-06-09 12:36:01 -04:00
Mukul Joshi	8dc1db3172	drm/amdkfd: Introduce kfd_node struct (v5) Introduce a new structure, kfd_node, which will now represent a compute node. kfd_node is carved out of kfd_dev structure. kfd_dev struct now will become the parent of kfd_node, and will store common resources such as doorbells, GTT sub-alloctor etc. kfd_node struct will store all resources specific to a compute node, such as device queue manager, interrupt handling etc. This is the first step in adding compute partition support in KFD. v2: introduce kfd_node struct to gc v11 (Hawking) v3: make reference to kfd_dev struct through kfd_node (Morris) v4: use kfd_node instead for kfd isr/mqd functions (Morris) v5: rebase (Alex) Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Tested-by: Amber Lin <Amber.Lin@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Morris Zhang <Shiwu.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2023-06-09 09:42:27 -04:00
Mukul Joshi	cc009e613d	drm/amdkfd: Add KFD support for soc21 v3 Add initial support for soc21 in KFD compute driver (Mukul) - Add new definition for soc21 device. - Add new file for amdgpu-kfd interface for GFX11 family. - Add new file for queue management, interrupt handling, mqd management for GFX11 family in KFD driver. - Related changes/updates for soc21 device in KFD driver. - Repurpose last 2 entries of SDMA MQD for driver use. v2: Add an optional argument into update queue operation (Mukul) v3: Switch to ip version check, replace kgd_dev with amdgpu_device (Hawking) Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Oak Zeng <Oak.Zeng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-05-04 10:43:54 -04:00
Felix Kuehling	c3eb12dff0	drm/amdkfd: Ignore bogus signals from MEC efficiently MEC firmware sometimes sends signal interrupts without a valid context ID on end of pipe events that don't intend to signal any HSA signals. This triggers the slow path in kfd_signal_event_interrupt that scans the entire event page for signaled events. Detect these signals in the top half interrupt handler to stop processing them as early as possible. Because we now always treat event ID 0 as invalid, reserve that ID during process initialization. v2: Update firmware version checks to support more GPUs Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-04-25 17:05:48 -04:00
Tao Zhou	ed94aca6db	drm/amdkfd: print unmap queue status for RAS poison consumption (v3) Print the status out when it passes, and also tell user gpu reset is triggered when we fall back to legacy way. v2: make the message more explicit. v3: change succeeds to succeeded. replace pr_warn with dev_warn. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-03-25 12:40:26 -04:00
Tao Zhou	1990e29b19	drm/amdkfd: add RAS poison consumption handling for UTCL2 (v2) Do RAS page retirement and use gpu reset as fallback in UTCL2 fault handler. v2: replace vm fault event with posion consumed event in UTCL2 poison consumption. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-03-25 12:40:26 -04:00
Tao Zhou	9d8a8d78d9	drm/amdkfd: replace source_id with client_id for RAS poison consumption Client ID is more accruate here and we can deal with more different cases with client ID. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-03-25 12:40:25 -04:00
Tao Zhou	eed4197530	drm/amdkfd: refine event_interrupt_poison_consumption Combine reading and setting poison flag as one atomic operation and add print message for the function. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-03-25 12:40:25 -04:00
Tao Zhou	29b440d204	drm/amdkfd: add return value check for queue eviction Otherwise gpu reset will be triggered unconditionally in poison consumption. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-16 17:30:02 -05:00
Rajneesh Bhardwaj	2243f4937a	drm/amdkfd: Fix leftover errors and warnings A bunch of errors and warnings are leftover KFD over the years, attempt to fix the errors and most warnings reported by checkpatch tool. Still a few warnings remain which may be false positives so ignore them for now. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-14 15:08:40 -05:00
Rajneesh Bhardwaj	d87f36a063	drm/amdkfd: update SPDX license header Update the SPDX License header for all the KFD files. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-14 15:08:40 -05:00
Tao Zhou	b1c87b0874	drm/amdkfd: use unmap all queues for poison consumption Replace reset queue for specific PASID with unmap all queues, reset queue could break CP scheduler. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-09 14:15:07 -05:00
Tao Zhou	03e5b167bd	drm/amdkfd: rename kfd_process_vm_fault to kfd_dqm_evict_pasid As the function is used in more different cases, use a more general name. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-09 14:14:53 -05:00
yipechai	5b0ce2d41b	drm/amdkfd: enable sdma ecc interrupt event can be handled by event_interrupt_wq_v9 Enable sdma ecc interrupt event can be handled by event_interrupt_wq_v9. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-01-07 17:19:14 -05:00
Tao Zhou	b6485bed40	drm/amdkfd: reset queue which consumes RAS poison (v2) CP supports unmap queue with reset mode which only destroys specific queue without affecting others. Replacing whole gpu reset with reset queue mode for RAS poison consumption saves much time, and we can also fallback to gpu reset solution if reset queue fails. v2: Return directly if process is NULL; Reset queue solution is not applicable to SDMA, fallback to legacy way; Call kfd_unref_process after lookup process. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-12-28 16:02:59 -05:00
Graham Sider	f0dc99a6f7	drm/amdkfd: add kfd_device_info_init function Initializes kfd->device_info given either asic_type (enum) if GFX version is less than GFX9, or GC IP version if greater. Also takes in vf and the target compiler gfx version. Uses SDMA version to determine num_sdma_queues_per_engine. Convert device_info to a non-pointer member of kfd, change references accordingly. Change unsupported asic condition to only probe f2g, move device_info initialization post-switch. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-12-01 16:15:37 -05:00
Graham Sider	6bfc7c7e17	drm/amdkfd: replace kgd_dev in various amgpu_amdkfd funcs Modified definitions: - amdgpu_amdkfd_submit_ib - amdgpu_amdkfd_set_compute_idle - amdgpu_amdkfd_have_atomics_support - amdgpu_amdkfd_flush_gpu_tlb_pasid - amdgpu_amdkfd_flush_gpu_tlb_pasid - amdgpu_amdkfd_gpu_reset - amdgpu_amdkfd_alloc_gtt_mem - amdgpu_amdkfd_free_gtt_mem - amdgpu_amdkfd_alloc_gws - amdgpu_amdkfd_free_gws - amdgpu_amdkfd_ras_poison_consumption_handler Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-11-17 16:58:01 -05:00
Tao Zhou	c749094923	amd/amdkfd: add ras page retirement handling for sq/sdma (v3) In ras poison mode, page retirement will be handled by the irq handler of the module which consumes corrupted data. v2: rename ras_process_cb to ras_poison_consumption_handler. move the handler's implementation from ASIC specific file to common file. v3: call gpu reset for xGMI connected mode. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-10-04 15:22:57 -04:00
Hawking Zhang	4a1d4b6d38	drm/amdkfd: add sdma poison consumption handling Follow the same apporach as GFX to handle SDMA poison consumption. Send SIGBUS to application when receives SDMA_ECC interrupt and issue gpu reset either mode 2 or mode 1 to get the engine back Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Dennis Li<dennis.li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-06-07 14:57:24 -04:00
Dennis Li	e2b1f9f52b	drm/amdkfd: refine the poison data consumption handling The user applications maybe register the KFD_EVENT_TYPE_HW_EXCEPTION and KFD_EVENT_TYPE_MEMORY events, driver could notify them when poison data consumed. Beside that, some applications maybe register SIGBUS signal hander. These applications will handle poison data by themselves, exit or re-create context to re-dispatch works. Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-05-19 22:29:44 -04:00
Hawking Zhang	be9064b7bc	drm/amdgpu: remove unnecessary header include amdgpu.h is included in kfd_priv.h Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-04-28 23:35:50 -04:00
Dennis Li	20161e51dc	drm/amdkfd: add edc error interrupt handle for poison propogate mode In poison progogate mode, when driver receive the edc error interrupt from SQ, driver should kill the process by pasid which is using the poison data, and then trigger GPU reset. Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-04-20 21:35:13 -04:00
Oak Zeng	6d909c5da0	drm/amdkfd: Add kernel parameter to stop queue eviction on vm fault This is to keep wavefront context for debug purpose Signed-off-by: Oak Zeng <Oak.Zeng@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-03-23 22:59:22 -04:00
Tao Zhou	7af103ea87	drm/amdkfd: check more client ids in interrupt handler Add check for SExSH clients in kfd interrupt handler. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-01-08 15:18:25 -05:00
Alex Deucher	ae279f693c	drm/amdkfd: check both client id and src id in interrupt handlers We can have the same src ids for different client ids so make sure to check both the client id and the source id when handling interrupts. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2020-12-23 15:07:40 -05:00
Amber Lin	938a0650aa	drm/amdkfd: Provide SMI events watch When the compute is malfunctioning or performance drops, the system admin will use SMI (System Management Interface) tool to monitor/diagnostic what went wrong. This patch provides an event watch interface for the user space to register devices and subscribe events they are interested. After registered, the user can use annoymous file descriptor's poll function with wait-time specified and wait for events to happen. Once an event happens, the user can use read() to retrieve information related to the event. VM fault event is done in this patch. v2: - remove UNREGISTER and add event ENABLE/DISABLE - correct kfifo usage - move event message API to kfd_ioctl.h v3: send the event msg in text than in binary v4: support multiple clients v5: move events enablement from ioctl to fd write v6: sparse fix Signed-off-by: Amber Lin <Amber.Lin@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2020-07-15 13:27:34 -04:00
Aishwarya Ramakrishnan	8c8e1f6984	drm/amdkfd: Fix boolreturn.cocci warnings Return statements in functions returning bool should use true/false instead of 1/0. drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c:40:9-10: WARNING: return of 0/1 in function 'event_interrupt_isr_v9' with return type bool Generated by: scripts/coccinelle/misc/boolreturn.cocci Signed-off-by: Aishwarya Ramakrishnan <aishwaryarj100@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2020-05-21 12:37:19 -04:00
Yong Zhao	3fe023d42e	drm/amdkfd: Query vmid pasid mapping through stored info for non HWS Because we record the mapping under non HWS mode in the software, we can query pasid through vmid using the stored mapping instead of reading from ATC registers. This also prepares for the defeatured ATC block in future ASICs. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-10-03 09:11:03 -05:00
Yong Zhao	0ad8c5e296	drm/amdkfd: Support MMHUB1 in kfd interrupt path Handle interrupts for second mmhub. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-07-18 14:18:04 -05:00
Yong Zhao	a53a11a835	drm/amdkfd: Workaround PASID missing in gfx9 interrupt payload under non HWS This is a known gfx9 HW issue, and this change can perfectly workaround the issue. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2018-11-19 16:38:14 -05:00
Yong Zhao	00557f4131	drm/amdkfd: Adjust the debug message in KFD ISR This makes debug message get printed even when there is early return. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2018-11-19 16:38:14 -05:00
Lan Xiao	58e6988612	drm/amdkfd: fix zero reading of VMID and PASID for Hawaii Upon VM Fault, the VMID and PASID written by HW are zeros in Hawaii. Instead of reading from ih_ring_entry, read directly from the registers. This workaround fix the soft hang issues caused by mishandled VM Fault in Hawaii. Signed-off-by: Lan Xiao <Lan.Xiao@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>	2018-07-11 22:32:51 -04:00
shaoyunl	2640c3facb	drm/amdkfd: Handle VM faults in KFD 1. Pre-GFX9 the amdgpu ISR saves the vm-fault status and address per per-vmid. amdkfd needs to get the information from amdgpu through the new get_vm_fault_info interface. On GFX9 and later, all the required information is in the IH ring 2. amdkfd unmaps all queues from the faulting process and create new run-list without the guilty process 3. amdkfd notifies the runtime of the vm fault trap via EVENT_TYPE_MEMORY Signed-off-by: shaoyun liu <shaoyun.liu@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>	2018-07-11 22:32:50 -04:00
Felix Kuehling	c129db1206	drm/amdkfd: Add sanity checks in IRQ handlers Only accept interrupts from KFD VMIDs. Just checking for a PASID may not be enough because amdgpu started using PASIDs to map VM faults to processes. Warn if an IRQ doesn't have a valid PASID (indicating a firmware bug). Suggested-by: Shaoyun Liu <Shaoyun.Liu@amd.com> Suggested-by: Oak Zeng <Oak.Zeng@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>	2018-05-01 17:56:12 -04:00
Felix Kuehling	ca750681bc	drm/amdkfd: Add SOC15 interrupt processing support Signed-off-by: Shaoyun Liu <Shaoyun.Liu@amd.com> Signed-off-by: Oak Zeng <Oak.Zeng@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>	2018-04-10 17:33:10 -04:00

37 Commits