OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Jiapeng Chong	18b6731538	habanalabs: Fix kernel-doc Fix the following W=1 kernel warnings: drivers/misc/habanalabs/common/mmu/mmu_v1.c:425: warning: expecting prototype for hl_mmu_fini(). Prototype was for hl_mmu_v1_fini() instead. drivers/misc/habanalabs/common/mmu/mmu_v1.c:449: warning: expecting prototype for hl_mmu_ctx_init(). Prototype was for hl_mmu_v1_ctx_init() instead. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:21 +03:00
Jiapeng Chong	858e6d4085	habanalabs: Fix kernel-doc Fix the following W=1 kernel warnings: drivers/misc/habanalabs/common/pci/pci.c:454: warning: expecting prototype for hl_fw_fini(). Prototype was for hl_pci_fini() instead. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:21 +03:00
Dan Carpenter	a43a9f6777	habanalabs: fix double unlock on error in map_device_va() If hl_mmu_prefetch_cache_range() fails then this code calls mutex_unlock(&ctx->mmu_lock) when it's no longer holding the mutex. Fixes: `9e495e2400` ("habanalabs: do MMU prefetch as deferred work") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:21 +03:00
Tal Cohen	90de680526	habanalabs: use separate structure info for each error collect data Create separate info structure for each error type. The structures shall be used inside the large structure that contains the last session error. This is more scalable for adding more errors in the future. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:21 +02:00
Yuri Nudelman	f873a27fd5	habanalabs: fix missing handle shift during mmap During mmap operation on the unified memory manager buffer, the vma page offset is shifted to extract the handle value. Due to a typo, it was not shifted back at the end. That could cause the offset to be modified after mmap operation, that may affect subsequent operations. In addition, in allocation flow, in case of out of memory error, idr would not be correctly destroyed, again because of a missing shift. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:21 +02:00
Ohad Sharabi	e31dd9362f	habanalabs: remove hdev from hl_ctx_get args This argument is unused by the function. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:21 +02:00
Ohad Sharabi	9e495e2400	habanalabs: do MMU prefetch as deferred work When user requests to prefetch the MMU translations, the driver will not block the user until prefetch is done. Instead, the prefetch work will be delegated to a WQ which will do it in the background. This way, the prefetch may progress without blocking the user at all. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:21 +02:00
Yuri Nudelman	83617f5a87	habanalabs: order memory manager messages Changing format of memory manager messages to make it more readable. In addition, reducing the priority of a warning on missing handle during put. This scenario is not an indication of a problem and may happen in a legal flow, when handle is put from multiple flows. For example, in timeout and completion. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:21 +02:00
Oded Gabbay	804d514d79	habanalabs: return -EFAULT on copy_to_user error If copy_to_user failed in info ioctl, we always return -EFAULT so the user will know there was an error. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:20 +02:00
Oded Gabbay	49d2a8af97	habanalabs: use NULL for eventfd eventfd is pointer. As such, it should be initialized to NULL, not to 0. In addition, no need to initialize it after creation because the entire structure is zeroed-out. Also, no need to initialize it before release because the entire structure is freed. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:20 +02:00
Oded Gabbay	368b0b4fd6	habanalabs: update firmware header Update cpucp_if.h to latest version. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:20 +02:00
Tal Cohen	422ef17103	habanalabs: add support for notification via eventfd The driver will be able to send notification events towards a user process, using user's registered event file descriptor. The driver uses the notification mechanism to inform the user about an occurred event. A user thread can wait until a notification is received from the driver. The driver stores the occurred event until the user reads it, using HL_INFO_GET_EVENTS - new ioctl opcode in the INFO ioctl. Gaudi specific implementation includes sending a notification on a TPC assertion event that is received from f/w. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:20 +02:00
Yuri Nudelman	f2daa2d97e	habanalabs: add topic to memory manager buffer Currently, buffers from multiple flows pass through the same infra. This way, in logs, we are unable to distinguish between buffers that came from separate flows. To address this problem, add a "topic" to buffer behavior descriptor - a string identifier that will be used to identify in logs the flow this buffer relates to. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:20 +02:00
Dani Liberman	c37803388c	habanalabs: handle race in driver fini Scenario: 1. During hard reset, driver executes device_kill_open_processes. 2. Drivers file descriptor is not closed yet (user process is alive), hence we are starting loop on all open file descriptors. 3. Just before getting task struct of user process, according to pid, SIGKILL is sent to the user process, hence get_pid_task fails, driver prints a warning and device_kill_open_processes returns an error. 4. Returned error causing driver fini do disable the device object of the process which causes a kernel crash. The fix is to handle this case not as an error and continue fini flow as normal, since the killed process (by the SIGKILL) will release its resources just like it will do when the driver sends him the sigkill. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:20 +02:00
Dafna Hirschfeld	0688474eda	habanalabs: add device memory scrub ability through debugfs Add the ability to scrub the device memory with a given value. Add file 'dram_mem_scrub_val' to set the value and a file 'dram_mem_scrub' to scrub the dram. This is very important to help during automated tests, when you want the CI system to randomize the memory before training certain DL topologies. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:20 +02:00
Yuri Nudelman	829ec038c9	habanalabs: use unified memory manager for CB flow With the new code required for the flow added, we can now switch to using the new memory manager infrastructure, removing the old code. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:19 +02:00
Yuri Nudelman	dc653c36c9	habanalabs: unified memory manager new code for CB flow This commit adds the new code needed for command buffer flow using the new unified memory manager, without changing the actual functionality. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:19 +02:00
Ofir Bitton	2db04a6826	habanalabs/gaudi: set arbitration timeout to a high value In certain workloads, arbitration timeout might expire although no actual issue present. Hence, we set timeout to a very high value. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:19 +02:00
Yuri Nudelman	ff086c186b	habanalabs: add put by handle method to memory manager Putting object by its handle and not by object pointer is useful in some finalization flows that do not have object pointer available. It eliminates the need to first get the object and then perform put twice. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:19 +02:00
Yuri Nudelman	4e63ce6af6	habanalabs: hide memory manager page shift The new unified memory manager uses page offset to pass buffer handle during the mmap operation. One problem with this approach is that it requires the handle to always be divisible by the page size, else, the user would not be able to pass it correctly as an argument to the mmap system call. Previously, this was achieved by shifting the handle left after alloc operation, and shifting it right before get operation. This was done in the user code. This creates code duplication, and, what's worse, requires some knowledge from the user regarding the handle internal structure, hurting the encapsulation. This patch encloses all the page shifts inside memory manager functions. This way, the user can take the handle as a black box, and simply use it, without any concert about how it actually works. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:19 +02:00
farah kassabri	de3484dfaa	habanalabs: Add separate poll interval value for protocol Currently we're using the same poll interval value for both COMMs protocol(for sending a command and waits for an ACK) and the device CPU boot phases status waits. On COMMs protocol this interval should be much lower than the device CPU boot which may take long time to change status. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:19 +02:00
Dani Liberman	b0b09b7a8b	habanalabs: use get_task_pid() to take PID find_get_pid() isn't good in case the user process was run inside docker. As a result, we didn't had the PID and we couldn't kill the user process in case the device got stuck and we needed to reset the device. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:18 +02:00
Ohad Sharabi	5d1a0de2c7	habanalabs: add prefetch flag to the MAP operation This patch let the user decide whether the translations done in the page tables will be fetched directly to the STLB right after the map. We want to let the user control whether to perform prefetch upon map operation. To do so a memory flag was added, to be used in the MAP ioctl, called HL_MEM_PREFETCH and if set- the mappings will be fetched directly to the STLB after map operation. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:18 +02:00
Robin Murphy	77c97a7ea8	habanalabs: Stop using iommu_present() Even if an IOMMU might be present for some PCI segment in the system, that doesn't necessarily mean it provides translation for the device we care about. Replace iommu_present() with a more appropriate check. Signed-off-by: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:18 +02:00
Moti Haimovski	0ff1d6f8f5	habanalabs: support debugfs Byte access to device DRAM The habanalabs HW requires memory resources to be used by its internal hardware structures. These structures are allocated and initialized by the driver. We would like to use the device HBM for that purpose. This memory is io-remapped and accessed using the writel()/writeb()/writew() commands. Since some of the HW structures are one byte in size we need to add support for the writeb() and readb() functions in the driver. Signed-off-by: Moti Haimovski <mhaimovski@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:18 +02:00
Ohad Sharabi	ab4ea58728	habanalabs: use for_each_sgtable_dma_sg for dma sgt Instead of using for_each_sg when iterating sgt that contains dma entries, use the more proper for_each_sgtable_dma_sg macro. In addition, both Goya and Gaudi have the exact same implementation of the asic function that encapsulate the usage of this macro, so it is better to move that implementation to the common code. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:18 +02:00
Rajaravi Krishna Katta	b8d852add6	habanalabs/gaudi: use lower_32_bits() for casting Use standard kernel macro to take lower 32 bits of 64-bits variable. Signed-off-by: Rajaravi Krishna Katta <rkatta@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:18 +02:00
Ohad Sharabi	2ba75d3119	habanalabs: refactor HOP functions in MMU V1 Take advantage of the HOPs shift/masks now defined as arrays. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:17 +02:00
Rajaravi Krishna Katta	b31848430f	habanalabs: fix comments according to kernel-doc Incorrect/Missing doxygen tag Signed-off-by: Rajaravi Krishna Katta <rkatta@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:17 +02:00
Oded Gabbay	658591ec32	habanalabs: remove user interrupt debug print As user interrupts are a common use case, this dump pollutes the dmesg log, hence removing it. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:17 +02:00
Oded Gabbay	c82b025f2b	habanalabs: don't print normal reset operations Only a hard-reset is an unexpected event which should be notify in the kernel log. Other resets are normal operations and therefore we should not pollute the log with them. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:17 +02:00
Oded Gabbay	738607f005	habanalabs: change a reset print to debug level Currently we have two reset prints per reset. One is in the common code and one in each asic-specific file. We can change the asic-specific message to be debug only as we can know the type of reset being done according to the print in the common code, which is also easier to maintain. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:17 +02:00
Oded Gabbay	fcadbf5688	habanalabs: remove redundant info print Halting compute engines is a print that doesn't add us any information because it is always done in the reset process and not used elsewhere. Even if it was, we don't use prints to mark functions we passed through. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:17 +02:00
Yuri Nudelman	cd92c3678a	habanalabs: wrong handle removal in memory manager During the unified memory manager release, a wrong id was used to remove an entry from the idr. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:16 +02:00
Dafna Hirschfeld	799b9eb01a	habanalabs: remove debugfs read/write callbacks The debugfs memory access now uses the callback 'access_dev_mem' so there is no use of the callbacks 'debugfs_{read32,read64,write32,write6}'. Remove them. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:16 +02:00
Dafna Hirschfeld	9248aa90d2	habanalabs: enforce alignment upon registers access through debugfs When accessing the configuration registers through debugfs, it is only allowed to access aligned address. Fail if address is not aligned. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:16 +02:00
Dafna Hirschfeld	ee8a10c833	habanalabs: unify code for memory access from debugfs Currently each asic version implements 4 callbacks: 'debugfs_{read32/write32/read64/write64}' There is a lot of code duplication among the different callbacks of all asic versions. This patch unify the code in order to avoid the code duplication by iterating the pci_mem_region array in hl_device and use its fields instead of macros. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:16 +02:00
Dafna Hirschfeld	234366d3b6	habanalabs: add callback and field to be used for debugfs refactor This is a preparation for unifying the code of accessing device memory through debugfs. Add struct fields and callbacks that will later be used in debugfs code and will reduce code duplication among the different read{32,64}/write{32,64} callbacks of every asic. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:01:16 +02:00
kernel test robot	116a28ac1f	habanalabs: hl_ts_behavior can be static drivers/misc/habanalabs/common/memory.c:2137:28: warning: symbol 'hl_ts_behavior' was not declared. Should it be static? Fixes: `4d530e7d12` ("habanalabs: convert ts to use unified memory manager") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: kernel test robot <lkp@intel.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 21:00:35 +02:00
Ohad Sharabi	d0b59cf68c	habanalabs/gaudi: add debugfs to fetch internal sync status When Gaudi device is secured the monitors data in the configuration space is blocked from PCI access. As we need to enable user to get sync-manager monitors registers when debugging, this patch adds a debugfs that dumps the information to a binary file (blob). When a root user will trigger the dump, the driver will send request to the f/w to fill a data structure containing dump of all monitors registers. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 20:57:37 +02:00
Ohad Sharabi	f5d85fe05a	habanalabs: rephrase device out-of-memory message The out of memory message is rephrased to more subtle expression as out of memory may be caused by the user in case of, for example, greedy allocation. In addition the user is also being notified by an error code. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 20:57:36 +02:00
Dafna Hirschfeld	c3712c1d7d	habanalabs/gaudi: Use correct sram size macro for debugfs We currently allow accessing the whole SRAM bar size with the macro SRAM_BAR_SIZE, but the actual size of the sram region is the macro SRAM_SIZE which is only a portion of the whole bar size. So when accessing the sram through debugfs, use the macro SRAM_SIZE for the sram size which is the correct macro. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 20:57:36 +02:00
Ohad Sharabi	acbabe63ef	habanalabs: add MMU prefetch to ASIC-specific code This is necessary pre-requisite for future ASIC support, where MMU TLB prefetch is supported. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 20:57:36 +02:00
Yuri Nudelman	4d530e7d12	habanalabs: convert ts to use unified memory manager With the introduction of the unified memory manager infrastructure, the timestamp buffers can be converted to use it. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 20:57:36 +02:00
Yuri Nudelman	babe8e7c04	habanalabs: unified memory manager infrastructure This is a part of overall refactoring attempt to separate nic and the core drivers. Currently, there are 4 different flows, that contain very similar code. These are the ts, nic, hwblocks and cb alloc/map flows. The similar aspect of all these flows is that they all contain a central store, with memory buffers inside, supporting the following set of operations: - Allocate buffer and return handle - Get buffer from the store with handle - Put the buffer (last put releases the buffer) - Map the buffer to the user This patch contains a generic data structure used to implement the above memory buffer store interface. Conversion of the existing code to use the new data structure will follow. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 20:57:36 +02:00
Ofir Bitton	b75cce27d0	habanalabs: save f/w preboot major version We need this property for doing backward compatibility hacks against the f/w. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 20:57:35 +02:00
Jakob Koschel	9138c24244	habanalabs: replace usage of found with dedicated list iterator variable To move the list iterator variable into the list_for_each_entry_() macro in the future it should be avoided to use the list iterator variable after the loop body. To never* use the list iterator variable after the loop it was concluded to use a separate iterator variable instead of a found boolean [1]. This removes the need to use a found variable and simply checking if the variable was set, can determine if the break/goto was hit. Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/ Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 20:57:35 +02:00
Tomer Tayar	687c6b535e	habanalabs: modify dma_mask to be ASIC specific property The required DMA mask is no longer based on input from the F/W, but it is fixed per ASIC according to its address space. As such, the per-ASIC function to get this value can be replaced with a property variable. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 20:57:35 +02:00
Ofir Bitton	c41cb902b2	habanalabs: parse full firmware versions When parsing firmware versions strings, driver should not assume a specific length and parse up to the maximum supported version length. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 20:57:35 +02:00
Tomer Tayar	9d92689ca2	habanalabs/gaudi: avoid resetting max power in hard reset The default max power is deduced from the card type value in the CPU-CP info. This value is then set in the max power variable of the device structure. Getting the CPU-CP info is done as part of the late init phase which is called also during reset. This means that a max power value which is modified via sysfs will be reset during hard reset back to the default value. As the max power is updated in any case during device init in hl_sysfs_init(), this setting in late init can be removed, and the overriding during reset is thus avoided. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 20:57:35 +02:00
Ofir Bitton	b19768d81a	habanalabs/gaudi: increase submission resources In order to allow user to have larger amount of submissions, we increase the DMA and NIC queue depth to 4K. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 20:57:34 +02:00
Ofir Bitton	fdec56c1a4	habanalabs: expose compute ctx status through info ioctl In order for the user to know if he can try and open device, we expose the compute ctx state. The user can now know if the context is used by another process or whether the device is still ongoing through cleanup or reset and will be available soon. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 20:57:34 +02:00
Ofir Bitton	4c3b9f6e3b	habanalabs: add new return code to device fd open In order to be more informative during device open, we are adding a new return code -EAGAIN that indicates device is still going through resource reclaiming and hence it cannot be used yet. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 20:57:34 +02:00
Ohad Sharabi	050a6f349a	habanalabs: add user API to get valid DRAM page sizes Future devices will support multiple device memory page sizes. In addition, an API for the user was added for it to be able to control the device memory allocation page size. This patch is a complementary patch to inform the user of the available page size supported by the device. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 20:57:34 +02:00
Ohad Sharabi	06926dbed2	habanalabs: convert all MMU masks/shifts to arrays There is no need to hold each MMU mask/shift as a denoted structure member (e.g. hop0_mask). Instead converting it to array will result in smaller and more readable code. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 20:57:34 +02:00
Ohad Sharabi	2f8f0de878	habanalabs: change mmu_get_real_page_size to be ASIC-specific This patch breaks the cumbersome implementation of "get real page size" along with it's multiple inner conditions and implement each case (according to the real complexity) inside an ASIC function. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 20:57:33 +02:00
Ohad Sharabi	1359fcbe0f	habanalabs: add DRAM default page size to HW info When using the device memory allocation API the user ought to know what is the default allocation page size. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 20:57:33 +02:00
Ohad Sharabi	378b02dc01	habanalabs: set non-0 value in dram default page size Looking forward we will need to report to the user what is the default page size used. This will be done more conveniently by explicitly updating the property rather than to rely on a "0 meaning default" value. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-22 20:57:33 +02:00
Guenter Roeck	94865e2dcb	habanalabs: Fix test build failures allmodconfig builds on 32-bit architectures fail with the following error. drivers/misc/habanalabs/common/memory.c: In function 'alloc_device_memory': drivers/misc/habanalabs/common/memory.c:153:49: error: cast from pointer to integer of different size Fix the typecast. While at it, drop other unnecessary typecasts associated with the same commit. Fixes: `e8458e20e0` ("habanalabs: make sure device mem alloc is page aligned") Cc: Ohad Sharabi <osharabi@habana.ai> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20220404134859.3278599-1-linux@roeck-us.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-04-04 17:03:04 +02:00
Linus Torvalds	02e2af20f4	Char/Misc and other driver updates for 5.18-rc1 Here is the big set of char/misc and other small driver subsystem updates for 5.18-rc1. Included in here are merges from driver subsystems which contain: - iio driver updates and new drivers - fsi driver updates - fpga driver updates - habanalabs driver updates and support for new hardware - soundwire driver updates and new drivers - phy driver updates and new drivers - coresight driver updates - icc driver updates Individual changes include: - mei driver updates - interconnect driver updates - new PECI driver subsystem added - vmci driver updates - lots of tiny misc/char driver updates There will be two merge conflicts with your tree, one in MAINTAINERS which is obvious to fix up, and one in drivers/phy/freescale/Kconfig which also should be easy to resolve. All of these have been in linux-next for a while with no reported problems. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYkG3fQ8cZ3JlZ0Brcm9h aC5jb20ACgkQMUfUDdst+ykNEgCfaRG8CRxewDXOO4+GSeA3NGK+AIoAnR89donC R4bgCjfg8BWIBcVVXg3/ =WWXC -----END PGP SIGNATURE----- Merge tag 'char-misc-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc and other driver updates from Greg KH: "Here is the big set of char/misc and other small driver subsystem updates for 5.18-rc1. Included in here are merges from driver subsystems which contain: - iio driver updates and new drivers - fsi driver updates - fpga driver updates - habanalabs driver updates and support for new hardware - soundwire driver updates and new drivers - phy driver updates and new drivers - coresight driver updates - icc driver updates Individual changes include: - mei driver updates - interconnect driver updates - new PECI driver subsystem added - vmci driver updates - lots of tiny misc/char driver updates All of these have been in linux-next for a while with no reported problems" * tag 'char-misc-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (556 commits) firmware: google: Properly state IOMEM dependency kgdbts: fix return value of __setup handler firmware: sysfb: fix platform-device leak in error path firmware: stratix10-svc: add missing callback parameter on RSU arm64: dts: qcom: add non-secure domain property to fastrpc nodes misc: fastrpc: Add dma handle implementation misc: fastrpc: Add fdlist implementation misc: fastrpc: Add helper function to get list and page misc: fastrpc: Add support to secure memory map dt-bindings: misc: add fastrpc domain vmid property misc: fastrpc: check before loading process to the DSP misc: fastrpc: add secure domain support dt-bindings: misc: add property to support non-secure DSP misc: fastrpc: Add support to get DSP capabilities misc: fastrpc: add support for FASTRPC_IOCTL_MEM_MAP/UNMAP misc: fastrpc: separate fastrpc device from channel context dt-bindings: nvmem: brcm,nvram: add basic NVMEM cells dt-bindings: nvmem: make "reg" property optional nvmem: brcm_nvram: parse NVRAM content into NVMEM cells nvmem: dt-bindings: Fix the error of dt-bindings check ...	2022-03-28 12:27:35 -07:00
Ofir Bitton	655221c567	habanalabs: remove deprecated firmware states During driver and F/W handshake, driver waits for F/W to reach certain states in order to progress with the boot flow. Some of the states were deprecated a long time ago and were never present on official firmwares. Therefore, let's remove them from the handshake process. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:34:50 +02:00
Tomer Tayar	b0106bc6fe	habanalabs: add an option to delay a device reset Several H/W events can be sent adjacently, even due to a single error. If a hard-reset is triggered as part of handling one of these events, the following events won't be handled. The debug info from these missed events is important, sometimes even more important than the one that was handled. To allow handling these close events, add an option to delay a device reset and use it when resetting due to H/W events. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:06 +02:00
Jiasheng Jiang	9c27896ac1	habanalabs: Add check for pci_enable_device As the potential failure of the pci_enable_device(), it should be better to check the return value and return error if fails. Fixes: `70b2f993ea` ("habanalabs: create common folder") Signed-off-by: Jiasheng Jiang <jiasheng@iscas.ac.cn> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:06 +02:00
farah kassabri	a78b07dcae	habanalabs: Fix reset upon device release bug In case user application was interrupted while some cs still in-flight or in the middle of completion handling in driver, the last refcount of the kernel private data for the user process will not be put in the fd close flow, but in the cs completion workqueue context. This means that the device reset-upon-device-release will be called from that context. During the reset flow, the driver flushes all the cs workqueue to ensure that any scheduled work has run to completion, and since we are running from the completion context we will have deadlock. Therefore, we need to skip flushing the workqueue in those cases. It is safe to do it because the user won't be able to release the device unless the workqueues are already empty. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:06 +02:00
Ohad Sharabi	e8458e20e0	habanalabs: make sure device mem alloc is page aligned Working with MMU that supports multiple page sizes requires that mapping of a page of a certain size will be aligned to the same size (e.g. the physical address of 32MB page shall be aligned to 32MB). To achieve this the gen_poll allocation is now using the "align" variant to comply with the alignment requirements. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:05 +02:00
Oded Gabbay	100fcf1e11	habanalabs/gaudi: add missing handling of NIC related events There are a few events that can arrive from the f/w and without proper handling can cause errors to appear in the kernel log without reason. Add the relevant handling that was missing. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:05 +02:00
Oded Gabbay	26ef1c000b	habanalabs/gaudi: handle axi errors from NIC engines Various AXI errors can occur in the NIC engines and are reported to the driver by the f/w. Add code to print the errors and ack them to the f/w. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:05 +02:00
Ohad Sharabi	f23f280277	habanalabs: allow user to set allocation page size In future ASICs the MMU will be able to work with multiple page sizes, thus a new flag is added to allow the user to set the requested page size. This flag is added since the whole DRAM is allocated for the user and the user also should be familiar with the memory usage use case. As such, the user may choose to "over allocate" memory in favor of performance (for instance- large page allocations covers more memory in less TLB entries). For example: say available page sizes are of 1MB and 32MB. If user wants to allocate 40MB the user can either set page size to 1MB and allocate the exact amount of memory (but will result in 40 TLB entries) or the user can use 32MB pages, "waste" 8MB of physical memory but occupy only 2 TLB entries. Note that this feature will be available only to ASIC that supports multiple DRAM page sizes. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:05 +02:00
Tomer Tayar	59456f4c22	habanalabs: avoid using an uninitialized variable Fix the following compilation warning in hl_cb_ioctl() @ command_buffer.c: warning: ‘device_va’ may be used uninitialized in this function Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:05 +02:00
Tomer Tayar	2908826d09	habanalabs: set max power on device init per ASIC For current devices there is a need to send the max power value to F/W during device init, for example because there might be several card types. In future devices, this info will be programmed in the device's EEPROM and will be read by F/W, and hence the driver should not send it. Modify the sending of the relevant message to be done only for ASIC types that need it. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:05 +02:00
Tomer Tayar	35629bc171	habanalabs: use proper max_power variable for device utilization The max_power variable which is used for calculating the device utilization is the ASIC specific property which is set during init. However, the max value can be modified via sysfs, and thus the updated value in the device structure should be used instead. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:05 +02:00
Tomer Tayar	d01e6cc97b	habanalabs: enable stop-on-error debugfs setting per ASIC On Goya and Gaudi, the stop-on-error configuration can be set via debugfs. However, in future devices, this configuration will always be enabled. Modify the debugfs node to be allowed only for ASICs that support this dynamic configuration. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:05 +02:00
Oded Gabbay	4a0b01fa63	habanalabs: change function to static handle_registration_node() is called directly from the irq handler in irq.c, so it can be static. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:04 +02:00
Oded Gabbay	9e70ac1aa7	habanalabs: add missing include of vmalloc.h Use of vfree(), vmalloc_user(), vmalloc() and remap_vmalloc_range() requires this include in some architectures. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:04 +02:00
Oded Gabbay	57b6f02fff	habanalabs: fix use-after-free bug When the code iterates over the free list of physical pages nodes, it deletes the physical page node which is used as the iterator. Therefore, we need to use the safe version of the iteration to prevent use-after-free. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:04 +02:00
Oded Gabbay	2a835946ee	habanalabs: rephrase error messages in PCI initialization The iATU is an internal h/w machine inside Habana's PCI controller. Mentioning it by name doesn't say anything to the user. It is better to say the PCI controller initialization was not done successfully. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:04 +02:00
Oded Gabbay	960be39db6	habanalabs: fix spelling mistake The name of the property is hints_range_reservation Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:04 +02:00
farah kassabri	9158bf69e7	habanalabs: Timestamps buffers registration Timestamp registration API allows the user to register a timestamp record event which will make the driver set timestamp when CQ counter reaches the target value and write it to a specific location specified by the user. This is a non blocking API, unlike the wait_for_interrupt which is a blocking one. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:04 +02:00
Dani Liberman	b32cd10480	habanalabs: fix race when waiting on encaps signal Scenario: 1. CS which is part of encaps signal has been completed and now executing kref_put to its encaps signal handle. The refcount of the handle decremented to 0, and called the encaps signal handle release function - hl_encaps_handle_do_release. 2. At this point the user starts waiting on the signal, and finds the encaps signal handle in the handlers list and increment the habdle refcount to 1. 3. Immediately after, hl_encaps_handle_do_release removed the handle from the list and free its memory. 4. Wait function using the handle although it has been freed. This scenario caused the slab area which was previously allocated for the handle to be poison overwritten which triggered kernel bug the next time the OS needed to allocate this slab. Fixed by getting the refcount of the handle only in case it is not zero. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:04 +02:00
Dan Carpenter	a8076c47f6	habanalabs: silence an uninitialized variable warning Smatch warns that: drivers/misc/habanalabs/common/command_buffer.c:471 hl_cb_ioctl() error: uninitialized symbol 'device_va'. Which is true, but harmless. Anyway, it's easy to silence this by adding a error check. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:04 +02:00
Oded Gabbay	d2cfd6897c	habanalabs: remove duplicate print We print detailed messages inside the internal ioctl functions. No need to print a generic message at the end, it doesn't add any information. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:04 +02:00
Tomer Tayar	930feb41ef	habanalabs: prevent false heartbeat failure during soft-reset The heartbeat thread is active during soft-reset, and it tries to send messages to CPU-CP core. Within the soft-reset, in the time window in which the device is marked as disabled, any CPU-CP command is "silently" skipped and a success value it returned. However, in addition to the return value, the heartbeat function also checks the F/W result, but because no command is sent in this time window, the result variable won't hold the expected value and we will have a false heartbeat failure. To avoid it, modify the "silent" skip to be done only in hard-reset. The CPU-CP should be able to handle messages during soft-reset. In addition to the heartbeat problem, this should also solve other issues in other flows that send messages during soft-reset and use the F/W result as it w/o being aware to the reset. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:03 +02:00
Oded Gabbay	7a78d4d481	habanalabs: fix race between wait and irq There is a race in the user interrupts code, where between checking the target value and adding the new pend to the list, there is a chance the interrupt happened. In that case, no one will complete the node, and we will get a timeout on it. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:03 +02:00
Oded Gabbay	54faa5607b	habanalabs: fix user interrupt wait when timeout is 0 When timeout is 0, we need to return the busy status in case the target value wasn't reached upon entry to the ioctl. Also return the correct timestamp. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:03 +02:00
Oded Gabbay	9a79e3e4a3	habanalabs: reject host map with mmu disabled This is not something we can do a workaround. It is clearly an error and we should notify the user that it is an error. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:03 +02:00
Oded Gabbay	aa3766def7	habanalabs: expose number of user interrupts Currently we only expose to the user the ID of the first available user interrupt. To make user interrupts allocation truly dynamic, we need to also expose the number of user interrupts. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:03 +02:00
Oded Gabbay	008255ec3d	habanalabs: update to latest f/w specs Copy the latest versions of the f/w specs files. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:03 +02:00
Tomer Tayar	4ae9548de7	habanalabs: add missing error check in sysfs max_power_show Add a missing error check in the sysfs show function for max_power. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:03 +02:00
Dani Liberman	15f8eb1905	habanalabs: fix soft reset flow in case of failure In case of soft reset failure, hard reset should be initiated, but reset flags were not set to enable it, which caused another soft reset followed by another failure. Updated reset flags to enable hard reset flow in case of soft reset failure. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:03 +02:00
Tomer Tayar	aa3e1f12a2	habanalabs: add missing error check in sysfs clk_freq_mhz_show Add a missing error check in the sysfs show functions for clk_max_freq_mhz and clk_cur_freq_mhz_show. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:03 +02:00
Tomer Tayar	ca4c8e4e7b	habanalabs: avoid copying pll data if pll_info_get fails If reading PLL info from F/W fails, the PLL info is not set in the "result" variable, and hence shouldn't be copied to the caller's array. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:02 +02:00
Oded Gabbay	7169f0dfec	habanalabs: don't free phys_pg_pack inside lock Freeing phys_pg_pack includes calling to scrubbing functions of the device's memory, taking locks and possibly even calling reset. This is not something that should be done while holding a device-wide spinlock. Therefore, save the relevant objects on a local linked-list and after releasing the spinlock, traverse that list and free the phys_pg_pack objects. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:02 +02:00
Ohad Sharabi	1dc6cc4b38	habanalabs: duplicate HOP table props to MMU props In order to support several device MMU blocks with different architectures (e.g. different HOP table size) we need to move to per-MMU properties rather than keeping those properties as ASIC properties. Refactoring the code to use "per-MMU proprties" is a major effort. To start making the transition towards this goal but still support taking the properties from ASIC properties (for code that currently uses them) this patch copies some of the properties to the "per-MMU" properties and later, when implementing the per-MMU properties, we would be able to delete the MMU props from the ASIC props. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:02 +02:00
Oded Gabbay	e24a62cb68	habanalabs: there is no kernel TDR in future ASICs In future ASICs, there is no kernel TDR for new workloads that are submitted directly from user-space to the device. Therefore, the driver can NEVER know that a workload has timed-out. So, when the user asks us to wait for interrupt on the workload's completion, and the wait has timed-out, it doesn't mean the workload has timed-out. It only means the wait has timed-out, which is NOT an error from driver's perspective. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:02 +02:00
Rajaravi Krishna Katta	4c01e524b2	habanalabs: sysfs support for fw os version Adds new sysfs entry to display firmware os version /sys/class/habanalabs/hl<n>/fw_os_ver Signed-off-by: Rajaravi Krishna Katta <rkatta@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:02 +02:00
Oded Gabbay	6ba2c0ce26	habanalabs: use common wrapper for MMU cache invalidation We have a common function that wraps the call to the MMU cache invalidation function, which is ASIC-specific. The wrapper checks the return value and prints error if necessary. For consistency, try to use the wrapper when possible. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:02 +02:00
Oded Gabbay	2491533808	habanalabs: remove power9 workaround for dma support We don't need this workaround anymore. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:02 +02:00
Oded Gabbay	b62ff1a412	habanalabs: add vrm version to sysfs infineon version is only applicable to GOYA and GAUDI. For later ASICs, we display the Voltage Regulator Monitor f/w version. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:02 +02:00
Oded Gabbay	be028a3648	habanalabs: rename dev_attr_grp to dev_clk_attr_grp In this attribute group we are only adding clocks. This is in preparation for adding a device specific attribute group which is not related to clocks. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:02 +02:00
Oded Gabbay	7ae439a061	habanalabs: remove asic callback set_pll_profile() Setting PLL profile is the same for all ASICs, except for GOYA. However, because this function is never called from common code, there is no need to have an asic-specific callback function. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:02 +02:00
Oded Gabbay	092a31c526	habanalabs: move more f/w functions to firmware_if.c For better maintainability, try to concentrate all the common functions that communicate with the f/w in firmware_if.c Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:01 +02:00
Oded Gabbay	8d96430784	habanalabs: remove hwmgr.c The two remaining functions in this file belong to firmware_if.c, as they communicate with the firmware. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:01 +02:00
Oded Gabbay	9e2884ce98	habanalabs: get clk is common function Retrieving the clock from the f/w is done exactly the same in ALL our ASICs. Therefore, no real justification for doing it as an ASIC-specific function. The only thing is we need to check if we are running on simulator, which doesn't require ASIC-specific callback. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:01 +02:00
Oded Gabbay	bfbe9cbedd	habanalabs: sysfs functions should be in sysfs.c Move common sysfs store/show functions to sysfs.c file for consistency. This is part of a patch-set to remove hwmgr.c Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:01 +02:00
Ohad Sharabi	2bf338f2ac	habanalabs: make some MMU functions common Some MMU functions can be used by different versions of our MMUs, so move them to be common. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:01 +02:00
Oded Gabbay	d280d5954e	habanalabs: remove ASIC functions of clock gating Now that clock gating is permanently disabled in GAUDI, no need for the ASIC functions of setting and disabling clock gating, as this was a unique scenario in GAUDI. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:01 +02:00
Oded Gabbay	4edb4ffe39	habanalabs/gaudi: disable CGM permanently Due to the need of SynapseAI to configure all TPC engines from a single QMAN, the driver must disable CGM and never allow the user to enable it. Otherwise, the configuration of the TPC engines will fail. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:01 +02:00
Ohad Sharabi	eb85eec858	habanalabs: fix possible memory leak in MMU DR fini This patch fixes what seems to be copy paste error. We will have a memory leak if the host-resident shadow is NULL (which will likely happen as the DR and HR are not dependent). Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:01 +02:00
Tomer Tayar	aff5d9d378	habanalabs: check the return value of hl_cs_poll_fences() As part of handling of the multi-CS wait ioctl, hl_cs_poll_fences() is called in a "while (true)" loop. This function can fail, but the checking of its return value was missed. Add this check and exit the loop in case of a failure. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-02-28 14:22:01 +02:00
Gustavo A. R. Silva	5224f79096	treewide: Replace zero-length arrays with flexible-array members There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. This code was transformed with the help of Coccinelle: (next-20220214$ spatch --jobs $(getconf _NPROCESSORS_ONLN) --sp-file script.cocci --include-headers --dir . > output.patch) @@ identifier S, member, array; type T1, T2; @@ struct S { ... T1 member; T2 array[ - 0 ]; }; UAPI and wireless changes were intentionally excluded from this patch and will be sent out separately. [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://www.kernel.org/doc/html/v5.16/process/deprecated.html#zero-length-and-one-element-arrays Link: https://github.com/KSPP/linux/issues/78 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>	2022-02-17 07:00:39 -06:00
Ofir Bitton	ce80098db2	habanalabs: support hard-reset scheduling during soft-reset As hard-reset can be requested during soft-reset, driver must allow it or else critical events received during soft-reset will be ignored. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 14:42:31 +02:00
Ofir Bitton	42eb2872e0	habanalabs: add a lock to protect multiple reset variables Atomic operations during reset are replaced by a spinlock in order to have the ability to protect more than a single variable. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 14:42:11 +02:00
Ofir Bitton	eb13529191	habanalabs: refactor reset information variables Unify variables related to device reset, which will help us to add some new reset functionality in future patches. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 14:41:28 +02:00
Ohad Sharabi	60bf3bfb5a	habanalabs: handle skip multi-CS if handling not done This patch fixes issue in which we have timeout for multi-CS although the CS in the list actually completed. Example scenario (the two threads marked as WAIT for the thread that handles the wait_for_multi_cs and CMPL as the thread that signal completion for both CS and multi-CS): 1. Submit CS with sequence X 2. [WAIT]: call wait_for_multi_cs with single CS X 3. [CMPL]: CS X do invoke complete_all for both CS and multi-CS (multi_cs_completion_done still false) 4. [WAIT]: enter poll_fences, reinit the completion and find the CS as completed when asking on the fence but multi_cs_done is still false it returns that no CS actually completed 5. [CMPL]: set multi_cs_handling_done as true 6. [WAIT]: wait for completion but no CS to awake the wait context and hence wait till timeout Solution: if CS detected as completed in poll_fences but multi_cs_done is still false invoke complete_all to the multi-CS completion and so it will not go to sleep in wait_for_completion but rather will have a "second chance" to wait for multi_cs_completion_done. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 14:40:25 +02:00
Tomer Tayar	f297a0e9fe	habanalabs: add CPU-CP packet for engine core ASID cfg In some cases the driver cannot configure ASID of some engines due to the security level of the relevant registers. For this a new CPU-CP packet is introduced, which will allow the driver to ask the F/W to do this configuration instead. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 14:39:53 +02:00
Oded Gabbay	519f4ed0a0	habanalabs: replace some -ENOTTY with -EINVAL -ENOTTY is returned in case of error in the ioctl arguments themselves, such as function that doesn't exists. In all other cases, where the error is in the arguments of the custom data structures that we define that are passed in the various ioctls, we need to return -EINVAL. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:10 +02:00
Ofir Bitton	0a63ac769b	habanalabs: fix comments according to kernel-doc Fix missing fields, descriptions not according to kernel-doc style. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:10 +02:00
Ofir Bitton	a7224c2116	habanalabs: fix endianness when reading cpld version Current sysfs implementation does not take endianness into consideration when dumping the cpld version. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:09 +02:00
farah kassabri	b9d31cada7	habanalabs: change wait_for_interrupt implementation Currently the cq counters are allocated in userspace memory, and mapped by the driver to the device address space. A new requirement that is part of new future API related to this one, requires that cq counters will be allocated in kernel memory. We leverage the existing cb_create API with KERNEL_MAPPED flag set to allocate this memory. That way we gain two things: 1. The memory cannot be freed while in use since it's protected by refcount in driver. 2. No need to wake up the user thread upon each interrupt from CQ, because the kernel has direct access to the counter. Therefore, it can make comparison with the target value in the interrupt handler and wake up the user thread only if the counter reaches the target value. This is instead of waking the thread up to copy counter value from user then go sleep again if target value wasn't reached. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:09 +02:00
Ohad Sharabi	e2558f0f84	habanalabs: prevent wait if CS in multi-CS list completed By the original design we assumed that if we "miss" multi CS completion it is of no severe consequence as we'll just call wait_for_multi_cs again. Sequence of events for such scenario: 1. user submit CS with sequence N 2. user calls wait for multi-CS with only CS #N in the list 3. the multi CS call starts with poll of the CSs but find that none completed (while CS #N did not completed yet) 4. now, multi CS #N complete but multi CS CTX was not yet created for the above multi-CS. so, attempt to complete multi-CS fails (as no multi CS CTX exist) 5. wait_for_multi_cs call now does init_wait_multi_cs_completion (and for this create the multi-CS CTX) 6. wait_for_multi_cs wits on completion but will not get one as CS #N already completed To fix the issue we initialize the multi-CS CTX prior polling the fences. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:09 +02:00
Ofir Bitton	86c00b2c36	habanalabs: modify cpu boot status error print As BTL can be replaced by ROM we should modify relevant error print. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:09 +02:00
Ohad Sharabi	d636a932b3	habanalabs: clean MMU headers definitions During the MMU development the MMU header files were left with unclean definitions: - MMU "version specific" definitions that were left in the mmu_general file - unused definitions This patch attempts, where possible, to keep definitions that can serve multiple MMU versions (but that are not tightly bound with specific MMU arch) in the mmu_general header file (e.g. different definitions for number of HOPs). Otherwise, move MMU version specific definitions (e.g. HOPs masks and shifts) to the specific MMU version file. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:09 +02:00
Ofir Bitton	9993f27de1	habanalabs: expose soft reset sysfs nodes for inference ASIC As we allow soft-reset to be performed only on inference devices, having the sysfs nodes may cause a confusion. Hence, we remove those nodes on training ASICs. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:09 +02:00
Ofir Bitton	b5c92b8882	habanalabs: sysfs support for two infineon versions Currently sysfs support dumping a single infineon version, in future asics we will have two infineon versions. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:09 +02:00
Dani Liberman	707c125286	habanalabs: keep control device alive during hard reset Need to allow user retrieve data during reset and afterwards without the need to reopen the device. Did it by seperating the user peocesses list into two lists: 1. fpriv_list which contains list of user processes that opened the device (currently only one). 2. fpriv_ctrl_list which contains list of user processes that opened the control device. This processes in this list shall not be killed during reset, only when the device is suddenly removed from PCI chain. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:09 +02:00
Oded Gabbay	bb099a8051	habanalabs: fix hwmon handling for legacy f/w In legacy f/w that use old hwmon.h file, the values of the hwmon enums are different than the values that are in newer kernels (5.6 and above). Therefore, to support working with those f/w, we need to do some fixup before registering with the hwmon subsystem and also when calling the functions that communicate with the f/w to retrieve sensors information. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:09 +02:00
Ofir Bitton	9acdc21b0b	habanalabs: add current PI value to cpu packets In order to increase cpucp messaging reliability we will add the current PI value to the descriptor sent to F/W. F/W will wait for the PI value as an indication of a valid packet. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:09 +02:00
Oded Gabbay	7363805b8a	habanalabs: remove in_debug check in device open The driver supports only a single user anyway, so there is no point in checking whether we are in_debug state when a user tries to open the device, because if we are in_debug, it means a user is already using the device. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:08 +02:00
Ofir Bitton	7c623ef732	habanalabs: return correct clock throttling period Current clock throttling period returned from driver was wrong due to wrong time comparison. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:08 +02:00
Ohad Sharabi	b02220536c	habanalabs: wait again for multi-CS if no CS completed The original multi-CS design assumption that stream masters are used exclusively (i.e. multi-CS with set of stream master QIDs will not get completed by CS not from the multi-CS set) is inaccurate. Thus multi-CS behavior is now modified not to treat such case as an error. Instead, if we have multi-CS completion but we detect that no CS from the list is actually completed we will do another multi-CS wait (with modified timeout). Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:08 +02:00
Oded Gabbay	5b90e59d55	habanalabs: remove compute context pointer It was an error to save the compute context's pointer in the device structure, as it allowed its use without proper ref-cnt. Change the variable to a flag that only indicates whether there is an active compute context. Code that needs the pointer will now be forced to use proper internal APIs to get the pointer. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:08 +02:00
Oded Gabbay	4337b50b5f	habanalabs: add helper to get compute context There are multiple places where the code needs to get the context's pointer and increment its ref cnt. This is the proper way instead of using the compute context pointer in the device structure. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:08 +02:00
Oded Gabbay	6798676f7e	habanalabs: fix etr asid configuration Pass the user's context pointer into the etr configuration function to extract its ASID. Using the compute_ctx pointer is an error as it is just an indication of whether a user has opened the compute device. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:08 +02:00
Oded Gabbay	357ff3dc9a	habanalabs: save ctx inside encaps signal Compute context pointer in hdev shouldn't be used for fetching the context's pointer. If an object needs the context's pointer, it should get it while incrementing its kref, and when the object is released, put it. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:08 +02:00
Oded Gabbay	a4dd2ecf36	habanalabs: remove redundant check on ctx_fini The driver supports only a single context. Therefore, no need to check if the user context that is closed is the compute context. The user context, if exists, is always the compute context. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:08 +02:00
Oded Gabbay	fee187fe46	habanalabs: free signal handle on failure Fix a bug where in case of failure to allocate idr, the handle's memory wasn't freed as part of the error handling code. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:08 +02:00
Tomer Tayar	b166465452	habanalabs: add missing kernel-doc comments for hl_device fields Add missing kernel-doc comments for the "last_error" and "stream_master_qid_arr" fields of the "hl_device" structure". Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:08 +02:00
Tomer Tayar	d214636be8	habanalabs: pass reset flags to reset thread The reset flags used by the reset thread are currently a mix of hard-coded values and a specific flag which is passed from the context that initiates the reset. To make it easier to pass more flags in future from this context to the reset thread, modify it to pass all the original reset flags to the thread. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:07 +02:00
Dani Liberman	2487f4a281	habanalabs: enable access to info ioctl during hard reset Because info ioctl is used to retrieve data, some of its opcodes may be used during hard reset. Other ioctls should be blocked while device is not operational. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:07 +02:00
Dani Liberman	1880f7acd7	habanalabs: add SOB information to signal submission uAPI For debug purpose, add SOB address and SOB initial counter value before current submission to uAPI output. Using SOB address and initial counter, user can calculate how much of the submmision has been completed. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:07 +02:00
Ohad Sharabi	4fac990f60	habanalabs: skip read fw errors if dynamic descriptor invalid Reporting FW errors involves reading of the error registers. In case we have a corrupted FW descriptor we cannot do that since the dynamic scratchpad is potentially corrupted as well and may cause kernel crush when attempting access to a corrupted register offset. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:07 +02:00
Ofir Bitton	3416d4b59b	habanalabs: handle events during soft-reset Driver should handle events during soft-reset as F/W is not going through reset and it keeps sending events towards host. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:07 +02:00
Ofir Bitton	b13bef2041	habanalabs: change misleading IRQ warning during reset Currently we dump the physical IRQ line index in host if an event is received during reset. This ID is confusing as it means nothing to the user. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:07 +02:00
Tomer Tayar	75a5c44d14	habanalabs: add power information type to POWER_GET packet In new f/w versions, it is required to explicitly indicate the power information type when querying the F/W for power info. When getting the current power level it should be set to power_input. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:07 +02:00
Ofir Bitton	4119433445	habanalabs: add more info ioctls support during reset Some info ioctls can be served even if the device is disabled or in reset. Hence, we enable more info ioctls during reset, as these ioctls do not require any H/W nor F/W communication. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:07 +02:00
Dani Liberman	3beaf903a3	habanalabs: fix race condition in multi CS completion Race example scenario: 1. User have 2 threads that waits on multi CS: - thread_0 waits on QID 0 and uses multi CS context 0. - thread_1 waits on QID 1 and uses multi CS context 1. 2. thread_1 got completion and release multi CS context 1. 3. CS related to multi CS of thread_0 starts executing complete_multi_cs function, the first iteration of the loop completes the multi CS of thread_0, hence multi CS context 0 is released. 4. thread_1 waits on QID 1 and uses multi CS context 0. 5. thread_0 waits on QID 0 and uses multi CS context 1. 6. The second iterattion of the loop (from step 3) starts, which means, start checking multi CS context 1: - multi CS contetxt is being used by thread_0 waiting on QID 0. - The fence of the CS (still CS from step 3) has QID map the same as the multi CS context 1. - multi CS context 1 (thread_0) gets completion on CS that triggered already thread_0 (with multi CS context 0) and is no longer being waited on. Fixed by exiting the loop in complete_multi_cs after getting completion Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:07 +02:00
Ofir Bitton	cad9eb4a8d	habanalabs: move device boot warnings to the correct location As device boot warnings clears the indication from the error mask, they must be located together before the unknown error validation. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:07 +02:00
Oded Gabbay	9eade72e72	habanalabs/gaudi: return EPERM on non hard-reset GAUDI supports only hard-reset. Therefore, this function should return an error of operation not permitted. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:06 +02:00
Oded Gabbay	6c1bad35e6	habanalabs: rename late init after reset function The ASIC-specific soft_reset_late_init() is now called after either soft-reset or reset-upon-device-release. Therefore, it needs a more appropriate name. No need to split it to two functions, as an ASIC either supports soft-reset or reset-upon-device-release. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:06 +02:00
Oded Gabbay	60e0431f41	habanalabs: fix soft reset accounting Reset upon device release is not a soft-reset from user/system point of view. As such, we shouldn't count that reset in the statistics we gather and expose to the monitoring applications. We also shouldn't print soft-reset when doing the reset upon device release. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:06 +02:00
Rajaravi Krishna Katta	d8eb50f31c	habanalabs: Move frequency change thread to goya_late_init Changing the frequency automatically is only done in Goya. In future ASICs this is done inside the firmware. Therefore, move the common code into the Goya specific files. Main changes as part of the commit are: 1. The thread for setting frequency is moved from device_late_init to goya_late_init 2. hl_device_set_frequency is removed from hl_device_open as it is not relevant for other ASICs and for Goya it is taken care by the thread 3. hl_device_set_frequency is renamed as goya_set_frequency Signed-off-by: Rajaravi Krishna Katta <rkatta@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:06 +02:00
Oded Gabbay	ab440d3e39	habanalabs: abort reset on invalid request Hard-reset is mutually exclusive with reset-on-device-release. Therefore, if such a request arrives to the reset function, abort the reset and return an error to the callee. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:06 +02:00
Ofir Bitton	a1b838adb0	habanalabs: fix possible deadlock in cache invl failure Currently there is a deadlock in driver in scenarios where MMU cache invalidation fails. The issue is basically device reset being performed without releasing the MMU mutex. The solution is to skip device reset as it is not necessary. In addition we introduce a slight code refactor that prints the invalidation error from a single location. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:06 +02:00
Ohad Sharabi	6f61e47a68	habanalabs: skip PLL freq fetch Getting the used PLL index with which to send the CPUPU packet relies on the CPUCP info packet. In case CPU queues are not enabled getting the PLL index will issue an error and in some ASICs will also fail the driver load. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:06 +02:00
Oded Gabbay	fe8d70873c	habanalabs: prevent false heartbeat message If a device reset has started, there is a chance that the heartbeat function will fail because the device is disabled at the beginning of the reset function. In that case, we don't want the error message to appear in the log. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:06 +02:00
Dani Liberman	3e55b5dbf9	habanalabs: add support for fetching historic errors A new uAPI is added for debug purposes of the user-space to retrieve errors related data from previous session (before device reset was performed). Inforamtion is filled when a razwi or CS timeout happens and can contain one of the following: 1. Retrieve timestamp of last time the device was opened and razwi or CS timeout happened. 2. Retrieve information about last CS timeout. 3. Retrieve information about last razwi error. This information doesn't contain user data, so no danger of data leakage between users. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:05 +02:00
Ofir Bitton	e2637fdca7	habanalabs: handle device TPM boot error as warning AS TPM error indication is not fatal, driver should dump a warning and continue booting. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:05 +02:00
Ofir Bitton	3eb7754ff4	habanalabs: debugfs support for larger I2C transactions I2C debugfs support is limited to 1 byte. We extend functionality to more than 1 byte by using one of the pad fields as a length. No backward compatibility issues as new F/W versions will treat 0 length as a 1 byte length transaction. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:05 +02:00
Oded Gabbay	e617f5f4c1	habanalabs: make hdev creation code more readable Divide the code into 3 different parts: - Copy kernel parameters - Setting device behaivor per asic - Fixup of various device parameters according to the device behaivor. In addition, remove non-relevant code for upstream (simulator support). Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:05 +02:00
farah kassabri	49c052dad6	habanalabs: add new opcodes for INFO IOCTL Add implementation for new opcodes in the INFO IOCTL: 1. Retrieve the replaced DRAM rows from f/w. 2. Retrieve the pending DRAM rows from f/w. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:05 +02:00
Bharat Jauhari	d4194f2140	habanalabs: refactor wait-for-user-interrupt function Refactor the wait-for-user-interrupt routine to make it more generic for re-use for other user exposed h/w interfaces in future ASICs. Signed-off-by: Bharat Jauhari <bjauhari@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:05 +02:00
farah kassabri	792512459f	habanalabs/gaudi: Fix collective wait bug In Signaling-From-Graph case, the driver didn't set the hw_sob pointer at the right place, which is needed for the cs completion check prior to start sending all the master/slaves jobs to device. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:05 +02:00
Ofir Bitton	1679c7ee58	habanalabs: expand clock throttling information uAPI In addition to the clock throttling reason, user should be able to obtain also the start time and the duration of the throttling event. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:05 +02:00
Dani Liberman	48f3116983	habanalabs: change wait for interrupt timeout to 64 bit In order to increase maximum wait-for-interrupt timeout, change it to 64 bit variable. This wait is used only by newer ASICs, so no problem in changing this interface at this time. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:05 +02:00
Bharat Jauhari	234caa5273	habanalabs: rename reset flags Rename reset flags for better readability as compared to HL_RESET_CAUSE* enum shared with the f/w. Signed-off-by: Bharat Jauhari <bjauhari@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:05 +02:00
Rajaravi Krishna Katta	e84e31a912	habanalabs: add dedicated message towards f/w to set power CPUCP_PACKET_POWER_GET packet type was used for both hl_get_power() and hl_set_power(). To align with other sensor functions hl_set_power() should use CPUCP_PACKET_POWER_SET. This packet will only be used with newer ASICs, so need to add a compatibility flag to the asic properties to indicate whether to use this packet or the GET packet. Signed-off-by: Rajaravi Krishna Katta <rkatta@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:04 +02:00
Bharat Jauhari	1388582264	habanalabs: handle abort scenario for user interrupt In case of device reset, the driver does a force trigger on all waiting users to release them from waiting. However, the driver does not handle error scenario while waiting. hl_interrupt_wait_ioctl() now exits the wait in case of an error with abort status. Signed-off-by: Bharat Jauhari <bjauhari@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:04 +02:00
Ohad Sharabi	5edd95a4ab	habanalabs: don't clear previous f/w indications Once we read indication of whether f/w is doing the reset, we don't want to clear it, until the next time we read this indication. Otherwise, we might be in a state of wrong indication. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:04 +02:00
Ohad Sharabi	f4e7906dbe	habanalabs: use variable poll interval for fw loading Using a variable poll interval for fw loading allows us to support much slower environments (emulation) while changing only a single line in the code, instead of choosing a different interval in each function that polls. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:04 +02:00
Ohad Sharabi	8f82ff75df	habanalabs: adding indication of boot fit loaded Up until now the driver stored indication if Linux was loaded on the device CPU. This was needed in order to coordinate some tasks that are performed by the Linux. In future ASICs, many of those tasks will be performed by the boot fit, so now we need the same indication of boot fit load status. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:04 +02:00
Yuri Nudelman	6ccba9a3bc	habanalabs: partly skip cache flush when in PMMU map flow The PCI MMU cache is two layered. The upper layer, memcache, uses cache lines, the bottom layer doesn't. Hence, after PMMU map operation we have to invalidate memcache, to avoid the situation where the new entry is already in the cache due to its cache line being fully in the cache. However, we do not have to invalidate the lower cache, and here we can optimize, since cache invalidation is time consuming. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:04 +02:00
Yuri Nudelman	82e5169e8a	habanalabs: add enum mmu_op_flags The enum vm_type was abused, used once as a value (indication memory type for map) and once as a flag (for cache invalidation). This makes it hard to add new and still keep it meaningful, hence it is better to split into one enum for values and one for flags. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:04 +02:00
Yuri Nudelman	89d6decdb7	habanalabs: make last_mask an MMU property Currently LAST_MASK is a global, but really it is an MMU implementation specific. We need this change for future ASICs. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:04 +02:00
Yuri Nudelman	f06bad02b5	habanalabs: wrong VA size calculation VA blocks are currently stored in an inconsistent way. Sometimes block end is inclusive, sometimes exclusive. This leads to wrong size calculations in certain cases, plus could lead to a segmentation fault in case mapping process fails in the middle and we try to roll it back. Need to make this consistent - start inclusive till end inclusive. For example, the regions table may now look like this: 0x0000 - 0x1fff : allocated 0x2000 - 0x2fff : free 0x3000 - 0x3fff : allocated Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:04 +02:00
Guy Zadicario	90d283b672	habanalabs/gaudi: fix debugfs dma channel selection Do not use a dma channel for debugfs requested transfer if it's QM is not idle. Signed-off-by: Guy Zadicario <gzadicario@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:03 +02:00
Ohad Sharabi	bfd5110682	habanalabs: revise and document use of boot status flags The boot status flag "SRAM available" can be set by f/w Linux (in the general case) or by f/w uboot (in some specific debug scenario) but never by f/w preboot. Hence, when polling the boot status flags in the preboot stage we do not want to poll on "SRAM Avialable". The special case in which uboot set this flag is when we are running special debug scenario without Linux. In this case, at some point during the boot, the uboot relocates its code to the DRAM and then set the specified flag. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:03 +02:00
Yuri Nudelman	ba3aca31f9	habanalabs: print va_range in vm node debugfs VA range info could assist in debugging VA allocation bugs. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:03 +02:00
Oded Gabbay	4cd454a205	habanalabs/gaudi: recover from CPU WD event There are rare cases where the device CPU's watchdog has expired and as a result, the watchdog reset has happened and the CPU will now move to running its preboot f/w. When that happens, the driver will only know that a heartbeat failure occurred. As a result, the driver will send a message to the CPU's main f/w asking it to reset the device, but because the CPU is now running preboot, it won't respond and the re-initialization process will later fail when trying to load the f/w. The solution is to send the request to the preboot as well, only if the reset was caused because of HB failure. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:03 +02:00
Ohad Sharabi	c9d1383c75	habanalabs: modify wait for boot fit in dynamic FW load In the dynamic FW load protocol the boot status is updated to "Ready to Boot" once uboot is active. Polling on other boot status values is a residue of code duplication from the static protocol and should be removed. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-12-26 08:59:03 +02:00
Greg Kroah-Hartman	16b0314aa7	dma-buf: move dma-buf symbols into the DMA_BUF module namespace In order to better track where in the kernel the dma-buf code is used, put the symbols in the namespace DMA_BUF and modify all users of the symbols to properly import the namespace to not break the build at the same time. Now the output of modinfo shows the use of these symbols, making it easier to watch for users over time: $ modinfo drivers/misc/fastrpc.ko \| grep import import_ns: DMA_BUF Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: dri-devel@lists.freedesktop.org Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Acked-by: Christian König <christian.koenig@amd.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Sumit Semwal <sumit.semwal@linaro.org> Acked-by: Alex Deucher <alexander.deucher@amd.com> Link: https://lore.kernel.org/r/20211010124628.17691-1-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-10-25 14:53:08 +02:00
Dani Liberman	b2faac3887	habanalabs: refactor fence handling in hl_cs_poll_fences To avoid checking if fence exists multipled times, changed fence handling to depend only on the fence status field: Busy, which means CS still did not completed : Add its QID so multi CS wait on its completion. Finished, which means CS completed and fence exists: Raise its completion bit if it finished mcs handling and update if necessary the earliest timestamp. Gone, which means CS already completed and fence deleted: Update multi CS data to ignore timestamp and raise its completion bit. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:48 +03:00
Omer Shpigelman	fae132632c	habanalabs: context cleanup cosmetics No need to check the return value if the following action is the same for both cases. In addition, now that hl_ctx_free() doesn't print if the context is not released, its name can be misleading as the context might stay alive after it is executed with no indication for that. Hence we can discard it and simply put the refcount. Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:48 +03:00
Yuri Nudelman	d2f5684b8f	habanalabs: simplify wait for interrupt with timestamp flow Remove the flag that determines whether to take a timestamp once the interrupt arrives. Instead, always take the timestamp once per interrupt. This is a must for the user-space to measure its graph operations to evaluate the graph computation time. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:47 +03:00
Moti Haimovski	4a18dde5e4	habanalabs: initialize hpriv fields before adding new node When adding a new node to the hpriv list, the driver should initialize its fields before adding the new node. Otherwise, there may be some small chance of another thread traversing that list and accessing the new node's fields without them being initialized. Signed-off-by: Moti Haimovski <mhaimovski@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:47 +03:00
Rajaravi Krishna Katta	024b7b1d6d	habanalabs: Unify frequency set/get functionality Make the frequency set/get functionality common to all ASICs. This makes more code reusable when adding support for newer ASICs. Signed-off-by: Rajaravi Krishna Katta <rkatta@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:47 +03:00
Vegard Nossum	f6fb34390c	habanalabs: select CRC32 Fix the following build/link error by adding a dependency on the CRC32 routines: ld: drivers/misc/habanalabs/common/firmware_if.o: in function `hl_fw_dynamic_request_descriptor': firmware_if.c:(.text.unlikely+0xc89): undefined reference to `crc32_le' Fixes: `8a43c83fec` ("habanalabs: load boot fit to device") Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:47 +03:00
Tomer Tayar	db1a8dd916	habanalabs: add support for dma-buf exporter Implement the calls to the dma-buf kernel api to create a dma-buf object backed by FD. We block the option to mmap the DMA-BUF object because we don't support DIRECT_IO and implicit P2P. We only implement support for explicit P2P through importing the FD of the DMA-BUF. In the export phase, we provide to the DMA-BUF object an array of pages that represent the device's memory area. During the map callback, we convert the array of pages into an SGT. We split/merge the pages according to the dma max segment size of the importer. To get the DMA address of the PCI bar, we use the dma_map_resources() kernel API, because our device memory is not backed by page struct and this API doesn't need page struct to map the physical address to a DMA address. We set the orig_nents member of the SGT to be 0, to indicate to other drivers that we don't support CPU mappings. Note that in Habanalabs's ASICs, the device memory is pinned and immutable. Therefore, there is no need for dynamic mappings and pinning callbacks. Also note that in GAUDI we don't have an MMU towards the device memory and the user works on physical addresses. Therefore, the user doesn't pass through the kernel driver to allocate memory there. As a result, only for GAUDI we receive from the user a device memory physical address (instead of a handle) and a size. We check the p2p distance using pci_p2pdma_distance_many() and refusing to map dmabuf in case the distance doesn't allow p2p. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Gal Pressman <galpress@amazon.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:47 +03:00
Dani Liberman	81f8582ec4	habanalabs: fix NULL pointer dereference When polling fences for multi CS, it is possible that fence is no longer exists (its corresponding CS completed and the fence was deleted) but we still accessing its parameters, causing NULL pointer dereference. Fixed by checking if fence exits before accessing its parameters. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:47 +03:00
Dani Liberman	ea6eb91c09	habanalabs: fix race condition in multi CS completion Race condition occurs when CS fence completes and multi CS did not completed yet, while waiting for multi CS ends and returns indication to user that the CS completed. Next wait for multi CS may be triggered by previous multi CS completion without any current CS completed, causing an error. Example scenario : 1. User do multi CS wait for CSs 1 and 2 on master QID 0 2. CS 1 and 2 reached the "cs release" code. The thread of CS 1 completed both the CS and multi CS handling but the completion thread of CS 2 completed the CS but still did not executed complete_multi_cs (note that in CS completion the sequence is to first do complete all for the CS and then another complete all to signal the multi_cs) 3. User received indication that CS 1 and 2 completed (since we check the CS fence and both indicated as completed) and immediately waits on CS 3 and 4, also on master QID 0. 4. Completion thread of CS2 executed complete_multi_cs before completion of CS 3 and 4 and so will trigger the multi CS wait of CSs 3 and 4 as they wait on master QID 0. This will trigger multi CS completion although none of its current CS has been completed. Fixed by adding multi CS complete handling indication for each CS. CS will be marked to the user as completed only if its fence completed and multi CS handling is done. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:47 +03:00
Oded Gabbay	1d16a46b1a	habanalabs: use only u32 In the kernel it is common to use u32 and not uint32_t. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:47 +03:00
Oded Gabbay	efc6b04b86	habanalabs: update firmware files Update the firmware headers to the latest version Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:47 +03:00
Bharat Jauhari	10cab81d1c	habanalabs: bypass reset for continuous h/w error event There may be a situation where drivers receives continuous fatal H/W error events from FW immediately post reset cycle. This may be due to some fault on the silicon itself. In such case its better to bypass reset cycle so we won't be stuck in endless loop of resets. This commit bypasses reset request in case driver received two back to back FW fatal error before first occurrence of heartbeat event. Signed-off-by: Bharat Jauhari <bjauhari@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:47 +03:00
Yuri Nudelman	f05d17b226	habanalabs: take timestamp on wait for interrupt Taking an accurate timestamp in a close proximity of the interrupt is required for user side statistics management. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:46 +03:00
Oded Gabbay	c1904127ce	habanalabs: prevent race between fd close/open The driver allows only a single process to open a device's FD at any single time. This is done by checking "hdev->compute_ctx" under mutex. Therefore, to prevent a race between the moment a user closes it's FD and when another user tries to open the device, we need to make sure that clearing this variable is the very last thing that is done in the code of the FD's release. I'm moving the idle check before clearing this variable and the "reset on device release". btw, if the reset happens it will prevent any other user from opening the device until the reset is finished. An important thing to note is that we need to remove the user process that is closing the device from the process list BEFORE calling the reset function. That is to prevent a case where the reset code will try to kill that user process and it is unnecessary as the process doesn't hold any device/driver resources anymore. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:46 +03:00
Oded Gabbay	1282dbbd29	habanalabs: refactor reset log message Reset to the device is not necessarily due to an error, so print it as info instead of error. In addition, print the type of reset we are doing: - reset of the entire device (aka hard reset) - reset of the device after user have released it (less than hard reset) - lighter reset of an inference device (aka soft reset) Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:46 +03:00
Oded Gabbay	a00f1f571e	habanalabs: define soft-reset as inference op Soft-reset is the procedure where we reset only the compute/DMA engines of the device, without requiring the current user-space process to release the device. This type of reset can happen if TDR event occurred (a workload got stuck) or by a root request through sysfs. This is only relevant for inference ASICs, as there is no real-world use-case to do that in training, because training runs on multiple devices. In addition, we also do (in certain ASICs) a reset upon device release. That reset uses the same code as the soft-reset. Therefore, to better differentiate between the two resets, it is better to rename the soft-reset support as "inference soft-reset", to make the code more self-explanatory. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:46 +03:00
Yuri Nudelman	dd08335fb9	habanalabs: fix debugfs device memory MMU VA translation The translation in debugfs of device memory MMU virtual addresses was wrong as it did not take into consideration the fact that the page sizes there can be not power of 2. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:46 +03:00
Ofir Bitton	d62b9a6976	habanalabs: add support for a long interrupt target value In order to avoid user target value wraparound, we modify the current interface so user will be able to wait for an 8-byte target value rather than a 4-byte value. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:46 +03:00
Ofir Bitton	027d53b03c	habanalabs: remove redundant cs validity checks During TDR handling, we check multiple times if CS is valid. No need to perform this check as CS must be valid at all time during the TDR handling. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:46 +03:00
Rajaravi Krishna Katta	2b28485d0a	habanalabs: enable power info via HWMON framework Add support to retrieve following power info via HWMON: - instantaneous power value - highest value since last reset - reset the highest place holder Signed-off-by: Rajaravi Krishna Katta <rkatta@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:46 +03:00

... 2 3 4 5 6 ...

1090 Commits