OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Oded Gabbay	259cee1c24	habanalabs: eliminate aggregate use warning When doing sizeof() and giving as argument a dereference of a pointer-to-a-pointer object, clang will issue a warning. Eliminate the warning by passing struct <name>* Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-20 15:52:27 +03:00
farah kassabri	6b9b9e244f	habanalabs: remove some f/w descriptor validations To be forward-backward compatible with the firmware in the initial communication during preboot, we need to remove the validation of the header size. This will allow us to add more fields to the lkd_fw_comms_desc structure. Instead of validating the header size, we just print a warning when a mismatch in the descriptor is detected, and we calculate the CRC based on the descriptor size reported by the firmware instead of calculating it ourselves. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-20 15:46:45 +03:00
Oded Gabbay	4f3ce5e0d0	habanalabs: failure to open device due to reset is debug level If the user wants to open the device, and the device is currently in reset, the user will get an error from the open(). We don't need to display an error in the dmesg for that as it is not a real error and we can spam the kernel log with this message. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-19 15:09:04 +03:00
Dani Liberman	0c88760f8f	habanalabs/gaudi2: add secured attestation info uapi User will provide a nonce via the ioctl, and will retrieve secured attestation data of the boot, generated using given nonce. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-19 15:08:40 +03:00
Dani Liberman	97a78e3d8e	habanalabs: rename error info structure As a preparation for adding more errors to it, change to more suitable name. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-19 15:08:39 +03:00
Oded Gabbay	82736b063f	habanalabs: MMU invalidation h/w is per device The code used the mmu mutex to protect access to the context's page tables and invalidation of the MMU cache. Because pgt are per context, the mmu mutex was a member of the context object. The problem is that the device has a single MMU invalidation h/w (per MMU). Therefore, the mmu mutex should not be a property of the context but a property of the device. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-19 15:08:39 +03:00
Ohad Sharabi	76925f55c9	habanalabs: fix resetting the DRAM BAR The current code does not take into account the new DRAM region base, so the calculated address is wrong and can lead to a crash. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-19 15:08:38 +03:00
Ofir Bitton	0626fa1a4d	habanalabs: add support for new cpucp return codes Firmware now responds with a more detailed cpucp return codes. Driver can now distinguish between error and debug return codes. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-19 15:08:38 +03:00
Tomer Tayar	a0fc8688c0	habanalabs/gaudi2: read F/W security indication after hard reset F/W security status might change after every reset. Add the reading of the preboot status to the hard reset sequence, which among others reads this security indication. As this preboot status reading includes the waiting for the preboot to be ready, it can be removed from the CPU init which is done in a later stage. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-19 15:08:38 +03:00
farah kassabri	62adba0a55	habanalabs: fix possible hole in device va cb_map_mem() uses gen_pool_alloc() to get virtual address for mapping a CB. The mapping is done in chunks of page size, so if the CB size is larger, it is possible that the allocated virtual addresses won't be consecutive. User retrieves this device VA which returns the virtual address in the first va_block. If there is a "hole" in the virtual addresses, user can configure a HW block with a bad device VA. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-19 15:08:38 +03:00
Ofir Bitton	f5ec364c9e	habanalabs: send device activity in a proper context 'Device activity open packet' should be sent outside of mutex as there is no real necessity for a lock. In addition 'device activity close packet' should be sent upon an actual release of the device. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-19 15:08:37 +03:00
farah kassabri	4745b2f0d0	habanalabs: send device active message to f/w As part of the RAS that is done by the f/w, we should send a message to the f/w when a user either acquires or releases the device. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-19 15:08:37 +03:00
Ofir Bitton	d155df4f62	habanalabs: ignore EEPROM errors during boot EEPROM errors reported by firmware are basically warnings and should not fail the boot process. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-18 13:29:53 +03:00
Ofir Bitton	c38f72370b	habanalabs: perform context switch flow only if needed Except Goya, none of our ASICs require context switch flow, hence we enable this flow only where it is needed. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-18 13:29:53 +03:00
Dafna Hirschfeld	262042af13	habanalabs: set command buffer host VA dynamically Set the addresses for userspace command buffer dynamically instead of hard-coded. There is no reason for it to be hard-coded. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-18 13:29:53 +03:00
Ohad Sharabi	0263256791	habanalabs: trace DMA allocations This patch adds tracepoints in the code for DMA allocation. The main purpose is to be able to cross data with the map operations and determine whether a memory violation occurred, for example freeing a DMA allocation before unmapping it from device memory. To achieve this the DMA alloc/free code flows were refactored so that a single DMA tracepoint will catch many flows. To get a better understanding of what happened in the DMA allocations, the real allocating function is added to the trace as well. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-18 13:29:53 +03:00
Ohad Sharabi	4eb87df3d0	habanalabs: trace MMU map/unmap page This patch utilizes the defined tracepoint to trace the MMU's page map/unmap operations. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-18 13:29:53 +03:00
Ohad Sharabi	191a4443c3	habanalabs: define trace events This patch adds trace events for habanalabs driver to gain all the benefits such an infrastructure can supply. The following events were added: - MMU map/unmap: to be able to track driver's memory allocations - DMA alloc/free: to track our DMA allocations. The above trace points in conjunction will help us map the device memory usage as well as track memory violations. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Acked-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-18 13:29:53 +03:00
Tomer Tayar	fb855768d3	habanalabs: fix calculation of DRAM base address in PCIe BAR The calculation of the device DRAM base address before setting the relevant PCIe BAR to point at it, has an assumption that this BAR is used to access only the DRAM, and thus the covered DRAM size is a power of 2. In future ASICs it is not necessarily true, so need to update the calculation to support also a non-power-of-2 size. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-18 13:29:52 +03:00
Dafna Hirschfeld	46e49f434f	habanalabs: if map page fails don't try to unmap it The original code tried to unmap a page that was not mapped as part of the map page error path. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-18 13:29:52 +03:00
Omer Shpigelman	273190d420	habanalabs: add cdev index data member Instead of recalculating the cdev index, store it in a dedicated data member. This data member is intended to be passed to other drivers using the auxiliary bus infra and hence this new data member is necessary in case that the calculation is changed in the future. Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-18 13:29:52 +03:00
Dafna Hirschfeld	75bc3986fc	habanalabs: fix bug when setting va block size the size of a block is always 'block->end - block->start + 1' Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-18 13:29:52 +03:00
Ofir Bitton	38a4358009	habanalabs: expose device security status using info ioctl In order for the user to know if he is running on a secured device or not, we add it also to the hw_ip info ioctl. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-18 13:29:52 +03:00
Ofir Bitton	6457271f64	habanalabs: expose device security status through sysfs In order for the user to know if he is running on a secured device or not, a sysfs node is added. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-18 13:29:52 +03:00
Ofir Bitton	107a5bcc0b	habanalabs: remove secured PCI IDs Secured PCI ID will not be supported in new asics because the security status can always be read from the f/w. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-18 13:29:52 +03:00
Tomer Tayar	65d3c63513	habanalabs: fix H/W block handling for partial unmappings Several munmap() calls can be done on a mapped H/W block that has a larger size than a page size. Releasing the object should be done only when the entire mapped range is unmapped. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-18 13:29:51 +03:00
Dani Liberman	07ecaa0d85	habanalabs: unify hwmon resources clean up Since hwmon fini code is common for all asics, unified it to common function. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-18 13:29:51 +03:00
Tal Cohen	194e515c79	habanalabs/gaudi2: new API to control engine cores running mode The current flow of halting the engine cores is implemented by command buffers built by the user space and sent towards the Driver. This current flow is broken since the user space does not know when the cores actually halt as sending a workload is async op. Therefore the application can not free the memory that is mapped to the engine cores. This new API allows the user space to control the running mode. The API call is sync (returns after the cores are set to the requested mode). Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-18 13:29:51 +03:00
Oded Gabbay	07056f58e4	habanalabs: remove left-over code from bring-up There is some left-over code from the gaudi2 bring-up that wasn't removed so far. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-18 13:29:51 +03:00
farah kassabri	6419b5232e	habanalabs/gaudi2: change device f/w security check On Gaudi2 the f/w always configures the PCIe iATU and allows access to scratchpad registers. Therefore, we can know if the f/w is secured by reading a status bit from the f/w registers. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-18 13:29:51 +03:00
Oded Gabbay	ab6c08f0d5	habanalabs: move common function out of debugfs.c A common function that is called from multiple places can't be located in debugfs.c because that file is only compiled if debugfs is enabled in the kernel config file. This can lead to an undefined symbol compilation error. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-18 13:29:51 +03:00
Tomer Tayar	f0d4944c20	habanalabs: add a missing lock for in_reset indication Add a missing lock in hl_device_resume() when it assigns a value to the 'in_reset' indication. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-18 13:29:51 +03:00
Tomer Tayar	5f46217221	habanalabs: fix vma fields assignments order in hl_hw_block_mmap() In hl_hw_block_mmap(), the vma's 'vm_private_data' and 'vm_ops' fields are assigned before filling the content of the private data. In between there is a call to the ASIC hw_block_mmap() function, and if it fails, the vma close function will be called with a bad private data value. Fix the order of assignments to avoid this issue. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-18 13:29:51 +03:00
Tomer Tayar	7fa6c0fe8b	habanalabs: avoid returning a valid handle if map_block() fails map_block() sets the block id handle even if get_hw_block_id() fails, and in this case it uses block id 0 which might be a valid id. Modify it to set the handle only if get_hw_block_id() succeeds. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-18 13:29:50 +03:00
Tal Cohen	0c876b47a5	habanalabs: fix command submission sanity check When a CS is submitted, the ioctl handler checks the CS flags and performs a sanity check, according to its value. As new CS flags are added, the sanity check needs to be updated according to the new flags. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-18 13:29:50 +03:00
Dani Liberman	f018c54e3d	habanalabs: add uapi to retrieve engines status Currently, to get the engines status, the user needs to read a debugfs file with root permissions. This new uapi allows user space apps to retrieve the status, so for example, in case of failure, the status can be retrieved immediately by the application itself, which runs without root permissions. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-18 13:29:50 +03:00
Oded Gabbay	5f92c1e296	habanalabs: remove all kdma locks We don't use KDMA concurrently in the driver. The only use is through debugfs and we don't protect concurrent access through it. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-18 13:29:50 +03:00
Ohad Sharabi	0c819c9a04	habanalabs: wrap macro arg with parentheses The macro argument <val> is cast-ed to u32 in some of the places. Because this arg can be some arithmetic computation (e.g. address + offset) the cast should be on the whole expression. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-18 13:29:50 +03:00
Bharat Jauhari	f25a72b8b9	habanalabs: fix spelling mistakes Cosmetic commit, no logical changes. It just fixes the spelling mistakes. Signed-off-by: Bharat Jauhari <bjauhari@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-18 13:29:50 +03:00
Oded Gabbay	cd6b0cea89	habanalabs/gaudi: increase default cs timeout to 10 minutes In order to improve scalability and reduce host overhead, it is better to increase the default TDR timeout of Gaudi1 from 30 seconds to 10 minutes. This will allow the DL Framework (e.g. PyTorch, TensorFlow) to remove the host sync they are using now and improve overall performance on scaleout training. Note that one can always set the timeout to a custom value via a kernel module parameter given during driver load. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-18 13:29:49 +03:00
Ohad Sharabi	913bd4179b	habanalabs: add return code field to module iterator Up until now the module iterator called void callback functions and so caller activating callback that may fail suffered from 2 issues: 1. The need to "plant" return called in the private data. This is a drawback since the iterator itself should not be aware of the private data of the caller. 2. Due to 1 even in a failure the iterator would keep iterating instead of break upon error. To overcome this an optional rc field added to the iterator context. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-18 13:29:49 +03:00
Ofir Bitton	bc9b271e6c	habanalabs: rename non_hard_reset to compute_reset In order to be more explicit we should use the term compute_reset for describing the reset in which only the compute engines get reset. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-18 13:29:49 +03:00
Dani Liberman	71386e11f2	habanalabs: removed seq_file parameter from is_idle asic functions Change is_idle functions so it would be more usable outside debugfs. Do this by replacing seq_file parameter with regular string. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-09-18 13:29:49 +03:00
Oded Gabbay	0b0ae02440	habanalabs: rename soft reset to compute reset Doing compute reset can be the traditional inference soft reset that is supported only in Goya. Or it can be the new reset upon device release, which is supported in Gaudi2 and above. Therefore, wherever suitable, use the terminology of compute reset instead of soft reset. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:31 +03:00
Oded Gabbay	e3b20f3ee4	habanalabs: add status of reset after device release The user might want to know the device is in reset after device release, which is not an erroneous event as a regular reset. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:31 +03:00
Oded Gabbay	bd4a338886	habanalabs: fix update of is_in_soft_reset reset_info.is_in_soft_reset should be updated both before in_reset and inside the spin lock of the reset info structure. The reasons are: - When we are inside soft reset, it implies we are in reset. Therefore, if someone checks if we are in soft reset, he can deduce we are in reset, while the opposite is not correct and might be misleading. - Both these flags are changed together so they must be changed inside the reset info spinlock. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:31 +03:00
Ofir Bitton	08f0aa9548	habanalabs: expose only valid debugfs nodes In case security is enabled on the device, some debugfs nodes will fail. Hence, we do not expose them. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:31 +03:00
Tomer Tayar	af2e650b36	habanalabs: add a value field to hl_fw_send_pci_access_msg() For gaudi2 we need to send a value to F/W as part of the PCI_ACCESS packet. As a preparation, modify hl_fw_send_pci_access_msg() to have a 'value' field. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:30 +03:00
Ohad Sharabi	20cd88a775	habanalabs: fixes to the poll-timeout macros - use conventional internal macro variables (double underscore prefix) - adjust address casting - on register poll using ELBI use ELBI read rather than BAR read on error condition - remove unused macro Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:30 +03:00
Oded Gabbay	b596ad6f11	habanalabs: initialize variable explicitly Fix warning of "warning: ‘old_base’ may be used uninitialized in this function" Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2022-07-12 09:09:30 +03:00

578 Commits