Commit Graph

1090 Commits

Author SHA1 Message Date
Jiapeng Chong 18b6731538 habanalabs: Fix kernel-doc
Fix the following W=1 kernel warnings:

drivers/misc/habanalabs/common/mmu/mmu_v1.c:425: warning: expecting
prototype for hl_mmu_fini(). Prototype was for hl_mmu_v1_fini() instead.

drivers/misc/habanalabs/common/mmu/mmu_v1.c:449: warning: expecting
prototype for hl_mmu_ctx_init(). Prototype was for hl_mmu_v1_ctx_init()
instead.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:21 +03:00
Jiapeng Chong 858e6d4085 habanalabs: Fix kernel-doc
Fix the following W=1 kernel warnings:

drivers/misc/habanalabs/common/pci/pci.c:454: warning: expecting
prototype for hl_fw_fini(). Prototype was for hl_pci_fini() instead.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:21 +03:00
Dan Carpenter a43a9f6777 habanalabs: fix double unlock on error in map_device_va()
If hl_mmu_prefetch_cache_range() fails then this code calls
mutex_unlock(&ctx->mmu_lock) when it's no longer holding the mutex.

Fixes: 9e495e2400 ("habanalabs: do MMU prefetch as deferred work")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:21 +03:00
Tal Cohen 90de680526 habanalabs: use separate structure info for each error collect data
Create separate info structure for each error type.
The structures shall be used inside the large structure that contains
the last session error.
This is more scalable for adding more errors in the future.

Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:21 +02:00
Yuri Nudelman f873a27fd5 habanalabs: fix missing handle shift during mmap
During mmap operation on the unified memory manager buffer, the vma
page offset is shifted to extract the handle value. Due to a typo, it
was not shifted back at the end. That could cause the offset to be
modified after mmap operation, that may affect subsequent operations.
In addition, in allocation flow, in case of out of memory error, idr
would not be correctly destroyed, again because of a missing shift.

Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:21 +02:00
Ohad Sharabi e31dd9362f habanalabs: remove hdev from hl_ctx_get args
This argument is unused by the function.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:21 +02:00
Ohad Sharabi 9e495e2400 habanalabs: do MMU prefetch as deferred work
When user requests to prefetch the MMU translations, the driver will
not block the user until prefetch is done.
Instead, the prefetch work will be delegated to a WQ which will do it
in the background.
This way, the prefetch may progress without blocking the user at all.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:21 +02:00
Yuri Nudelman 83617f5a87 habanalabs: order memory manager messages
Changing format of memory manager messages to make it more readable. In
addition, reducing the priority of a warning on missing handle during
put. This scenario is not an indication of a problem and may happen in
a legal flow, when handle is put from multiple flows. For example, in
timeout and completion.

Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:21 +02:00
Oded Gabbay 804d514d79 habanalabs: return -EFAULT on copy_to_user error
If copy_to_user failed in info ioctl, we always return -EFAULT so the
user will know there was an error.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:20 +02:00
Oded Gabbay 49d2a8af97 habanalabs: use NULL for eventfd
eventfd is pointer. As such, it should be initialized to NULL, not to 0.

In addition, no need to initialize it after creation because the
entire structure is zeroed-out. Also, no need to initialize it before
release because the entire structure is freed.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:20 +02:00
Oded Gabbay 368b0b4fd6 habanalabs: update firmware header
Update cpucp_if.h to latest version.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:20 +02:00
Tal Cohen 422ef17103 habanalabs: add support for notification via eventfd
The driver will be able to send notification events towards
a user process, using user's registered event file descriptor.
The driver uses the notification mechanism to inform the
user about an occurred event.
A user thread can wait until a notification is received from
the driver.

The driver stores the occurred event until the user reads it,
using HL_INFO_GET_EVENTS - new ioctl opcode in the INFO ioctl.

Gaudi specific implementation includes sending a notification
on a TPC assertion event that is received from f/w.

Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:20 +02:00
Yuri Nudelman f2daa2d97e habanalabs: add topic to memory manager buffer
Currently, buffers from multiple flows pass through the same infra.
This way, in logs, we are unable to distinguish between buffers that
came from separate flows.
To address this problem, add a "topic" to buffer behavior
descriptor - a string identifier that will be used to identify in logs
the flow this buffer relates to.

Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:20 +02:00
Dani Liberman c37803388c habanalabs: handle race in driver fini
Scenario:

1. During hard reset, driver executes device_kill_open_processes.
2. Drivers file descriptor is not closed yet (user process is alive),
   hence we are starting loop on all open file descriptors.
3. Just before getting task struct of user process, according to
   pid, SIGKILL is sent to the user process, hence get_pid_task
   fails, driver prints a warning and device_kill_open_processes
   returns an error.
4. Returned error causing driver fini do disable the device object
   of the process which causes a kernel crash.

The fix is to handle this case not as an error and continue fini flow
as normal, since the killed process (by the SIGKILL) will release its
resources just like it will do when the driver sends him the sigkill.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:20 +02:00
Dafna Hirschfeld 0688474eda habanalabs: add device memory scrub ability through debugfs
Add the ability to scrub the device memory with a given value.
Add file 'dram_mem_scrub_val' to set the value
and a file 'dram_mem_scrub' to scrub the dram.

This is very important to help during automated tests, when you want
the CI system to randomize the memory before training certain
DL topologies.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:20 +02:00
Yuri Nudelman 829ec038c9 habanalabs: use unified memory manager for CB flow
With the new code required for the flow added, we can now switch
to using the new memory manager infrastructure, removing the old code.

Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:19 +02:00
Yuri Nudelman dc653c36c9 habanalabs: unified memory manager new code for CB flow
This commit adds the new code needed for command buffer flow using the
new unified memory manager, without changing the actual functionality.

Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:19 +02:00
Ofir Bitton 2db04a6826 habanalabs/gaudi: set arbitration timeout to a high value
In certain workloads, arbitration timeout might expire although
no actual issue present. Hence, we set timeout to a very high value.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:19 +02:00
Yuri Nudelman ff086c186b habanalabs: add put by handle method to memory manager
Putting object by its handle and not by object pointer is useful in
some finalization flows that do not have object pointer available.
It eliminates the need to first get the object and then perform
put twice.

Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:19 +02:00
Yuri Nudelman 4e63ce6af6 habanalabs: hide memory manager page shift
The new unified memory manager uses page offset to pass buffer handle
during the mmap operation. One problem with this approach is that it
requires the handle to always be divisible by the page size, else, the
user would not be able to pass it correctly as an argument to the mmap
system call.

Previously, this was achieved by shifting the handle left after alloc
operation, and shifting it right before get operation. This was done in
the user code. This creates code duplication, and, what's worse,
requires some knowledge from the user regarding the handle internal
structure, hurting the encapsulation.

This patch encloses all the page shifts inside memory manager functions.
This way, the user can take the handle as a black box, and simply use
it, without any concert about how it actually works.

Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:19 +02:00
farah kassabri de3484dfaa habanalabs: Add separate poll interval value for protocol
Currently we're using the same poll interval value for both
COMMs protocol(for sending a command and waits for an ACK)
and the device CPU boot phases status waits.
On COMMs protocol this interval should be much lower than the
device CPU boot which may take long time to change status.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:19 +02:00
Dani Liberman b0b09b7a8b habanalabs: use get_task_pid() to take PID
find_get_pid() isn't good in case the user process was run inside
docker.

As a result, we didn't had the PID and we couldn't kill the user
process in case the device got stuck and we needed to reset the
device.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:18 +02:00
Ohad Sharabi 5d1a0de2c7 habanalabs: add prefetch flag to the MAP operation
This patch let the user decide whether the translations done in the
page tables will be fetched directly to the STLB right after the map.

We want to let the user control whether to perform prefetch upon map
operation.

To do so a memory flag was added, to be used in the MAP ioctl, called
HL_MEM_PREFETCH and if set- the mappings will be fetched directly to
the STLB after map operation.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:18 +02:00
Robin Murphy 77c97a7ea8 habanalabs: Stop using iommu_present()
Even if an IOMMU might be present for some PCI segment in the system,
that doesn't necessarily mean it provides translation for the device
we care about. Replace iommu_present() with a more appropriate check.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:18 +02:00
Moti Haimovski 0ff1d6f8f5 habanalabs: support debugfs Byte access to device DRAM
The habanalabs HW requires memory resources to be used by its
internal hardware structures. These structures are allocated and
initialized by the driver. We would like to use the device HBM for
that purpose. This memory is io-remapped and accessed using the
writel()/writeb()/writew() commands.
Since some of the HW structures are one byte in size we need to
add support for the  writeb() and readb() functions in the driver.

Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:18 +02:00
Ohad Sharabi ab4ea58728 habanalabs: use for_each_sgtable_dma_sg for dma sgt
Instead of using for_each_sg when iterating sgt that contains dma
entries, use the more proper for_each_sgtable_dma_sg macro.

In addition, both Goya and Gaudi have the exact same implementation
of the asic function that encapsulate the usage of this macro, so
it is better to move that implementation to the common code.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:18 +02:00
Rajaravi Krishna Katta b8d852add6 habanalabs/gaudi: use lower_32_bits() for casting
Use standard kernel macro to take lower 32 bits of 64-bits variable.

Signed-off-by: Rajaravi Krishna Katta <rkatta@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:18 +02:00
Ohad Sharabi 2ba75d3119 habanalabs: refactor HOP functions in MMU V1
Take advantage of the HOPs shift/masks now defined as arrays.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:17 +02:00
Rajaravi Krishna Katta b31848430f habanalabs: fix comments according to kernel-doc
Incorrect/Missing doxygen tag

Signed-off-by: Rajaravi Krishna Katta <rkatta@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:17 +02:00
Oded Gabbay 658591ec32 habanalabs: remove user interrupt debug print
As user interrupts are a common use case, this dump pollutes the
dmesg log, hence removing it.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:17 +02:00
Oded Gabbay c82b025f2b habanalabs: don't print normal reset operations
Only a hard-reset is an unexpected event which should be notify in
the kernel log. Other resets are normal operations and therefore
we should not pollute the log with them.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:17 +02:00
Oded Gabbay 738607f005 habanalabs: change a reset print to debug level
Currently we have two reset prints per reset. One is in the common
code and one in each asic-specific file.

We can change the asic-specific message to be debug only as we can
know the type of reset being done according to the print in the
common code, which is also easier to maintain.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:17 +02:00
Oded Gabbay fcadbf5688 habanalabs: remove redundant info print
Halting compute engines is a print that doesn't add us any information
because it is always done in the reset process and not used elsewhere.

Even if it was, we don't use prints to mark functions we passed
through.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:17 +02:00
Yuri Nudelman cd92c3678a habanalabs: wrong handle removal in memory manager
During the unified memory manager release, a wrong id was used to remove
an entry from the idr.

Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:16 +02:00
Dafna Hirschfeld 799b9eb01a habanalabs: remove debugfs read/write callbacks
The debugfs memory access now uses the callback 'access_dev_mem'
so there is no use of the callbacks
'debugfs_{read32,read64,write32,write6}'. Remove them.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:16 +02:00
Dafna Hirschfeld 9248aa90d2 habanalabs: enforce alignment upon registers access through debugfs
When accessing the configuration registers through debugfs,
it is only allowed to access aligned address.
Fail if address is not aligned.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:16 +02:00
Dafna Hirschfeld ee8a10c833 habanalabs: unify code for memory access from debugfs
Currently each asic version implements 4 callbacks:
'debugfs_{read32/write32/read64/write64}'
There is a lot of code duplication among the different
callbacks of all asic versions.
This patch unify the code in order to avoid the code
duplication by iterating the pci_mem_region array
in hl_device and use its fields instead of macros.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:16 +02:00
Dafna Hirschfeld 234366d3b6 habanalabs: add callback and field to be used for debugfs refactor
This is a preparation for unifying the code of accessing device memory
through debugfs. Add struct fields and callbacks that will later
be used in debugfs code and will reduce code duplication
among the different read{32,64}/write{32,64} callbacks of
every asic.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:01:16 +02:00
kernel test robot 116a28ac1f habanalabs: hl_ts_behavior can be static
drivers/misc/habanalabs/common/memory.c:2137:28: warning: symbol 'hl_ts_behavior' was not declared. Should it be static?

Fixes: 4d530e7d12 ("habanalabs: convert ts to use unified memory manager")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: kernel test robot <lkp@intel.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 21:00:35 +02:00
Ohad Sharabi d0b59cf68c habanalabs/gaudi: add debugfs to fetch internal sync status
When Gaudi device is secured the monitors data in the configuration
space is blocked from PCI access.
As we need to enable user to get sync-manager monitors registers when
debugging, this patch adds a debugfs that dumps the information to a
binary file (blob).
When a root user will trigger the dump, the driver will send request to
the f/w to fill a data structure containing dump of all monitors
registers.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 20:57:37 +02:00
Ohad Sharabi f5d85fe05a habanalabs: rephrase device out-of-memory message
The out of memory message is rephrased to more subtle expression as out
of memory may be caused by the user in case of, for example, greedy
allocation.

In addition the user is also being notified by an error code.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 20:57:36 +02:00
Dafna Hirschfeld c3712c1d7d habanalabs/gaudi: Use correct sram size macro for debugfs
We currently allow accessing the whole SRAM bar size with
the macro SRAM_BAR_SIZE, but the actual size of the sram
region is the macro SRAM_SIZE which is only a portion of
the whole bar size. So when accessing the sram through
debugfs, use the macro SRAM_SIZE for the sram size
which is the correct macro.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 20:57:36 +02:00
Ohad Sharabi acbabe63ef habanalabs: add MMU prefetch to ASIC-specific code
This is necessary pre-requisite for future ASIC support, where MMU
TLB prefetch is supported.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 20:57:36 +02:00
Yuri Nudelman 4d530e7d12 habanalabs: convert ts to use unified memory manager
With the introduction of the unified memory manager infrastructure, the
timestamp buffers can be converted to use it.

Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 20:57:36 +02:00
Yuri Nudelman babe8e7c04 habanalabs: unified memory manager infrastructure
This is a part of overall refactoring attempt to separate nic and the
core drivers.
Currently, there are 4 different flows, that contain very similar code.
These are the ts, nic, hwblocks and cb alloc/map flows. The similar
aspect of all these flows is that they all contain a central store, with
memory buffers inside, supporting the following set of operations:

- Allocate buffer and return handle
- Get buffer from the store with handle
- Put the buffer (last put releases the buffer)
- Map the buffer to the user

This patch contains a generic data structure used to implement the above
memory buffer store interface. Conversion of the existing code to use
the new data structure will follow.

Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 20:57:36 +02:00
Ofir Bitton b75cce27d0 habanalabs: save f/w preboot major version
We need this property for doing backward compatibility hacks against
the f/w.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 20:57:35 +02:00
Jakob Koschel 9138c24244 habanalabs: replace usage of found with dedicated list iterator variable
To move the list iterator variable into the list_for_each_entry_*()
macro in the future it should be avoided to use the list iterator
variable after the loop body.

To *never* use the list iterator variable after the loop it was
concluded to use a separate iterator variable instead of a
found boolean [1].

This removes the need to use a found variable and simply checking if
the variable was set, can determine if the break/goto was hit.

Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 20:57:35 +02:00
Tomer Tayar 687c6b535e habanalabs: modify dma_mask to be ASIC specific property
The required DMA mask is no longer based on input from the F/W, but it
is fixed per ASIC according to its address space.
As such, the per-ASIC function to get this value can be replaced with a
property variable.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 20:57:35 +02:00
Ofir Bitton c41cb902b2 habanalabs: parse full firmware versions
When parsing firmware versions strings, driver should not
assume a specific length and parse up to the maximum supported
version length.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 20:57:35 +02:00
Tomer Tayar 9d92689ca2 habanalabs/gaudi: avoid resetting max power in hard reset
The default max power is deduced from the card type value in the CPU-CP
info. This value is then set in the max power variable of the device
structure.
Getting the CPU-CP info is done as part of the late init phase
which is called also during reset. This means that a max power value
which is modified via sysfs will be reset during hard reset back to the
default value.
As the max power is updated in any case during device init in
hl_sysfs_init(), this setting in late init can be removed, and the
overriding during reset is thus avoided.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 20:57:35 +02:00
Ofir Bitton b19768d81a habanalabs/gaudi: increase submission resources
In order to allow user to have larger amount of submissions, we
increase the DMA and NIC queue depth to 4K.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 20:57:34 +02:00
Ofir Bitton fdec56c1a4 habanalabs: expose compute ctx status through info ioctl
In order for the user to know if he can try and open device, we
expose the compute ctx state. The user can now know if the context
is used by another process or whether the device is still ongoing
through cleanup or reset and will be available soon.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 20:57:34 +02:00
Ofir Bitton 4c3b9f6e3b habanalabs: add new return code to device fd open
In order to be more informative during device open, we are adding a
new return code -EAGAIN that indicates device is still going through
resource reclaiming and hence it cannot be used yet.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 20:57:34 +02:00
Ohad Sharabi 050a6f349a habanalabs: add user API to get valid DRAM page sizes
Future devices will support multiple device memory page sizes.
In addition, an API for the user was added for it to be able to control
the device memory allocation page size.

This patch is a complementary patch to inform the user of the available
page size supported by the device.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 20:57:34 +02:00
Ohad Sharabi 06926dbed2 habanalabs: convert all MMU masks/shifts to arrays
There is no need to hold each MMU mask/shift as a denoted structure
member (e.g. hop0_mask).

Instead converting it to array will result in smaller and more readable
code.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 20:57:34 +02:00
Ohad Sharabi 2f8f0de878 habanalabs: change mmu_get_real_page_size to be ASIC-specific
This patch breaks the cumbersome implementation of "get real page size"
along with it's multiple inner conditions and implement each case
(according to the real complexity) inside an ASIC function.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 20:57:33 +02:00
Ohad Sharabi 1359fcbe0f habanalabs: add DRAM default page size to HW info
When using the device memory allocation API the user ought to know what
is the default allocation page size.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 20:57:33 +02:00
Ohad Sharabi 378b02dc01 habanalabs: set non-0 value in dram default page size
Looking forward we will need to report to the user what is the default
page size used.

This will be done more conveniently by explicitly updating the property
rather than to rely on a "0 meaning default" value.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22 20:57:33 +02:00
Guenter Roeck 94865e2dcb habanalabs: Fix test build failures
allmodconfig builds on 32-bit architectures fail with the following error.

drivers/misc/habanalabs/common/memory.c: In function 'alloc_device_memory':
drivers/misc/habanalabs/common/memory.c:153:49: error:
	cast from pointer to integer of different size

Fix the typecast. While at it, drop other unnecessary typecasts associated
with the same commit.

Fixes: e8458e20e0 ("habanalabs: make sure device mem alloc is page aligned")
Cc: Ohad Sharabi <osharabi@habana.ai>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/20220404134859.3278599-1-linux@roeck-us.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-04 17:03:04 +02:00
Linus Torvalds 02e2af20f4 Char/Misc and other driver updates for 5.18-rc1
Here is the big set of char/misc and other small driver subsystem
 updates for 5.18-rc1.
 
 Included in here are merges from driver subsystems which contain:
 	- iio driver updates and new drivers
 	- fsi driver updates
 	- fpga driver updates
 	- habanalabs driver updates and support for new hardware
 	- soundwire driver updates and new drivers
 	- phy driver updates and new drivers
 	- coresight driver updates
 	- icc driver updates
 
 Individual changes include:
 	- mei driver updates
 	- interconnect driver updates
 	- new PECI driver subsystem added
 	- vmci driver updates
 	- lots of tiny misc/char driver updates
 
 There will be two merge conflicts with your tree, one in MAINTAINERS
 which is obvious to fix up, and one in drivers/phy/freescale/Kconfig
 which also should be easy to resolve.
 
 All of these have been in linux-next for a while with no reported
 problems.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYkG3fQ8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ykNEgCfaRG8CRxewDXOO4+GSeA3NGK+AIoAnR89donC
 R4bgCjfg8BWIBcVVXg3/
 =WWXC
 -----END PGP SIGNATURE-----

Merge tag 'char-misc-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc and other driver updates from Greg KH:
 "Here is the big set of char/misc and other small driver subsystem
  updates for 5.18-rc1.

  Included in here are merges from driver subsystems which contain:

   - iio driver updates and new drivers

   - fsi driver updates

   - fpga driver updates

   - habanalabs driver updates and support for new hardware

   - soundwire driver updates and new drivers

   - phy driver updates and new drivers

   - coresight driver updates

   - icc driver updates

  Individual changes include:

   - mei driver updates

   - interconnect driver updates

   - new PECI driver subsystem added

   - vmci driver updates

   - lots of tiny misc/char driver updates

  All of these have been in linux-next for a while with no reported
  problems"

* tag 'char-misc-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (556 commits)
  firmware: google: Properly state IOMEM dependency
  kgdbts: fix return value of __setup handler
  firmware: sysfb: fix platform-device leak in error path
  firmware: stratix10-svc: add missing callback parameter on RSU
  arm64: dts: qcom: add non-secure domain property to fastrpc nodes
  misc: fastrpc: Add dma handle implementation
  misc: fastrpc: Add fdlist implementation
  misc: fastrpc: Add helper function to get list and page
  misc: fastrpc: Add support to secure memory map
  dt-bindings: misc: add fastrpc domain vmid property
  misc: fastrpc: check before loading process to the DSP
  misc: fastrpc: add secure domain support
  dt-bindings: misc: add property to support non-secure DSP
  misc: fastrpc: Add support to get DSP capabilities
  misc: fastrpc: add support for FASTRPC_IOCTL_MEM_MAP/UNMAP
  misc: fastrpc: separate fastrpc device from channel context
  dt-bindings: nvmem: brcm,nvram: add basic NVMEM cells
  dt-bindings: nvmem: make "reg" property optional
  nvmem: brcm_nvram: parse NVRAM content into NVMEM cells
  nvmem: dt-bindings: Fix the error of dt-bindings check
  ...
2022-03-28 12:27:35 -07:00
Ofir Bitton 655221c567 habanalabs: remove deprecated firmware states
During driver and F/W handshake, driver waits for F/W to reach
certain states in order to progress with the boot flow.
Some of the states were deprecated a long time ago and were never
present on official firmwares. Therefore, let's remove them from
the handshake process.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:34:50 +02:00
Tomer Tayar b0106bc6fe habanalabs: add an option to delay a device reset
Several H/W events can be sent adjacently, even due to a single error.
If a hard-reset is triggered as part of handling one of these events,
the following events won't be handled.
The debug info from these missed events is important, sometimes even
more important than the one that was handled.

To allow handling these close events, add an option to delay a device
reset and use it when resetting due to H/W events.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:06 +02:00
Jiasheng Jiang 9c27896ac1 habanalabs: Add check for pci_enable_device
As the potential failure of the pci_enable_device(),
it should be better to check the return value and return
error if fails.

Fixes: 70b2f993ea ("habanalabs: create common folder")
Signed-off-by: Jiasheng Jiang <jiasheng@iscas.ac.cn>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:06 +02:00
farah kassabri a78b07dcae habanalabs: Fix reset upon device release bug
In case user application was interrupted while some cs still in-flight
or in the middle of completion handling in driver, the
last refcount of the kernel private data for the user process
will not be put in the fd close flow, but in the cs completion
workqueue context.

This means that the device reset-upon-device-release will be called
from that context. During the reset flow, the driver flushes all the cs
workqueue to ensure that any scheduled work has run to completion,
and since we are running from the completion context we will
have deadlock.

Therefore, we need to skip flushing the workqueue in those cases.
It is safe to do it because the user won't be able to release the device
unless the workqueues are already empty.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:06 +02:00
Ohad Sharabi e8458e20e0 habanalabs: make sure device mem alloc is page aligned
Working with MMU that supports multiple page sizes requires that mapping
of a page of a certain size will be aligned to the same size (e.g. the
physical address of 32MB page shall be aligned to 32MB).

To achieve this the gen_poll allocation is now using the "align" variant
to comply with the alignment requirements.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:05 +02:00
Oded Gabbay 100fcf1e11 habanalabs/gaudi: add missing handling of NIC related events
There are a few events that can arrive from the f/w and without proper
handling can cause errors to appear in the kernel log without reason.

Add the relevant handling that was missing.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:05 +02:00
Oded Gabbay 26ef1c000b habanalabs/gaudi: handle axi errors from NIC engines
Various AXI errors can occur in the NIC engines and are reported to
the driver by the f/w. Add code to print the errors and ack them to
the f/w.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:05 +02:00
Ohad Sharabi f23f280277 habanalabs: allow user to set allocation page size
In future ASICs the MMU will be able to work with multiple page sizes,
thus a new flag is added to allow the user to set the requested page
size.

This flag is added since the whole DRAM is allocated for the user and
the user also should be familiar with the memory usage use case.

As such, the user may choose to "over allocate" memory in favor of
performance (for instance- large page allocations covers more memory
in less TLB entries).

For example: say available page sizes are of 1MB and 32MB. If user
wants to allocate 40MB the user can either set page size to 1MB and
allocate the exact amount of memory (but will result in 40 TLB entries)
or the user can use 32MB pages, "waste" 8MB of physical memory but
occupy only 2 TLB entries.

Note that this feature will be available only to ASIC that supports
multiple DRAM page sizes.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:05 +02:00
Tomer Tayar 59456f4c22 habanalabs: avoid using an uninitialized variable
Fix the following compilation warning in
hl_cb_ioctl() @ command_buffer.c:
warning: ‘device_va’ may be used uninitialized in this function

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:05 +02:00
Tomer Tayar 2908826d09 habanalabs: set max power on device init per ASIC
For current devices there is a need to send the max power value to F/W
during device init, for example because there might be several card
types.
In future devices, this info will be programmed in the device's EEPROM
and will be read by F/W, and hence the driver should not send it.

Modify the sending of the relevant message to be done only for ASIC
types that need it.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:05 +02:00
Tomer Tayar 35629bc171 habanalabs: use proper max_power variable for device utilization
The max_power variable which is used for calculating the device
utilization is the ASIC specific property which is set during init.
However, the max value can be modified via sysfs, and thus the updated
value in the device structure should be used instead.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:05 +02:00
Tomer Tayar d01e6cc97b habanalabs: enable stop-on-error debugfs setting per ASIC
On Goya and Gaudi, the stop-on-error configuration can be set via
debugfs. However, in future devices, this configuration will always be
enabled.
Modify the debugfs node to be allowed only for ASICs that support this
dynamic configuration.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:05 +02:00
Oded Gabbay 4a0b01fa63 habanalabs: change function to static
handle_registration_node() is called directly from the irq handler
in irq.c, so it can be static.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:04 +02:00
Oded Gabbay 9e70ac1aa7 habanalabs: add missing include of vmalloc.h
Use of vfree(), vmalloc_user(), vmalloc() and remap_vmalloc_range()
requires this include in some architectures.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:04 +02:00
Oded Gabbay 57b6f02fff habanalabs: fix use-after-free bug
When the code iterates over the free list of physical pages nodes, it
deletes the physical page node which is used as the iterator.

Therefore, we need to use the safe version of the iteration to prevent
use-after-free.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:04 +02:00
Oded Gabbay 2a835946ee habanalabs: rephrase error messages in PCI initialization
The iATU is an internal h/w machine inside Habana's PCI controller.
Mentioning it by name doesn't say anything to the user. It is better
to say the PCI controller initialization was not done successfully.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:04 +02:00
Oded Gabbay 960be39db6 habanalabs: fix spelling mistake
The name of the property is hints_range_reservation

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:04 +02:00
farah kassabri 9158bf69e7 habanalabs: Timestamps buffers registration
Timestamp registration API allows the user to register
a timestamp record event which will make the driver set
timestamp when CQ counter reaches the target value
and write it to a specific location specified
by the user.
This is a non blocking API, unlike the wait_for_interrupt
which is a blocking one.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:04 +02:00
Dani Liberman b32cd10480 habanalabs: fix race when waiting on encaps signal
Scenario:
1. CS which is part of encaps signal has been completed and now
executing kref_put to its encaps signal handle. The refcount of the
handle decremented to 0, and called the encaps signal handle
release function - hl_encaps_handle_do_release.

2. At this point the user starts waiting on the signal, and finds the
encaps signal handle in the handlers list and increment the habdle
refcount to 1.

3. Immediately after, hl_encaps_handle_do_release removed the handle
from the list and free its memory.

4. Wait function using the handle although it has been freed.

This scenario caused the slab area which was previously allocated
for the handle to be poison overwritten which triggered kernel bug
the next time the OS needed to allocate this slab.

Fixed by getting the refcount of the handle only in case it is not
zero.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:04 +02:00
Dan Carpenter a8076c47f6 habanalabs: silence an uninitialized variable warning
Smatch warns that:

    drivers/misc/habanalabs/common/command_buffer.c:471 hl_cb_ioctl()
    error: uninitialized symbol 'device_va'.

Which is true, but harmless.  Anyway, it's easy to silence this by
adding a error check.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:04 +02:00
Oded Gabbay d2cfd6897c habanalabs: remove duplicate print
We print detailed messages inside the internal ioctl functions. No need
to print a generic message at the end, it doesn't add any information.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:04 +02:00
Tomer Tayar 930feb41ef habanalabs: prevent false heartbeat failure during soft-reset
The heartbeat thread is active during soft-reset, and it tries to send
messages to CPU-CP core.
Within the soft-reset, in the time window in which the device is marked
as disabled, any CPU-CP command is "silently" skipped and a success
value it returned.
However, in addition to the return value, the heartbeat function also
checks the F/W result, but because no command is sent in this time
window, the result variable won't hold the expected value and we will
have a false heartbeat failure.

To avoid it, modify the "silent" skip to be done only in hard-reset.
The CPU-CP should be able to handle messages during soft-reset.

In addition to the heartbeat problem, this should also solve other
issues in other flows that send messages during soft-reset and use the
F/W result as it w/o being aware to the reset.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:03 +02:00
Oded Gabbay 7a78d4d481 habanalabs: fix race between wait and irq
There is a race in the user interrupts code, where between checking
the target value and adding the new pend to the list, there is a chance
the interrupt happened.

In that case, no one will complete the node, and we will get a timeout
on it.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:03 +02:00
Oded Gabbay 54faa5607b habanalabs: fix user interrupt wait when timeout is 0
When timeout is 0, we need to return the busy status in case the
target value wasn't reached upon entry to the ioctl.

Also return the correct timestamp.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:03 +02:00
Oded Gabbay 9a79e3e4a3 habanalabs: reject host map with mmu disabled
This is not something we can do a workaround. It is clearly an error
and we should notify the user that it is an error.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:03 +02:00
Oded Gabbay aa3766def7 habanalabs: expose number of user interrupts
Currently we only expose to the user the ID of the first available
user interrupt. To make user interrupts allocation truly dynamic, we
need to also expose the number of user interrupts.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:03 +02:00
Oded Gabbay 008255ec3d habanalabs: update to latest f/w specs
Copy the latest versions of the f/w specs files.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:03 +02:00
Tomer Tayar 4ae9548de7 habanalabs: add missing error check in sysfs max_power_show
Add a missing error check in the sysfs show function for max_power.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:03 +02:00
Dani Liberman 15f8eb1905 habanalabs: fix soft reset flow in case of failure
In case of soft reset failure, hard reset should be initiated, but
reset flags were not set to enable it, which caused another soft reset
followed by another failure.
Updated reset flags to enable hard reset flow in case of soft reset
failure.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:03 +02:00
Tomer Tayar aa3e1f12a2 habanalabs: add missing error check in sysfs clk_freq_mhz_show
Add a missing error check in the sysfs show functions for
clk_max_freq_mhz and clk_cur_freq_mhz_show.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:03 +02:00
Tomer Tayar ca4c8e4e7b habanalabs: avoid copying pll data if pll_info_get fails
If reading PLL info from F/W fails, the PLL info is not set in the
"result" variable, and hence shouldn't be copied to the caller's array.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:02 +02:00
Oded Gabbay 7169f0dfec habanalabs: don't free phys_pg_pack inside lock
Freeing phys_pg_pack includes calling to scrubbing functions of the
device's memory, taking locks and possibly even calling reset.

This is not something that should be done while holding a device-wide
spinlock.

Therefore, save the relevant objects on a local linked-list and after
releasing the spinlock, traverse that list and free the phys_pg_pack
objects.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:02 +02:00
Ohad Sharabi 1dc6cc4b38 habanalabs: duplicate HOP table props to MMU props
In order to support several device MMU blocks with different
architectures (e.g. different HOP table size) we need to move to
per-MMU properties rather than keeping those properties as ASIC
properties.

Refactoring the code to use "per-MMU proprties" is a major effort.

To start making the transition towards this goal but still support
taking the properties from ASIC properties (for code that currently
uses them) this patch copies some of the properties to the "per-MMU"
properties and later, when implementing the per-MMU properties, we
would be able to delete the MMU props from the ASIC props.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:02 +02:00
Oded Gabbay e24a62cb68 habanalabs: there is no kernel TDR in future ASICs
In future ASICs, there is no kernel TDR for new workloads that are
submitted directly from user-space to the device.

Therefore, the driver can NEVER know that a workload has timed-out.

So, when the user asks us to wait for interrupt on the workload's
completion, and the wait has timed-out, it doesn't mean the workload
has timed-out. It only means the wait has timed-out, which is NOT an
error from driver's perspective.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:02 +02:00
Rajaravi Krishna Katta 4c01e524b2 habanalabs: sysfs support for fw os version
Adds new sysfs entry to display firmware os version
/sys/class/habanalabs/hl<n>/fw_os_ver

Signed-off-by: Rajaravi Krishna Katta <rkatta@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:02 +02:00
Oded Gabbay 6ba2c0ce26 habanalabs: use common wrapper for MMU cache invalidation
We have a common function that wraps the call to the MMU cache
invalidation function, which is ASIC-specific. The wrapper checks
the return value and prints error if necessary. For consistency, try
to use the wrapper when possible.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:02 +02:00
Oded Gabbay 2491533808 habanalabs: remove power9 workaround for dma support
We don't need this workaround anymore.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:02 +02:00
Oded Gabbay b62ff1a412 habanalabs: add vrm version to sysfs
infineon version is only applicable to GOYA and GAUDI. For later
ASICs, we display the Voltage Regulator Monitor f/w version.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:02 +02:00
Oded Gabbay be028a3648 habanalabs: rename dev_attr_grp to dev_clk_attr_grp
In this attribute group we are only adding clocks. This is in
preparation for adding a device specific attribute group which is
not related to clocks.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:02 +02:00
Oded Gabbay 7ae439a061 habanalabs: remove asic callback set_pll_profile()
Setting PLL profile is the same for all ASICs, except for GOYA.
However, because this function is never called from common code, there
is no need to have an asic-specific callback function.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:02 +02:00
Oded Gabbay 092a31c526 habanalabs: move more f/w functions to firmware_if.c
For better maintainability, try to concentrate all the common functions
that communicate with the f/w in firmware_if.c

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:01 +02:00
Oded Gabbay 8d96430784 habanalabs: remove hwmgr.c
The two remaining functions in this file belong to firmware_if.c,
as they communicate with the firmware.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:01 +02:00
Oded Gabbay 9e2884ce98 habanalabs: get clk is common function
Retrieving the clock from the f/w is done exactly the same in ALL our
ASICs. Therefore, no real justification for doing it as an
ASIC-specific function.

The only thing is we need to check if we are running on simulator,
which doesn't require ASIC-specific callback.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:01 +02:00
Oded Gabbay bfbe9cbedd habanalabs: sysfs functions should be in sysfs.c
Move common sysfs store/show functions to sysfs.c file for
consistency.

This is part of a patch-set to remove hwmgr.c

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:01 +02:00
Ohad Sharabi 2bf338f2ac habanalabs: make some MMU functions common
Some MMU functions can be used by different versions of our MMUs, so
move them to be common.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:01 +02:00
Oded Gabbay d280d5954e habanalabs: remove ASIC functions of clock gating
Now that clock gating is permanently disabled in GAUDI, no need for
the ASIC functions of setting and disabling clock gating, as this
was a unique scenario in GAUDI.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:01 +02:00
Oded Gabbay 4edb4ffe39 habanalabs/gaudi: disable CGM permanently
Due to the need of SynapseAI to configure all TPC engines from a single
QMAN, the driver must disable CGM and never allow the user to enable
it. Otherwise, the configuration of the TPC engines will fail.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:01 +02:00
Ohad Sharabi eb85eec858 habanalabs: fix possible memory leak in MMU DR fini
This patch fixes what seems to be copy paste error.

We will have a memory leak if the host-resident shadow is NULL (which
will likely happen as the DR and HR are not dependent).

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:01 +02:00
Tomer Tayar aff5d9d378 habanalabs: check the return value of hl_cs_poll_fences()
As part of handling of the multi-CS wait ioctl, hl_cs_poll_fences() is
called in a "while (true)" loop. This function can fail, but the
checking of its return value was missed.
Add this check and exit the loop in case of a failure.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-02-28 14:22:01 +02:00
Gustavo A. R. Silva 5224f79096 treewide: Replace zero-length arrays with flexible-array members
There is a regular need in the kernel to provide a way to declare
having a dynamically sized set of trailing elements in a structure.
Kernel code should always use “flexible array members”[1] for these
cases. The older style of one-element or zero-length arrays should
no longer be used[2].

This code was transformed with the help of Coccinelle:
(next-20220214$ spatch --jobs $(getconf _NPROCESSORS_ONLN) --sp-file script.cocci --include-headers --dir . > output.patch)

@@
identifier S, member, array;
type T1, T2;
@@

struct S {
  ...
  T1 member;
  T2 array[
- 0
  ];
};

UAPI and wireless changes were intentionally excluded from this patch
and will be sent out separately.

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.16/process/deprecated.html#zero-length-and-one-element-arrays

Link: https://github.com/KSPP/linux/issues/78
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2022-02-17 07:00:39 -06:00
Ofir Bitton ce80098db2 habanalabs: support hard-reset scheduling during soft-reset
As hard-reset can be requested during soft-reset, driver must allow
it or else critical events received during soft-reset will be
ignored.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 14:42:31 +02:00
Ofir Bitton 42eb2872e0 habanalabs: add a lock to protect multiple reset variables
Atomic operations during reset are replaced by a spinlock in order
to have the ability to protect more than a single variable.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 14:42:11 +02:00
Ofir Bitton eb13529191 habanalabs: refactor reset information variables
Unify variables related to device reset, which will help us to
add some new reset functionality in future patches.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 14:41:28 +02:00
Ohad Sharabi 60bf3bfb5a habanalabs: handle skip multi-CS if handling not done
This patch fixes issue in which we have timeout for multi-CS although
the CS in the list actually completed.

Example scenario (the two threads marked as WAIT for the thread that
handles the wait_for_multi_cs and CMPL as the thread that signal
completion for both CS and multi-CS):
1. Submit CS with sequence X
2. [WAIT]: call wait_for_multi_cs with single CS X
3. [CMPL]: CS X do invoke complete_all for both CS and multi-CS
           (multi_cs_completion_done still false)
4. [WAIT]: enter poll_fences, reinit the completion and find the CS
           as completed when asking on the fence but multi_cs_done is
	   still false it returns that no CS actually completed
5. [CMPL]: set multi_cs_handling_done as true
6. [WAIT]: wait for completion but no CS to awake the wait context
           and hence wait till timeout

Solution: if CS detected as completed in poll_fences but multi_cs_done
          is still false invoke complete_all to the multi-CS completion
	  and so it will not go to sleep in wait_for_completion but
	  rather will have a "second chance" to wait for
	  multi_cs_completion_done.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 14:40:25 +02:00
Tomer Tayar f297a0e9fe habanalabs: add CPU-CP packet for engine core ASID cfg
In some cases the driver cannot configure ASID of some engines due to
the security level of the relevant registers.
For this a new CPU-CP packet is introduced, which will allow the driver
to ask the F/W to do this configuration instead.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 14:39:53 +02:00
Oded Gabbay 519f4ed0a0 habanalabs: replace some -ENOTTY with -EINVAL
-ENOTTY is returned in case of error in the ioctl arguments themselves,
such as function that doesn't exists.

In all other cases, where the error is in the arguments of the custom
data structures that we define that are passed in the various ioctls,
we need to return -EINVAL.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:10 +02:00
Ofir Bitton 0a63ac769b habanalabs: fix comments according to kernel-doc
Fix missing fields, descriptions not according to kernel-doc style.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:10 +02:00
Ofir Bitton a7224c2116 habanalabs: fix endianness when reading cpld version
Current sysfs implementation does not take endianness into
consideration when dumping the cpld version.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:09 +02:00
farah kassabri b9d31cada7 habanalabs: change wait_for_interrupt implementation
Currently the cq counters are allocated in userspace memory,
and mapped by the driver to the device address space.

A new requirement that is part of new future API related to this one,
requires that cq counters will be allocated in kernel memory.

We leverage the existing cb_create API with KERNEL_MAPPED flag set to
allocate this memory.

That way we gain two things:
1. The memory cannot be freed while in use since it's protected
by refcount in driver.

2. No need to wake up the user thread upon each interrupt from CQ,
because the kernel has direct access to the counter. Therefore,
it can make comparison with the target value in the interrupt
handler and wake up the user thread only if the counter reaches the
target value. This is instead of waking the thread up to copy counter
value from user then go sleep again if target value wasn't reached.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:09 +02:00
Ohad Sharabi e2558f0f84 habanalabs: prevent wait if CS in multi-CS list completed
By the original design we assumed that if we "miss" multi CS completion
it is of no severe consequence as we'll just call wait_for_multi_cs
again.

Sequence of events for such scenario:
1. user submit CS with sequence N
2. user calls wait for multi-CS with only CS #N in the list
3. the multi CS call starts with poll of the CSs but find that none
   completed (while CS #N did not completed yet)
4. now, multi CS #N complete but multi CS CTX was not yet created for
   the above multi-CS. so, attempt to complete multi-CS fails (as no
   multi CS CTX exist)
5. wait_for_multi_cs call now does init_wait_multi_cs_completion (and
   for this create the multi-CS CTX)
6. wait_for_multi_cs wits on completion but will not get one as CS #N
   already completed

To fix the issue we initialize the multi-CS CTX prior polling the
fences.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:09 +02:00
Ofir Bitton 86c00b2c36 habanalabs: modify cpu boot status error print
As BTL can be replaced by ROM we should modify relevant error print.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:09 +02:00
Ohad Sharabi d636a932b3 habanalabs: clean MMU headers definitions
During the MMU development the MMU header files were left with unclean
definitions:

- MMU "version specific" definitions that were left in the mmu_general
  file
- unused definitions

This patch attempts, where possible, to keep definitions that can serve
multiple MMU versions (but that are not tightly bound with specific MMU
arch) in the mmu_general header file (e.g. different definitions for
number of HOPs).

Otherwise, move MMU version specific definitions (e.g. HOPs masks and
shifts) to the specific MMU version file.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:09 +02:00
Ofir Bitton 9993f27de1 habanalabs: expose soft reset sysfs nodes for inference ASIC
As we allow soft-reset to be performed only on inference devices,
having the sysfs nodes may cause a confusion. Hence, we remove those
nodes on training ASICs.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:09 +02:00
Ofir Bitton b5c92b8882 habanalabs: sysfs support for two infineon versions
Currently sysfs support dumping a single infineon version, in
future asics we will have two infineon versions.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:09 +02:00
Dani Liberman 707c125286 habanalabs: keep control device alive during hard reset
Need to allow user retrieve data during reset and afterwards without
the need to reopen the device.
Did it by seperating the user peocesses list into two lists:
1. fpriv_list which contains list of user processes that opened
   the device (currently only one).
2. fpriv_ctrl_list which contains list of user processes that opened
   the control device. This processes in this list shall not be
   killed during reset, only when the device is suddenly removed from
   PCI chain.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:09 +02:00
Oded Gabbay bb099a8051 habanalabs: fix hwmon handling for legacy f/w
In legacy f/w that use old hwmon.h file, the values of the hwmon
enums are different than the values that are in newer kernels (5.6
and above).

Therefore, to support working with those f/w, we need to do some
fixup before registering with the hwmon subsystem and also when
calling the functions that communicate with the f/w to retrieve
sensors information.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:09 +02:00
Ofir Bitton 9acdc21b0b habanalabs: add current PI value to cpu packets
In order to increase cpucp messaging reliability we will add
the current PI value to the descriptor sent to F/W.
F/W will wait for the PI value as an indication of a valid packet.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:09 +02:00
Oded Gabbay 7363805b8a habanalabs: remove in_debug check in device open
The driver supports only a single user anyway, so there is no point
in checking whether we are in_debug state when a user tries to open
the device, because if we are in_debug, it means a user is already
using the device.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:08 +02:00
Ofir Bitton 7c623ef732 habanalabs: return correct clock throttling period
Current clock throttling period returned from driver was wrong due
to wrong time comparison.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:08 +02:00
Ohad Sharabi b02220536c habanalabs: wait again for multi-CS if no CS completed
The original multi-CS design assumption that stream masters are used
exclusively (i.e. multi-CS with set of stream master QIDs will not get
completed by CS not from the multi-CS set) is inaccurate.

Thus multi-CS behavior is now modified not to treat such case as an
error.

Instead, if we have multi-CS completion but we detect that no CS from
the list is actually completed we will do another multi-CS wait (with
modified timeout).

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:08 +02:00
Oded Gabbay 5b90e59d55 habanalabs: remove compute context pointer
It was an error to save the compute context's pointer in the device
structure, as it allowed its use without proper ref-cnt.

Change the variable to a flag that only indicates whether there is
an active compute context. Code that needs the pointer will now
be forced to use proper internal APIs to get the pointer.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:08 +02:00
Oded Gabbay 4337b50b5f habanalabs: add helper to get compute context
There are multiple places where the code needs to get the context's
pointer and increment its ref cnt. This is the proper way instead
of using the compute context pointer in the device structure.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:08 +02:00
Oded Gabbay 6798676f7e habanalabs: fix etr asid configuration
Pass the user's context pointer into the etr configuration function
to extract its ASID.

Using the compute_ctx pointer is an error as it is just an indication
of whether a user has opened the compute device.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:08 +02:00
Oded Gabbay 357ff3dc9a habanalabs: save ctx inside encaps signal
Compute context pointer in hdev shouldn't be used for fetching the
context's pointer.

If an object needs the context's pointer, it should get it while
incrementing its kref, and when the object is released, put it.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:08 +02:00
Oded Gabbay a4dd2ecf36 habanalabs: remove redundant check on ctx_fini
The driver supports only a single context. Therefore, no need to check
if the user context that is closed is the compute context. The user
context, if exists, is always the compute context.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:08 +02:00
Oded Gabbay fee187fe46 habanalabs: free signal handle on failure
Fix a bug where in case of failure to allocate idr, the handle's
memory wasn't freed as part of the error handling code.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:08 +02:00
Tomer Tayar b166465452 habanalabs: add missing kernel-doc comments for hl_device fields
Add missing kernel-doc comments for the "last_error" and
"stream_master_qid_arr" fields of the "hl_device" structure".

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:08 +02:00
Tomer Tayar d214636be8 habanalabs: pass reset flags to reset thread
The reset flags used by the reset thread are currently a mix of
hard-coded values and a specific flag which is passed from the context
that initiates the reset.
To make it easier to pass more flags in future from this context to the
reset thread, modify it to pass all the original reset flags to the
thread.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:07 +02:00
Dani Liberman 2487f4a281 habanalabs: enable access to info ioctl during hard reset
Because info ioctl is used to retrieve data, some of its opcodes may be
used during hard reset.
Other ioctls should be blocked while device is not operational.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:07 +02:00
Dani Liberman 1880f7acd7 habanalabs: add SOB information to signal submission uAPI
For debug purpose, add SOB address and SOB initial counter value
before current submission to uAPI output.

Using SOB address and initial counter, user can calculate how much of
the submmision has been completed.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:07 +02:00
Ohad Sharabi 4fac990f60 habanalabs: skip read fw errors if dynamic descriptor invalid
Reporting FW errors involves reading of the error registers.

In case we have a corrupted FW descriptor we cannot do that since the
dynamic scratchpad is potentially corrupted as well and may cause kernel
crush when attempting access to a corrupted register offset.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:07 +02:00
Ofir Bitton 3416d4b59b habanalabs: handle events during soft-reset
Driver should handle events during soft-reset as F/W is not
going through reset and it keeps sending events towards host.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:07 +02:00
Ofir Bitton b13bef2041 habanalabs: change misleading IRQ warning during reset
Currently we dump the physical IRQ line index in host if an event
is received during reset. This ID is confusing as it means nothing
to the user.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:07 +02:00
Tomer Tayar 75a5c44d14 habanalabs: add power information type to POWER_GET packet
In new f/w versions, it is required to explicitly indicate the power
information type when querying the F/W for power info.
When getting the current power level it should be set to power_input.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:07 +02:00
Ofir Bitton 4119433445 habanalabs: add more info ioctls support during reset
Some info ioctls can be served even if the device is disabled or
in reset. Hence, we enable more info ioctls during reset, as these
ioctls do not require any H/W nor F/W communication.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:07 +02:00
Dani Liberman 3beaf903a3 habanalabs: fix race condition in multi CS completion
Race example scenario:
1. User have 2 threads that waits on multi CS:
   - thread_0 waits on QID 0 and uses multi CS context 0.
   - thread_1 waits on QID 1 and uses multi CS context 1.
2. thread_1 got completion and release multi CS context 1.
3. CS related to multi CS of thread_0 starts executing
   complete_multi_cs function, the first iteration of the loop
   completes the multi CS of thread_0, hence multi CS context 0
   is released.
4. thread_1 waits on QID 1 and uses multi CS context 0.
5. thread_0 waits on QID 0 and uses multi CS context 1.
6. The second iterattion of the loop (from step 3) starts, which
   means, start checking multi CS context 1:
   - multi CS contetxt is being used by thread_0 waiting on QID 0.
   - The fence of the CS (still CS from step 3) has QID map the same
     as the multi CS context 1.
   - multi CS context 1 (thread_0) gets completion on CS that triggered
     already thread_0 (with multi CS context 0) and is no longer
     being waited on.

Fixed by exiting the loop in complete_multi_cs after getting completion

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:07 +02:00
Ofir Bitton cad9eb4a8d habanalabs: move device boot warnings to the correct location
As device boot warnings clears the indication from the error mask,
they must be located together before the unknown error validation.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:07 +02:00
Oded Gabbay 9eade72e72 habanalabs/gaudi: return EPERM on non hard-reset
GAUDI supports only hard-reset. Therefore, this function should
return an error of operation not permitted.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:06 +02:00
Oded Gabbay 6c1bad35e6 habanalabs: rename late init after reset function
The ASIC-specific soft_reset_late_init() is now called after either
soft-reset or reset-upon-device-release. Therefore, it needs a more
appropriate name.

No need to split it to two functions, as an ASIC either supports
soft-reset or reset-upon-device-release.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:06 +02:00
Oded Gabbay 60e0431f41 habanalabs: fix soft reset accounting
Reset upon device release is not a soft-reset from user/system point
of view. As such, we shouldn't count that reset in the statistics we
gather and expose to the monitoring applications.

We also shouldn't print soft-reset when doing the reset upon device
release.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:06 +02:00
Rajaravi Krishna Katta d8eb50f31c habanalabs: Move frequency change thread to goya_late_init
Changing the frequency automatically is only done in Goya. In future
ASICs this is done inside the firmware. Therefore, move the common code
into the Goya specific files.

Main changes as part of the commit are:
    1. The thread for setting frequency is moved from device_late_init
       to goya_late_init
    2. hl_device_set_frequency is removed from hl_device_open as it is
       not relevant for other ASICs and for Goya it is taken care by
       the thread
    3. hl_device_set_frequency is renamed as goya_set_frequency

Signed-off-by: Rajaravi Krishna Katta <rkatta@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:06 +02:00
Oded Gabbay ab440d3e39 habanalabs: abort reset on invalid request
Hard-reset is mutually exclusive with reset-on-device-release.
Therefore, if such a request arrives to the reset function, abort
the reset and return an error to the callee.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:06 +02:00
Ofir Bitton a1b838adb0 habanalabs: fix possible deadlock in cache invl failure
Currently there is a deadlock in driver in scenarios where MMU
cache invalidation fails. The issue is basically device reset
being performed without releasing the MMU mutex.
The solution is to skip device reset as it is not necessary.
In addition we introduce a slight code refactor that prints the
invalidation error from a single location.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:06 +02:00
Ohad Sharabi 6f61e47a68 habanalabs: skip PLL freq fetch
Getting the used PLL index with which to send the CPUPU packet relies on
the CPUCP info packet.

In case CPU queues are not enabled getting the PLL index will issue an
error and in some ASICs will also fail the driver load.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:06 +02:00
Oded Gabbay fe8d70873c habanalabs: prevent false heartbeat message
If a device reset has started, there is a chance that the heartbeat
function will fail because the device is disabled at the beginning
of the reset function.

In that case, we don't want the error message to appear in the log.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:06 +02:00
Dani Liberman 3e55b5dbf9 habanalabs: add support for fetching historic errors
A new uAPI is added for debug purposes of the user-space to retrieve
errors related data from previous session (before device reset was
performed).

Inforamtion is filled when a razwi or CS timeout happens and can
contain one of the following:

1. Retrieve timestamp of last time the device was opened and razwi or
   CS timeout happened.
2. Retrieve information about last CS timeout.
3. Retrieve information about last razwi error.

This information doesn't contain user data, so no danger of data
leakage between users.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:05 +02:00
Ofir Bitton e2637fdca7 habanalabs: handle device TPM boot error as warning
AS TPM error indication is not fatal, driver should dump a warning
and continue booting.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:05 +02:00
Ofir Bitton 3eb7754ff4 habanalabs: debugfs support for larger I2C transactions
I2C debugfs support is limited to 1 byte. We extend functionality
to more than 1 byte by using one of the pad fields as a length.
No backward compatibility issues as new F/W versions will treat 0
length as a 1 byte length transaction.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:05 +02:00
Oded Gabbay e617f5f4c1 habanalabs: make hdev creation code more readable
Divide the code into 3 different parts:
- Copy kernel parameters
- Setting device behaivor per asic
- Fixup of various device parameters according to the device behaivor.

In addition, remove non-relevant code for upstream (simulator support).

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:05 +02:00
farah kassabri 49c052dad6 habanalabs: add new opcodes for INFO IOCTL
Add implementation for new opcodes in the INFO IOCTL:
1. Retrieve the replaced DRAM rows from f/w.
2. Retrieve the pending DRAM rows from f/w.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:05 +02:00
Bharat Jauhari d4194f2140 habanalabs: refactor wait-for-user-interrupt function
Refactor the wait-for-user-interrupt routine to make it more
generic for re-use for other user exposed h/w interfaces in future
ASICs.

Signed-off-by: Bharat Jauhari <bjauhari@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:05 +02:00
farah kassabri 792512459f habanalabs/gaudi: Fix collective wait bug
In Signaling-From-Graph case, the driver didn't set the hw_sob pointer
at the right place, which is needed for the cs completion
check prior to start sending all the master/slaves jobs to device.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:05 +02:00
Ofir Bitton 1679c7ee58 habanalabs: expand clock throttling information uAPI
In addition to the clock throttling reason, user should be able
to obtain also the start time and the duration of the throttling
event.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:05 +02:00
Dani Liberman 48f3116983 habanalabs: change wait for interrupt timeout to 64 bit
In order to increase maximum wait-for-interrupt timeout, change it
to 64 bit variable. This wait is used only by newer ASICs, so no
problem in changing this interface at this time.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:05 +02:00
Bharat Jauhari 234caa5273 habanalabs: rename reset flags
Rename reset flags for better readability as compared to
HL_RESET_CAUSE* enum shared with the f/w.

Signed-off-by: Bharat Jauhari <bjauhari@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:05 +02:00
Rajaravi Krishna Katta e84e31a912 habanalabs: add dedicated message towards f/w to set power
CPUCP_PACKET_POWER_GET packet type was used for both
hl_get_power() and hl_set_power().

To align with other sensor functions hl_set_power()
should use CPUCP_PACKET_POWER_SET.

This packet will only be used with newer ASICs, so need to add
a compatibility flag to the asic properties to indicate whether to use
this packet or the GET packet.

Signed-off-by: Rajaravi Krishna Katta <rkatta@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:04 +02:00
Bharat Jauhari 1388582264 habanalabs: handle abort scenario for user interrupt
In case of device reset, the driver does a force trigger on all waiting
users to release them from waiting. However, the driver does not handle
error scenario while waiting.

hl_interrupt_wait_ioctl() now exits the wait in case of an error with
abort status.

Signed-off-by: Bharat Jauhari <bjauhari@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:04 +02:00
Ohad Sharabi 5edd95a4ab habanalabs: don't clear previous f/w indications
Once we read indication of whether f/w is doing the reset, we don't
want to clear it, until the next time we read this indication.

Otherwise, we might be in a state of wrong indication.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:04 +02:00
Ohad Sharabi f4e7906dbe habanalabs: use variable poll interval for fw loading
Using a variable poll interval for fw loading allows us to support
much slower environments (emulation) while changing only a single
line in the code, instead of choosing a different interval in each
function that polls.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:04 +02:00
Ohad Sharabi 8f82ff75df habanalabs: adding indication of boot fit loaded
Up until now the driver stored indication if Linux was loaded on the
device CPU. This was needed in order to coordinate some tasks that are
performed by the Linux.

In future ASICs, many of those tasks will be performed by the boot
fit, so now we need the same indication of boot fit load status.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:04 +02:00
Yuri Nudelman 6ccba9a3bc habanalabs: partly skip cache flush when in PMMU map flow
The PCI MMU cache is two layered. The upper layer, memcache, uses cache
lines, the bottom layer doesn't.

Hence, after PMMU map operation we have to invalidate memcache, to avoid
the situation where the new entry is already in the cache due to its
cache line being fully in the cache.

However, we do not have to invalidate the lower cache, and here we can
optimize, since cache invalidation is time consuming.

Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:04 +02:00
Yuri Nudelman 82e5169e8a habanalabs: add enum mmu_op_flags
The enum vm_type was abused, used once as a value (indication
memory type for map) and once as a flag (for cache invalidation).
This makes it hard to add new and still keep it meaningful, hence it
is better to split into one enum for values and one for flags.

Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:04 +02:00
Yuri Nudelman 89d6decdb7 habanalabs: make last_mask an MMU property
Currently LAST_MASK is a global, but really it is an MMU implementation
specific. We need this change for future ASICs.

Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:04 +02:00
Yuri Nudelman f06bad02b5 habanalabs: wrong VA size calculation
VA blocks are currently stored in an inconsistent way. Sometimes block
end is inclusive, sometimes exclusive. This leads to wrong size
calculations in certain cases, plus could lead to a segmentation fault
in case mapping process fails in the middle and we try to roll it back.
Need to make this consistent - start inclusive till end inclusive.

For example, the regions table may now look like this:
    0x0000 - 0x1fff : allocated
    0x2000 - 0x2fff : free
    0x3000 - 0x3fff : allocated

Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:04 +02:00
Guy Zadicario 90d283b672 habanalabs/gaudi: fix debugfs dma channel selection
Do not use a dma channel for debugfs requested transfer if it's
QM is not idle.

Signed-off-by: Guy Zadicario <gzadicario@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:03 +02:00
Ohad Sharabi bfd5110682 habanalabs: revise and document use of boot status flags
The boot status flag "SRAM available" can be set by f/w Linux (in the
general case) or by f/w uboot (in some specific debug scenario) but
never by f/w preboot.

Hence, when polling the boot status flags in the preboot stage we do not
want to poll on "SRAM Avialable".

The special case in which uboot set this flag is when we are running
special debug scenario without Linux. In this case, at some point during
the boot, the uboot relocates its code to the DRAM and then set the
specified flag.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:03 +02:00
Yuri Nudelman ba3aca31f9 habanalabs: print va_range in vm node debugfs
VA range info could assist in debugging VA allocation bugs.

Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:03 +02:00
Oded Gabbay 4cd454a205 habanalabs/gaudi: recover from CPU WD event
There are rare cases where the device CPU's watchdog has expired and as
a result, the watchdog reset has happened and the CPU will now move to
running its preboot f/w.

When that happens, the driver will only know that a heartbeat failure
occurred. As a result, the driver will send a message to the CPU's main
f/w asking it to reset the device, but because the CPU is now running
preboot, it won't respond and the re-initialization process will later
fail when trying to load the f/w.

The solution is to send the request to the preboot as well, only if the
reset was caused because of HB failure.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:03 +02:00
Ohad Sharabi c9d1383c75 habanalabs: modify wait for boot fit in dynamic FW load
In the dynamic FW load protocol the boot status is updated to
"Ready to Boot" once uboot is active.

Polling on other boot status values is a residue of code duplication
from the static protocol and should be removed.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-12-26 08:59:03 +02:00
Greg Kroah-Hartman 16b0314aa7 dma-buf: move dma-buf symbols into the DMA_BUF module namespace
In order to better track where in the kernel the dma-buf code is used,
put the symbols in the namespace DMA_BUF and modify all users of the
symbols to properly import the namespace to not break the build at the
same time.

Now the output of modinfo shows the use of these symbols, making it
easier to watch for users over time:

$ modinfo drivers/misc/fastrpc.ko | grep import
import_ns:      DMA_BUF

Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: dri-devel@lists.freedesktop.org
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Sumit Semwal <sumit.semwal@linaro.org>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20211010124628.17691-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-25 14:53:08 +02:00
Dani Liberman b2faac3887 habanalabs: refactor fence handling in hl_cs_poll_fences
To avoid checking if fence exists multipled times, changed fence
handling to depend only on the fence status field:

Busy, which means CS still did not completed :
	Add its QID so multi CS wait on its completion.
Finished, which means CS completed and fence exists:
	Raise its completion bit if it finished mcs handling and
	update if necessary the earliest timestamp.
Gone, which means CS already completed and fence deleted:
	Update multi CS data to ignore timestamp and raise its
	completion bit.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-10-18 12:05:48 +03:00
Omer Shpigelman fae132632c habanalabs: context cleanup cosmetics
No need to check the return value if the following action is the same for
both cases. In addition, now that hl_ctx_free() doesn't print if the
context is not released, its name can be misleading as the context might
stay alive after it is executed with no indication for that.
Hence we can discard it and simply put the refcount.

Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-10-18 12:05:48 +03:00
Yuri Nudelman d2f5684b8f habanalabs: simplify wait for interrupt with timestamp flow
Remove the flag that determines whether to take a timestamp once the
interrupt arrives.
Instead, always take the timestamp once per interrupt.
This is a must for the user-space to measure its graph operations
to evaluate the graph computation time.

Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-10-18 12:05:47 +03:00
Moti Haimovski 4a18dde5e4 habanalabs: initialize hpriv fields before adding new node
When adding a new node to the hpriv list, the driver should
initialize its fields before adding the new node.

Otherwise, there may be some small chance of another thread traversing
that list and accessing the new node's fields without them being
initialized.

Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-10-18 12:05:47 +03:00
Rajaravi Krishna Katta 024b7b1d6d habanalabs: Unify frequency set/get functionality
Make the frequency set/get functionality common to all ASICs.
This makes more code reusable when adding support for newer ASICs.

Signed-off-by: Rajaravi Krishna Katta <rkatta@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-10-18 12:05:47 +03:00
Vegard Nossum f6fb34390c habanalabs: select CRC32
Fix the following build/link error by adding a dependency on the CRC32
routines:

  ld: drivers/misc/habanalabs/common/firmware_if.o: in function `hl_fw_dynamic_request_descriptor':
  firmware_if.c:(.text.unlikely+0xc89): undefined reference to `crc32_le'

Fixes: 8a43c83fec ("habanalabs: load boot fit to device")
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-10-18 12:05:47 +03:00
Tomer Tayar db1a8dd916 habanalabs: add support for dma-buf exporter
Implement the calls to the dma-buf kernel api to create a dma-buf
object backed by FD.

We block the option to mmap the DMA-BUF object because we don't support
DIRECT_IO and implicit P2P. We only implement support for explicit P2P
through importing the FD of the DMA-BUF.

In the export phase, we provide to the DMA-BUF object an array of pages
that represent the device's memory area. During the map callback,
we convert the array of pages into an SGT. We split/merge the pages
according to the dma max segment size of the importer.

To get the DMA address of the PCI bar, we use the dma_map_resources()
kernel API, because our device memory is not backed by page struct
and this API doesn't need page struct to map the physical address to
a DMA address.

We set the orig_nents member of the SGT to be 0, to indicate to other
drivers that we don't support CPU mappings.

Note that in Habanalabs's ASICs, the device memory is pinned and
immutable. Therefore, there is no need for dynamic mappings and pinning
callbacks.

Also note that in GAUDI we don't have an MMU towards the device memory
and the user works on physical addresses. Therefore, the user doesn't
pass through the kernel driver to allocate memory there. As a result,
only for GAUDI we receive from the user a device memory physical address
(instead of a handle) and a size.

We check the p2p distance using pci_p2pdma_distance_many() and refusing
to map dmabuf in case the distance doesn't allow p2p.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-10-18 12:05:47 +03:00
Dani Liberman 81f8582ec4 habanalabs: fix NULL pointer dereference
When polling fences for multi CS, it is possible that fence is
no longer exists (its corresponding CS completed and the fence was
deleted) but we still accessing its parameters, causing NULL pointer
dereference.

Fixed by checking if fence exits before accessing its parameters.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-10-18 12:05:47 +03:00
Dani Liberman ea6eb91c09 habanalabs: fix race condition in multi CS completion
Race condition occurs when CS fence completes and multi CS did not
completed yet, while waiting for multi CS ends and returns indication
to user that the CS completed. Next wait for multi CS may be triggered
by previous multi CS completion without any current CS completed,
causing an error.

Example scenario :
1. User do multi CS wait for CSs 1 and 2 on master QID 0

2. CS 1 and 2 reached the "cs release" code. The thread of CS 1
   completed both the CS and multi CS handling but the completion
   thread of CS 2 completed the CS but still did not executed
   complete_multi_cs (note that in CS completion the sequence is to
   first do complete all for the CS and then another complete all to
   signal the multi_cs)

3. User received indication that CS 1 and 2 completed (since we check
   the CS fence and both indicated as completed) and immediately waits
   on CS 3 and 4, also on master QID 0.

4. Completion thread of CS2 executed complete_multi_cs before
   completion of CS 3 and 4 and so will trigger the multi CS wait of
   CSs 3 and 4 as they wait on master QID 0.

This will trigger multi CS completion although none of its
current CS has been completed.

Fixed by adding multi CS complete handling indication for each CS.
CS will be marked to the user as completed only if its fence completed
and multi CS handling is done.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-10-18 12:05:47 +03:00
Oded Gabbay 1d16a46b1a habanalabs: use only u32
In the kernel it is common to use u32 and not uint32_t.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-10-18 12:05:47 +03:00
Oded Gabbay efc6b04b86 habanalabs: update firmware files
Update the firmware headers to the latest version

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-10-18 12:05:47 +03:00
Bharat Jauhari 10cab81d1c habanalabs: bypass reset for continuous h/w error event
There may be a situation where drivers receives continuous fatal H/W
error events from FW immediately post reset cycle.
This may be due to some fault on the silicon itself.
In such case its better to bypass reset cycle so we won't be stuck in
endless loop of resets.

This commit bypasses reset request in case driver received two back to
back FW fatal error before first occurrence of heartbeat event.

Signed-off-by: Bharat Jauhari <bjauhari@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-10-18 12:05:47 +03:00
Yuri Nudelman f05d17b226 habanalabs: take timestamp on wait for interrupt
Taking an accurate timestamp in a close proximity of the interrupt is
required for user side statistics management.

Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-10-18 12:05:46 +03:00
Oded Gabbay c1904127ce habanalabs: prevent race between fd close/open
The driver allows only a single process to open a device's FD at any
single time. This is done by checking "hdev->compute_ctx" under mutex.

Therefore, to prevent a race between the moment a user closes it's FD
and when another user tries to open the device, we need to make sure
that clearing this variable is the very last thing that is done in the
code of the FD's release.

I'm moving the idle check before clearing this variable and the
"reset on device release". btw, if the reset happens it will prevent
any other user from opening the device until the reset is finished.

An important thing to note is that we need to remove the user process
that is closing the device from the process list BEFORE calling the
reset function. That is to prevent a case where the reset code will
try to kill that user process and it is unnecessary as the process
doesn't hold any device/driver resources anymore.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-10-18 12:05:46 +03:00
Oded Gabbay 1282dbbd29 habanalabs: refactor reset log message
Reset to the device is not necessarily due to an error, so print it
as info instead of error.

In addition, print the type of reset we are doing:
- reset of the entire device (aka hard reset)
- reset of the device after user have released it (less than hard reset)
- lighter reset of an inference device (aka soft reset)

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-10-18 12:05:46 +03:00
Oded Gabbay a00f1f571e habanalabs: define soft-reset as inference op
Soft-reset is the procedure where we reset only the compute/DMA engines
of the device, without requiring the current user-space process to
release the device.

This type of reset can happen if TDR event occurred (a workload got
stuck) or by a root request through sysfs.

This is only relevant for inference ASICs, as there is no real-world
use-case to do that in training, because training runs on multiple
devices.

In addition, we also do (in certain ASICs) a reset upon device release.
That reset uses the same code as the soft-reset.

Therefore, to better differentiate between the two resets, it is better
to rename the soft-reset support as "inference soft-reset", to make
the code more self-explanatory.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-10-18 12:05:46 +03:00
Yuri Nudelman dd08335fb9 habanalabs: fix debugfs device memory MMU VA translation
The translation in debugfs of device memory MMU virtual addresses was
wrong as it did not take into consideration the fact that the page
sizes there can be not power of 2.

Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-10-18 12:05:46 +03:00
Ofir Bitton d62b9a6976 habanalabs: add support for a long interrupt target value
In order to avoid user target value wraparound, we modify the
current interface so user will be able to wait for an 8-byte
target value rather than a 4-byte value.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-10-18 12:05:46 +03:00
Ofir Bitton 027d53b03c habanalabs: remove redundant cs validity checks
During TDR handling, we check multiple times if CS is valid.
No need to perform this check as CS must be valid at all time
during the TDR handling.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-10-18 12:05:46 +03:00
Rajaravi Krishna Katta 2b28485d0a habanalabs: enable power info via HWMON framework
Add support to retrieve following power info via HWMON:
- instantaneous power value
- highest value since last reset
- reset the highest place holder

Signed-off-by: Rajaravi Krishna Katta <rkatta@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-10-18 12:05:46 +03:00