Commit Graph

119 Commits

Author SHA1 Message Date
Omer Shpigelman 3e08f157c2 habanalabs/gaudi: use direct MSI in single mode
Due to FLR scenario when running inside a VM, we must not use indirect
MSI because it might cause some issues on VM destroy.
In a VM we use single MSI mode, as opposed to the multi MSI mode that is
used on bare metal.
Hence direct MSI should be used in single MSI mode only.

Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-14 15:00:03 +03:00
Oded Gabbay c2aa713618 habanalabs: update to latest firmware headers
Add several new packets between driver and firmware.
Add matching compatibility bits for backward compatibility.
Add support for 4K event types.
Add information about pcie errors.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-01 18:38:24 +03:00
Oded Gabbay 5dc9ffaff1 habanalabs: expose server type in INFO IOCTL
Add the server type property to the hl_info_hw_ip_info structure
that is exposed to the user via the INFO IOCTL.

This is needed by the userspace s/w stack to know the connections map
of the internal links that connect the ASIC among themselves inside the
server.

The F/W will tell us, as part of the NIC information, the server type
that the GAUDI is located in.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-01 18:38:24 +03:00
Oded Gabbay 2a2c4b7403 habanalabs: update firmware header to latest version
Add two new fields regarding interrupts communication between driver
and f/w.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-08-29 09:47:47 +03:00
Yuri Nudelman 77977ac875 habanalabs/gaudi: implement state dump
At the first stage, only gaudi core dump shall be implemented, not
including the status registers.

Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-08-29 09:47:46 +03:00
Ofir Bitton c67b0579b8 habanalabs: update firmware header files
Update recent changes made in firmware header files, which contain
a minor COMMS protocol change and new error status definitions.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-08-29 09:47:44 +03:00
Ofir Bitton 6c31f494d8 habanalabs/gaudi: add support for NIC DERR
Add support for NIC DERR ECC error events. If this error is received,
a device reset is performed.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-21 10:21:28 +03:00
Ofir Bitton 7d5ba005cf habanalabs/gaudi: correct driver events numbering
Currently the driver sends the fc interrupt id to the FW instead of the
cpu interrupt id. We intend to fix that and keep backward
compatibility by using the same interrupt values.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18 15:23:42 +03:00
Ohad Sharabi e1222c2794 habanalabs: report EQ fault during heartbeat
In case we have EQ fault we would like to know about it.
For this, a status bitmask was added in which EQ_FAULT bit is
set by FW in case of EQ fault.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18 15:23:41 +03:00
Ofir Bitton 254fac6d1a habanalabs/gaudi: add FW alive event support
In order for driver to be aware of process or thread crashes inside
GAUDI's CPU, we introduce a new event which contains all relevant
information. Upon event reception, driver will dump information and
will reset the device.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18 15:23:41 +03:00
Oded Gabbay 5a967fb3a7 habanalabs/gaudi: update to latest f/w specs
Update the firmware interface files to their latest version.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18 15:23:41 +03:00
Ofir Bitton 5bc691d849 habanalabs/gaudi: split host irq interfaces towards FW
The current implementation uses a single interrupt interface towards
the FW, which causes races between interrupt types.
We split this interface into one interface per interrupt type.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18 15:23:41 +03:00
Tomer Tayar ae151bcfab habanalabs/gaudi: add ARB to QM stop on error masks
Update the QM stop on error masks to also stop on ARB errors.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18 15:23:40 +03:00
Oded Gabbay 1242e9f0f4 habanalabs: check running index in eqe control
To harden the event queue mechanism, we add a running index to the
control header of the entry.

The firmware writes the index in each entry and the driver verifies
that the index of the current entry is larger by 1 than the index of
the previous entry.

In case it isn't, the driver will treat the entry as if it wasn't
valid (it won't process it but won't skip it).

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18 15:23:40 +03:00
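
The running-index check described in this commit can be sketched roughly as follows. This is a minimal illustration, not the driver's code: the field width (`EQ_CTL_INDEX_MASK`), the `eq_consumer` struct, and both function names are assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: width of the running index field in the entry's control header. */
#define EQ_CTL_INDEX_MASK 0xFFu

/* Driver-side consumer state (illustrative, not the driver's real struct). */
struct eq_consumer {
	uint32_t prev_index; /* index seen in the last accepted entry */
	uint32_t ci;         /* consumer index into the queue */
};

/* An entry is in sequence only if its index is exactly prev + 1,
 * modulo the (assumed) field width.
 */
static bool eq_index_in_sequence(uint32_t prev, uint32_t cur)
{
	return cur == ((prev + 1u) & EQ_CTL_INDEX_MASK);
}

/*
 * Returns true and advances when the entry is accepted.
 * On a mismatch the entry is treated as not valid: it is not processed,
 * and the consumer index stays put, so the entry is not skipped.
 */
static bool eq_try_consume(struct eq_consumer *eq, uint32_t entry_index)
{
	if (!eq_index_in_sequence(eq->prev_index, entry_index))
		return false;

	eq->prev_index = entry_index;
	eq->ci++;
	return true;
}
```

Leaving `ci` untouched on a mismatch mirrors the "won't process it but won't skip it" behavior in the message: the same slot is examined again on the next pass.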
Koby Elbaz e591a49cb5 habanalabs/gaudi: read GIC sts after FW is loaded
Reading of GIC privileged status will be done after F/W is loaded,
because privileged GIC capability is only available with the correct
ARMCP version, and after it's loaded.
Such versions necessarily support COMMS, so GIC alternatives (SP regs)
will be read directly from dynamic regs.

As well, initiation of DMA QMANs will occur after F/W is loaded
since it depends on GIC configuration.

In case F/W isn't loaded there's no problem since either way
there won't be any GIC IRQ handling.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18 15:23:40 +03:00
Koby Elbaz 3e0ca9fab1 habanalabs/gaudi: send hard reset cause to preboot
LKD should provide hard reset cause to preboot prior to
loading any FW components (in case needed).
Current implementation is based on the new FW 'COMMS' protocol.
In case 'COMMS' is disabled, the reset cause won't be sent.
Currently, only 2 reset causes are shared: HEARTBEAT & TDR.

Sending the reset cause will provide the missing watchdog
info that the firmware needs to provide to the BMC.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18 15:23:40 +03:00
Koby Elbaz 8121736bbf habanalabs/gaudi: use scratchpad regs instead of GIC controller
Due to new security restrictions, GIC controller can no
longer be accessed from user/kernel.
To monitor that, a new status bit will be read from preboot
caps, indicating whether direct access to GIC is blocked.

In case it is blocked, the driver will use scratchpad registers
instead of the GIC interface in two main scenarios:
the first is when LKD triggers interrupts to F/W through GIC,
and the second is when LKD configures all engines/QMANs
to write to GIC when they want to report an error.

On the F/W side, it will poll all SPs; once an IRQ number is
retrieved, the SP register is cleared and the F/W performs the
write to the GIC to trigger the IRQ handler.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18 15:23:39 +03:00
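
The scratchpad hand-off described in this commit might look something like the sketch below. It only simulates the flow in plain memory: the register count, the "0 means empty" convention, and every identifier here are assumptions for illustration, not the firmware's or the driver's actual interface.

```c
#include <stdint.h>

#define SP_COUNT 4   /* hypothetical number of scratchpad (SP) registers */
#define SP_EMPTY 0u  /* hypothetical: 0 means no pending request */

/* Stand-ins for hardware: the SP registers and the GIC trigger register. */
static volatile uint32_t sp_regs[SP_COUNT];
static volatile uint32_t gic_trigger;

/* Driver side: post an IRQ number to an SP register instead of the GIC. */
static void driver_post_irq(unsigned int sp, uint32_t irq_num)
{
	sp_regs[sp] = irq_num;
}

/* Firmware side: one polling pass over all SPs.
 * Returns how many IRQs were forwarded to the GIC.
 */
static unsigned int fw_poll_scratchpads(void)
{
	unsigned int forwarded = 0;

	for (unsigned int i = 0; i < SP_COUNT; i++) {
		uint32_t irq_num = sp_regs[i];

		if (irq_num == SP_EMPTY)
			continue;

		sp_regs[i] = SP_EMPTY;  /* clear the SP once the number is retrieved */
		gic_trigger = irq_num;  /* then write to the GIC to trigger the handler */
		forwarded++;
	}

	return forwarded;
}
```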
Oded Gabbay 90bd4798a8 habanalabs: update to latest f/w headers
Update the common and GAUDI firmware header files to the latest version.

The latest version uses the correct endianness types so this commit also
contains minor changes to the code to use the correct conversions when
reading/writing to the firmware structures.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18 15:23:39 +03:00
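
As a rough illustration of the conversions this commit refers to: in kernel code, fields declared with types such as `__le32` are read and written through helpers like `le32_to_cpu()` and `cpu_to_le32()`. The sketch below is a userspace stand-in that assumes the firmware structures are little-endian (the commit message does not say which byte order is used); the function names are made up for the example.

```c
#include <stdint.h>

/* Decode a 32-bit little-endian value from a byte buffer.
 * Works the same on little-endian and big-endian hosts.
 */
static uint32_t le32_decode(const uint8_t b[4])
{
	return (uint32_t)b[0] |
	       ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) |
	       ((uint32_t)b[3] << 24);
}

/* Encode a host-order 32-bit value into little-endian bytes. */
static void le32_encode(uint8_t b[4], uint32_t v)
{
	b[0] = (uint8_t)(v & 0xFFu);
	b[1] = (uint8_t)((v >> 8) & 0xFFu);
	b[2] = (uint8_t)((v >> 16) & 0xFFu);
	b[3] = (uint8_t)((v >> 24) & 0xFFu);
}
```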
Oded Gabbay 3b39840083 habanalabs: update firmware files to latest
Update the firmware files to the latest from the firmware team.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-06-18 15:23:38 +03:00
Ohad Sharabi 669b018835 habanalabs: update to latest F/W communication header
Update files to latest version from F/W team.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-04-09 14:09:25 +03:00
Ohad Sharabi e9c2003be4 habanalabs: send dynamic msi-x indexes to f/w
In order to minimize hard coded values between F/W and the driver, we
send msi-x indexes dynamically to the F/W.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-04-09 14:09:24 +03:00
Koby Elbaz 7d21114b03 habanalabs: support DEVICE_UNUSABLE error indication from FW
In case of multiple ECC errors, FW will set the DEVICE_UNUSABLE bit.
On boot-up, the driver will therefore fail inserting the device.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-04-09 14:09:24 +03:00
Ohad Sharabi e8f9392a5c habanalabs: support legacy and new pll indexes
In order to minimize the hard coded values common to LKD and F/W,
a dynamic method to work with PLLs is introduced in this patch.
The PLL numbering, formerly asic specific, is now common for all asics.
To be backward compatible, a bit in dev status is defined; if the bit is
not set, LKD will keep working with the old PLL numbering.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-04-09 14:09:24 +03:00
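
The backward-compatibility rule in this commit can be pictured along these lines. Everything concrete here is hypothetical: the status bit position, the enum of common PLL indexes, and the legacy mapping table are placeholders chosen only to show the selection logic.

```c
#include <stdint.h>

/* Hypothetical bit in dev status: set when F/W supports common PLL numbering. */
#define DEV_STS_DYN_PLL_EN (1u << 5)

/* Hypothetical common (ASIC-independent) PLL indexes. */
enum common_pll {
	COMMON_PLL_CPU,
	COMMON_PLL_PCI,
	COMMON_PLL_MAX
};

/* Hypothetical legacy (ASIC-specific) numbering for the same PLLs. */
static const uint32_t legacy_pll_map[COMMON_PLL_MAX] = { 3, 7 };

/* Pick the index to send to F/W: the common index when the status bit is
 * set, otherwise fall back to the old ASIC-specific numbering.
 */
static uint32_t pll_index_for_fw(uint32_t dev_status, enum common_pll pll)
{
	if (dev_status & DEV_STS_DYN_PLL_EN)
		return (uint32_t)pll;

	return legacy_pll_map[pll];
}
```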
Ofir Bitton d661d79930 habanalabs/gaudi: Update async events header
Update with latest version from the Firmware team.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-04-09 14:09:24 +03:00
Ofir Bitton 2ea09537ad habanalabs/gaudi: reset device upon BMC request
In case the BMC of the devices' box wants to initiate a reset of
a specific device, it must go through the driver.
Once the driver receives the request, it will initiate a hard reset
flow.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-04-09 14:09:23 +03:00
Ohad Sharabi 99cb017e72 habanalabs: update hl_boot_if.h
Update to the latest version of the file as supplied by the F/W.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-04-09 14:09:23 +03:00
Ofir Bitton f209e5ad18 habanalabs/gaudi: update extended async event header
Update to the latest definition of the firmware.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-04-09 14:09:23 +03:00
Sagiv Ozeri 586f2caf0e habanalabs: return current power via INFO IOCTL
Add driver implementation for reading the current power from the device
CPU F/W.

Signed-off-by: Sagiv Ozeri <sozeri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-04-09 14:09:23 +03:00
Ohad Sharabi 5d6a198f9d habanalabs: reset device in case of sync error
As the F/W is the first to detect an out of sync event, a new event is
added to notify the driver of such an event, in which case the driver
performs a hard reset.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-04-09 14:09:22 +03:00
farah kassabri b6821b3c65 habanalabs: set max asid to 2
Currently we support only 2 asids in all asics:
asid 0 for the driver, and asid 1 for the user.
There is no need to set up 1024 asid configurations at the init phase.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-04-09 14:09:22 +03:00
Ofir Bitton 5dbd7b4de6 habanalabs: improve communication protocol with cpucp
The current messaging communication protocol with cpucp can get out
of sync due to coherency issues. In order to improve the protocol
reliability, we modify the protocol to expect a different
acknowledgment for every packet sent to cpucp.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-02-08 18:20:08 +02:00
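
One way to picture "a different acknowledgment for every packet" is a per-packet expected value derived from a sequence counter, so that a stale acknowledgment left over from an earlier packet cannot be mistaken for the current one. The sketch below is only an assumption about how such a scheme could work; `ACK_BASE`, the struct, and the function names are invented for the example and are not taken from the cpucp protocol.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical base value that acknowledgments are derived from. */
#define ACK_BASE 0x1000u

/* Sender-side state for the message channel (illustrative). */
struct msg_channel {
	uint32_t seq; /* number of packets sent so far */
};

/* Advance the sequence and return the acknowledgment value the sender
 * expects for the packet it is about to send.
 */
static uint32_t channel_next_expected_ack(struct msg_channel *ch)
{
	ch->seq++;
	return ACK_BASE + ch->seq;
}

/* A received acknowledgment only counts if it matches the value expected
 * for the current packet; a leftover value from an older packet is rejected.
 */
static bool channel_ack_matches(uint32_t expected, uint32_t received)
{
	return expected == received;
}
```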
Oded Gabbay f1aebf5e3d habanalabs: update to latest hl_boot_if.h spec from F/W
It adds the definition for indication that the F/W handles HBM ECC
events.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:51 +02:00
Oded Gabbay 7838504171 habanalabs: update SyncManager interrupt handling
The firmware provides more information about SyncManager events.
Adjust the code to the latest firmware interface file.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:51 +02:00
Ohad Sharabi 663a301d75 habanalabs: fix ETR security issue
ETR should always be non-secured as it is used by the users to record
profiling/trace data.
This patch fixes the configuration to match those requirements.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:51 +02:00
Ofir Bitton f8bc7f091c habanalabs/gaudi: print sync manager SEI interrupt info
Driver must print sync manager SEI information upon receiving
interrupt from FW.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:50 +02:00
Ofir Bitton 423815bf02 habanalabs/gaudi: remove PCI access to SM block
Due to a HW limitation we must remove all direct access to SM
registers. In order to do that we will access SM registers using
the HW QMANS.
When possible and no user context is present, we can directly access
the HW QMANS. Whenever there is an active user, driver will
prepare a pending command buffer list which will be sent upon
user submissions.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:50 +02:00
Ofir Bitton edb07cb69c habanalabs: read device boot errors after cpucp is up
Boot cpu can report errors in various boot stages.
Current implementation does not take into consideration errors
reported in late stages, hence we will check for errors at the
latest stage, when fetching cpucp information.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:50 +02:00
Oded Gabbay 3abe1040ba habanalabs: update to latest hl_boot_if.h
Update to the latest version of this file that the F/W exports.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:49 +02:00
Ofir Bitton f8b0f2ecc5 habanalabs/gaudi: remove duplicated gaudi packets masks
As all packets use the same CTL register masks, we remove duplicated
masks and use common masks instead.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:48 +02:00
Oded Gabbay 4c998836d4 habanalabs: update firmware boot interface
Update to latest firmware hl_boot_if.h file.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:48 +02:00
Oded Gabbay 90ffe170a3 habanalabs: update comment in hl_boot_if.h
Hard-reset flag is updated in many stages of the boot sequence of the
firmware.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-28 08:47:38 +02:00
Oded Gabbay 0024c09485 habanalabs/gaudi: disable CGM at HW initialization
In case the clock gating was enabled in preboot we need to disable it
at the H/W initialization stage before touching the MME/TPC registers.
Otherwise, the ASIC can get stuck. If the security is enabled in
the firmware level, the CGM is always disabled and the driver can't
enable it.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-28 08:47:38 +02:00
Ofir Bitton 9c9013cbd8 habanalabs: preboot hard reset support
FW hard reset capability indication is now moved to preboot stage.
Driver will check if HW is dirty only after it validated preboot
is up. If HW is dirty, driver will perform a hard reset according
to the FW capability.
In addition, FW defines a new message which the driver needs to send in
order to initiate a hard reset.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-28 08:47:38 +02:00
Alon Mizrahi d2bbf2ca33 habanalabs: add ull to PLL masks
These defines are 64-bit defines so they need ull suffix.

Signed-off-by: Alon Mizrahi <amizrahi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:37 +02:00
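
To illustrate why the suffix matters, here is a small sketch with a made-up 64-bit register layout; the macro and function names are hypothetical and do not come from the driver headers.

```c
#include <stdint.h>

/* Hypothetical 64-bit register layout: an 8-bit field at bits 32..39. */
#define PLL_FIELD_SHIFT 32
#define PLL_FIELD_MASK  (0xFFull << PLL_FIELD_SHIFT) /* ull makes the constant 64-bit */

/*
 * Without the suffix, 0xFF has type int. Where int is 32 bits wide,
 * shifting it left by 32 is undefined behavior in C, so a mask that
 * reaches into the upper half of a 64-bit value needs the ull suffix.
 */
static uint64_t pll_field_get(uint64_t reg)
{
	return (reg & PLL_FIELD_MASK) >> PLL_FIELD_SHIFT;
}
```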
Oded Gabbay 051504d9f6 habanalabs: update firmware files
Update various firmware header files with new defines.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:36 +02:00
Alon Mizrahi 4147864e8d habanalabs: fetch pll frequency from firmware
Once firmware security is enabled, driver must fetch pll frequencies
through the firmware message interface instead of reading the registers
directly.

Signed-off-by: Alon Mizrahi <amizrahi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:36 +02:00
Ofir Bitton 5a2998f46c habanalabs/gaudi: fetch HBM ecc info from FW
Once FW security is enabled there is no access to HBM ecc registers,
so the values need to be read from FW using a dedicated interface.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:34 +02:00
Ofir Bitton d611b9f0b1 habanalabs: fetch hard reset capability from FW
Driver must fetch FW hard reset capability during boot time,
in order to skip the hard reset flow if necessary.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:34 +02:00
Ofir Bitton 323b726706 habanalabs: fetch security indication from FW
Add support for fetching security indication from FW.
This indication is needed in order to skip unnecessary
initializations done by FW.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:31 +02:00
Oded Gabbay b3a9c0bd2f habanalabs/gaudi: add NIC firmware-related definitions
Add new structures and messages that the driver uses to interact with the
firmware to receive information and events (errors) about GAUDI's NIC.

Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:29 +02:00