OpenCloudOS-Kernel/drivers/dma/idxd/idxd.h

613 lines
15 KiB
C
Raw Normal View History

/* SPDX-License-Identifier: GPL-2.0 */
/* Copyright(c) 2019 Intel Corporation. All rights rsvd. */
#ifndef _IDXD_H_
#define _IDXD_H_
#include <linux/sbitmap.h>
#include <linux/dmaengine.h>
#include <linux/percpu-rwsem.h>
#include <linux/wait.h>
#include <linux/cdev.h>
#include <linux/idr.h>
dmaengine: idxd: Add IDXD performance monitor support Implement the IDXD performance monitor capability (named 'perfmon' in the DSA (Data Streaming Accelerator) spec [1]), which supports the collection of information about key events occurring during DSA and IAX (Intel Analytics Accelerator) device execution, to assist in performance tuning and debugging. The idxd perfmon support is implemented as part of the IDXD driver and interfaces with the Linux perf framework. It has several features in common with the existing uncore pmu support: - it does not support sampling - does not support per-thread counting However it also has some unique features not present in the core and uncore support: - all general-purpose counters are identical, thus no event constraints - operation is always system-wide While the core perf subsystem assumes that all counters are by default per-cpu, the uncore pmus are socket-scoped and use a cpu mask to restrict counting to one cpu from each socket. IDXD counters use a similar strategy but expand the scope even further; since IDXD counters are system-wide and can be read from any cpu, the IDXD perf driver picks a single cpu to do the work (with cpu hotplug notifiers to choose a different cpu if the chosen one is taken off-line). More specifically, the perf userspace tool by default opens a counter for each cpu for an event. However, if it finds a cpumask file associated with the pmu under sysfs, as is the case with the uncore pmus, it will open counters only on the cpus specified by the cpumask. Since perfmon only needs to open a single counter per event for a given IDXD device, the perfmon driver will create a sysfs cpumask file for the device and insert the first cpu of the system into it. When a user uses perf to open an event, perf will open a single counter on the cpu specified by the cpu mask. This amounts to the default system-wide rather than per-cpu counting mentioned previously for perfmon pmu events. In order to keep the cpu mask up-to-date, the driver implements cpu hotplug support for multiple devices, as IDXD usually enumerates and registers more than one idxd device. The perfmon driver implements basic perfmon hardware capability discovery and configuration, and is initialized by the IDXD driver's probe function. During initialization, the driver retrieves the total number of supported performance counters, the pmu ID, and the device type from idxd device, and registers itself under the Linux perf framework. The perf userspace tool can be used to monitor single or multiple events depending on the given configuration, as well as event groups, which are also supported by the perfmon driver. The user configures events using the perf tool command-line interface by specifying the event and corresponding event category, along with an optional set of filters that can be used to restrict counting to specific work queues, traffic classes, page and transfer sizes, and engines (See [1] for specifics). With the configuration specified by the user, the perf tool issues a system call passing that information to the kernel, which uses it to initialize the specified event(s). The event(s) are opened and started, and following termination of the perf command, they're stopped. At that point, the perfmon driver will read the latest count for the event(s), calculate the difference between the latest counter values and previously tracked counter values, and display the final incremental count as the event count for the cycle. An overflow handler registered on the IDXD irq path is used to account for counter overflows, which are signaled by an overflow interrupt. Below are a couple of examples of perf usage for monitoring DSA events. The following monitors all events in the 'engine' category. Becuuse no filters are specified, this captures all engine events for the workload, which in this case is 19 iterations of the work generated by the kernel dmatest module. Details describing the events can be found in Appendix D of [1], Performance Monitoring Events, but briefly they are: event 0x1: total input data processed, in 32-byte units event 0x2: total data written, in 32-byte units event 0x4: number of work descriptors that read the source event 0x8: number of work descriptors that write the destination event 0x10: number of work descriptors dispatched from batch descriptors event 0x20: number of work descriptors dispatched from work queues # perf stat -e dsa0/event=0x1,event_category=0x1/, dsa0/event=0x2,event_category=0x1/, dsa0/event=0x4,event_category=0x1/, dsa0/event=0x8,event_category=0x1/, dsa0/event=0x10,event_category=0x1/, dsa0/event=0x20,event_category=0x1/ modprobe dmatest channel=dma0chan0 timeout=2000 iterations=19 run=1 wait=1 Performance counter stats for 'system wide': 5,332 dsa0/event=0x1,event_category=0x1/ 5,327 dsa0/event=0x2,event_category=0x1/ 19 dsa0/event=0x4,event_category=0x1/ 19 dsa0/event=0x8,event_category=0x1/ 0 dsa0/event=0x10,event_category=0x1/ 19 dsa0/event=0x20,event_category=0x1/ 21.977436186 seconds time elapsed The command below illustrates filter usage with a simple example. It specifies that MEM_MOVE operations should be counted for the DSA device dsa0 (event 0x8 corresponds to the EV_MEM_MOVE event - Number of Memory Move Descriptors, which is part of event category 0x3 - Operations. The detailed category and event IDs are available in Appendix D, Performance Monitoring Events, of [1]). In addition to the event and event category, a number of filters are also specified (the detailed filter values are available in Chapter 6.4 (Filter Support) of [1]), which will restrict counting to only those events that meet all of the filter criteria. In this case, the filters specify that only MEM_MOVE operations that are serviced by work queue wq0 and specifically engine number engine0 and traffic class tc0 having sizes between 0 and 4k and page size of between 0 and 1G result in a counter hit; anything else will be filtered out and not appear in the final count. Note that filters are optional - any filter not specified is assumed to be all ones and will pass anything. # perf stat -e dsa0/filter_wq=0x1,filter_tc=0x1,filter_sz=0x7, filter_eng=0x1,event=0x8,event_category=0x3/ modprobe dmatest channel=dma0chan0 timeout=2000 iterations=19 run=1 wait=1 Performance counter stats for 'system wide': 19 dsa0/filter_wq=0x1,filter_tc=0x1,filter_sz=0x7, filter_eng=0x1,event=0x8,event_category=0x3/ 21.865914091 seconds time elapsed The output above reflects that the unspecified workload resulted in the counting of 19 MEM_MOVE operation events that met the filter criteria. [1]: https://software.intel.com/content/www/us/en/develop/download/intel-data-streaming-accelerator-preliminary-architecture-specification.html [ Based on work originally by Jing Lin. ] Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Link: https://lore.kernel.org/r/0c5080a7d541904c4ad42b848c76a1ce056ddac7.1619276133.git.zanussi@kernel.org Signed-off-by: Vinod Koul <vkoul@kernel.org>
2021-04-24 23:04:15 +08:00
#include <linux/pci.h>
#include <linux/perf_event.h>
#include <uapi/linux/idxd.h>
#include "registers.h"
#define IDXD_DRIVER_VERSION "1.00"
extern struct kmem_cache *idxd_desc_pool;
extern bool tc_override;
struct idxd_wq;
struct idxd_dev;
enum idxd_dev_type {
IDXD_DEV_NONE = -1,
IDXD_DEV_DSA = 0,
IDXD_DEV_IAX,
IDXD_DEV_WQ,
IDXD_DEV_GROUP,
IDXD_DEV_ENGINE,
IDXD_DEV_CDEV,
IDXD_DEV_MAX_TYPE,
};
struct idxd_dev {
struct device conf_dev;
enum idxd_dev_type type;
};
#define IDXD_REG_TIMEOUT 50
#define IDXD_DRAIN_TIMEOUT 5000
enum idxd_type {
IDXD_TYPE_UNKNOWN = -1,
IDXD_TYPE_DSA = 0,
IDXD_TYPE_IAX,
IDXD_TYPE_MAX,
};
#define IDXD_NAME_SIZE 128
dmaengine: idxd: Add IDXD performance monitor support Implement the IDXD performance monitor capability (named 'perfmon' in the DSA (Data Streaming Accelerator) spec [1]), which supports the collection of information about key events occurring during DSA and IAX (Intel Analytics Accelerator) device execution, to assist in performance tuning and debugging. The idxd perfmon support is implemented as part of the IDXD driver and interfaces with the Linux perf framework. It has several features in common with the existing uncore pmu support: - it does not support sampling - does not support per-thread counting However it also has some unique features not present in the core and uncore support: - all general-purpose counters are identical, thus no event constraints - operation is always system-wide While the core perf subsystem assumes that all counters are by default per-cpu, the uncore pmus are socket-scoped and use a cpu mask to restrict counting to one cpu from each socket. IDXD counters use a similar strategy but expand the scope even further; since IDXD counters are system-wide and can be read from any cpu, the IDXD perf driver picks a single cpu to do the work (with cpu hotplug notifiers to choose a different cpu if the chosen one is taken off-line). More specifically, the perf userspace tool by default opens a counter for each cpu for an event. However, if it finds a cpumask file associated with the pmu under sysfs, as is the case with the uncore pmus, it will open counters only on the cpus specified by the cpumask. Since perfmon only needs to open a single counter per event for a given IDXD device, the perfmon driver will create a sysfs cpumask file for the device and insert the first cpu of the system into it. When a user uses perf to open an event, perf will open a single counter on the cpu specified by the cpu mask. This amounts to the default system-wide rather than per-cpu counting mentioned previously for perfmon pmu events. In order to keep the cpu mask up-to-date, the driver implements cpu hotplug support for multiple devices, as IDXD usually enumerates and registers more than one idxd device. The perfmon driver implements basic perfmon hardware capability discovery and configuration, and is initialized by the IDXD driver's probe function. During initialization, the driver retrieves the total number of supported performance counters, the pmu ID, and the device type from idxd device, and registers itself under the Linux perf framework. The perf userspace tool can be used to monitor single or multiple events depending on the given configuration, as well as event groups, which are also supported by the perfmon driver. The user configures events using the perf tool command-line interface by specifying the event and corresponding event category, along with an optional set of filters that can be used to restrict counting to specific work queues, traffic classes, page and transfer sizes, and engines (See [1] for specifics). With the configuration specified by the user, the perf tool issues a system call passing that information to the kernel, which uses it to initialize the specified event(s). The event(s) are opened and started, and following termination of the perf command, they're stopped. At that point, the perfmon driver will read the latest count for the event(s), calculate the difference between the latest counter values and previously tracked counter values, and display the final incremental count as the event count for the cycle. An overflow handler registered on the IDXD irq path is used to account for counter overflows, which are signaled by an overflow interrupt. Below are a couple of examples of perf usage for monitoring DSA events. The following monitors all events in the 'engine' category. Becuuse no filters are specified, this captures all engine events for the workload, which in this case is 19 iterations of the work generated by the kernel dmatest module. Details describing the events can be found in Appendix D of [1], Performance Monitoring Events, but briefly they are: event 0x1: total input data processed, in 32-byte units event 0x2: total data written, in 32-byte units event 0x4: number of work descriptors that read the source event 0x8: number of work descriptors that write the destination event 0x10: number of work descriptors dispatched from batch descriptors event 0x20: number of work descriptors dispatched from work queues # perf stat -e dsa0/event=0x1,event_category=0x1/, dsa0/event=0x2,event_category=0x1/, dsa0/event=0x4,event_category=0x1/, dsa0/event=0x8,event_category=0x1/, dsa0/event=0x10,event_category=0x1/, dsa0/event=0x20,event_category=0x1/ modprobe dmatest channel=dma0chan0 timeout=2000 iterations=19 run=1 wait=1 Performance counter stats for 'system wide': 5,332 dsa0/event=0x1,event_category=0x1/ 5,327 dsa0/event=0x2,event_category=0x1/ 19 dsa0/event=0x4,event_category=0x1/ 19 dsa0/event=0x8,event_category=0x1/ 0 dsa0/event=0x10,event_category=0x1/ 19 dsa0/event=0x20,event_category=0x1/ 21.977436186 seconds time elapsed The command below illustrates filter usage with a simple example. It specifies that MEM_MOVE operations should be counted for the DSA device dsa0 (event 0x8 corresponds to the EV_MEM_MOVE event - Number of Memory Move Descriptors, which is part of event category 0x3 - Operations. The detailed category and event IDs are available in Appendix D, Performance Monitoring Events, of [1]). In addition to the event and event category, a number of filters are also specified (the detailed filter values are available in Chapter 6.4 (Filter Support) of [1]), which will restrict counting to only those events that meet all of the filter criteria. In this case, the filters specify that only MEM_MOVE operations that are serviced by work queue wq0 and specifically engine number engine0 and traffic class tc0 having sizes between 0 and 4k and page size of between 0 and 1G result in a counter hit; anything else will be filtered out and not appear in the final count. Note that filters are optional - any filter not specified is assumed to be all ones and will pass anything. # perf stat -e dsa0/filter_wq=0x1,filter_tc=0x1,filter_sz=0x7, filter_eng=0x1,event=0x8,event_category=0x3/ modprobe dmatest channel=dma0chan0 timeout=2000 iterations=19 run=1 wait=1 Performance counter stats for 'system wide': 19 dsa0/filter_wq=0x1,filter_tc=0x1,filter_sz=0x7, filter_eng=0x1,event=0x8,event_category=0x3/ 21.865914091 seconds time elapsed The output above reflects that the unspecified workload resulted in the counting of 19 MEM_MOVE operation events that met the filter criteria. [1]: https://software.intel.com/content/www/us/en/develop/download/intel-data-streaming-accelerator-preliminary-architecture-specification.html [ Based on work originally by Jing Lin. ] Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Link: https://lore.kernel.org/r/0c5080a7d541904c4ad42b848c76a1ce056ddac7.1619276133.git.zanussi@kernel.org Signed-off-by: Vinod Koul <vkoul@kernel.org>
2021-04-24 23:04:15 +08:00
#define IDXD_PMU_EVENT_MAX 64
struct idxd_device_driver {
const char *name;
enum idxd_dev_type *type;
int (*probe)(struct idxd_dev *idxd_dev);
void (*remove)(struct idxd_dev *idxd_dev);
struct device_driver drv;
};
extern struct idxd_device_driver dsa_drv;
extern struct idxd_device_driver idxd_drv;
dmaengine: idxd: create dmaengine driver for wq 'device' The original architecture of /sys/bus/dsa invented a scheme whereby a single entry in the list of bus drivers, /sys/bus/drivers/dsa, handled all device types and internally routed them to different drivers. Those internal drivers were invisible to userspace. Now, as /sys/bus/dsa wants to grow support for alternate drivers for a given device, for example vfio-mdev instead of kernel-internal-dmaengine, a proper bus device-driver model is needed. The first step in that process is separating the existing omnibus/implicit "dsa" driver into proper individual drivers registered on /sys/bus/dsa. Establish the idxd_dmaengine_drv driver that controls the enabling and disabling of the wq and also register and unregister the dma channel. idxd_wq_alloc_resources() and idxd_wq_free_resources() also get moved to the dmaengine driver. The resources (dma descriptors allocation and setup) are only used by the dmaengine driver and should only happen when it loads. The char dev driver (cdev) related bits are left in the __drv_enable_wq() and __drv_disable_wq() calls to be moved when we split out the char dev driver just like how the dmaengine driver is split out. WQ autoload support is not expected currently. With the amount of configuration needed for the device, the wq is always expected to be enabled by a tool (or via sysfs) rather than auto enabled at driver load. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Link: https://lore.kernel.org/r/162637467033.744545.12330636655625405394.stgit@djiang5-desk3.ch.intel.com Signed-off-by: Vinod Koul <vkoul@kernel.org>
2021-07-16 02:44:30 +08:00
extern struct idxd_device_driver idxd_dmaengine_drv;
extern struct idxd_device_driver idxd_user_drv;
struct idxd_irq_entry {
struct idxd_device *idxd;
int id;
int vector;
struct llist_head pending_llist;
struct list_head work_list;
/*
* Lock to protect access between irq thread process descriptor
* and irq thread processing error descriptor.
*/
spinlock_t list_lock;
};
struct idxd_group {
struct idxd_dev idxd_dev;
struct idxd_device *idxd;
struct grpcfg grpcfg;
int id;
int num_engines;
int num_wqs;
bool use_token_limit;
u8 tokens_allowed;
u8 tokens_reserved;
int tc_a;
int tc_b;
};
dmaengine: idxd: Add IDXD performance monitor support Implement the IDXD performance monitor capability (named 'perfmon' in the DSA (Data Streaming Accelerator) spec [1]), which supports the collection of information about key events occurring during DSA and IAX (Intel Analytics Accelerator) device execution, to assist in performance tuning and debugging. The idxd perfmon support is implemented as part of the IDXD driver and interfaces with the Linux perf framework. It has several features in common with the existing uncore pmu support: - it does not support sampling - does not support per-thread counting However it also has some unique features not present in the core and uncore support: - all general-purpose counters are identical, thus no event constraints - operation is always system-wide While the core perf subsystem assumes that all counters are by default per-cpu, the uncore pmus are socket-scoped and use a cpu mask to restrict counting to one cpu from each socket. IDXD counters use a similar strategy but expand the scope even further; since IDXD counters are system-wide and can be read from any cpu, the IDXD perf driver picks a single cpu to do the work (with cpu hotplug notifiers to choose a different cpu if the chosen one is taken off-line). More specifically, the perf userspace tool by default opens a counter for each cpu for an event. However, if it finds a cpumask file associated with the pmu under sysfs, as is the case with the uncore pmus, it will open counters only on the cpus specified by the cpumask. Since perfmon only needs to open a single counter per event for a given IDXD device, the perfmon driver will create a sysfs cpumask file for the device and insert the first cpu of the system into it. When a user uses perf to open an event, perf will open a single counter on the cpu specified by the cpu mask. This amounts to the default system-wide rather than per-cpu counting mentioned previously for perfmon pmu events. In order to keep the cpu mask up-to-date, the driver implements cpu hotplug support for multiple devices, as IDXD usually enumerates and registers more than one idxd device. The perfmon driver implements basic perfmon hardware capability discovery and configuration, and is initialized by the IDXD driver's probe function. During initialization, the driver retrieves the total number of supported performance counters, the pmu ID, and the device type from idxd device, and registers itself under the Linux perf framework. The perf userspace tool can be used to monitor single or multiple events depending on the given configuration, as well as event groups, which are also supported by the perfmon driver. The user configures events using the perf tool command-line interface by specifying the event and corresponding event category, along with an optional set of filters that can be used to restrict counting to specific work queues, traffic classes, page and transfer sizes, and engines (See [1] for specifics). With the configuration specified by the user, the perf tool issues a system call passing that information to the kernel, which uses it to initialize the specified event(s). The event(s) are opened and started, and following termination of the perf command, they're stopped. At that point, the perfmon driver will read the latest count for the event(s), calculate the difference between the latest counter values and previously tracked counter values, and display the final incremental count as the event count for the cycle. An overflow handler registered on the IDXD irq path is used to account for counter overflows, which are signaled by an overflow interrupt. Below are a couple of examples of perf usage for monitoring DSA events. The following monitors all events in the 'engine' category. Becuuse no filters are specified, this captures all engine events for the workload, which in this case is 19 iterations of the work generated by the kernel dmatest module. Details describing the events can be found in Appendix D of [1], Performance Monitoring Events, but briefly they are: event 0x1: total input data processed, in 32-byte units event 0x2: total data written, in 32-byte units event 0x4: number of work descriptors that read the source event 0x8: number of work descriptors that write the destination event 0x10: number of work descriptors dispatched from batch descriptors event 0x20: number of work descriptors dispatched from work queues # perf stat -e dsa0/event=0x1,event_category=0x1/, dsa0/event=0x2,event_category=0x1/, dsa0/event=0x4,event_category=0x1/, dsa0/event=0x8,event_category=0x1/, dsa0/event=0x10,event_category=0x1/, dsa0/event=0x20,event_category=0x1/ modprobe dmatest channel=dma0chan0 timeout=2000 iterations=19 run=1 wait=1 Performance counter stats for 'system wide': 5,332 dsa0/event=0x1,event_category=0x1/ 5,327 dsa0/event=0x2,event_category=0x1/ 19 dsa0/event=0x4,event_category=0x1/ 19 dsa0/event=0x8,event_category=0x1/ 0 dsa0/event=0x10,event_category=0x1/ 19 dsa0/event=0x20,event_category=0x1/ 21.977436186 seconds time elapsed The command below illustrates filter usage with a simple example. It specifies that MEM_MOVE operations should be counted for the DSA device dsa0 (event 0x8 corresponds to the EV_MEM_MOVE event - Number of Memory Move Descriptors, which is part of event category 0x3 - Operations. The detailed category and event IDs are available in Appendix D, Performance Monitoring Events, of [1]). In addition to the event and event category, a number of filters are also specified (the detailed filter values are available in Chapter 6.4 (Filter Support) of [1]), which will restrict counting to only those events that meet all of the filter criteria. In this case, the filters specify that only MEM_MOVE operations that are serviced by work queue wq0 and specifically engine number engine0 and traffic class tc0 having sizes between 0 and 4k and page size of between 0 and 1G result in a counter hit; anything else will be filtered out and not appear in the final count. Note that filters are optional - any filter not specified is assumed to be all ones and will pass anything. # perf stat -e dsa0/filter_wq=0x1,filter_tc=0x1,filter_sz=0x7, filter_eng=0x1,event=0x8,event_category=0x3/ modprobe dmatest channel=dma0chan0 timeout=2000 iterations=19 run=1 wait=1 Performance counter stats for 'system wide': 19 dsa0/filter_wq=0x1,filter_tc=0x1,filter_sz=0x7, filter_eng=0x1,event=0x8,event_category=0x3/ 21.865914091 seconds time elapsed The output above reflects that the unspecified workload resulted in the counting of 19 MEM_MOVE operation events that met the filter criteria. [1]: https://software.intel.com/content/www/us/en/develop/download/intel-data-streaming-accelerator-preliminary-architecture-specification.html [ Based on work originally by Jing Lin. ] Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Link: https://lore.kernel.org/r/0c5080a7d541904c4ad42b848c76a1ce056ddac7.1619276133.git.zanussi@kernel.org Signed-off-by: Vinod Koul <vkoul@kernel.org>
2021-04-24 23:04:15 +08:00
struct idxd_pmu {
struct idxd_device *idxd;
struct perf_event *event_list[IDXD_PMU_EVENT_MAX];
int n_events;
DECLARE_BITMAP(used_mask, IDXD_PMU_EVENT_MAX);
struct pmu pmu;
char name[IDXD_NAME_SIZE];
int cpu;
int n_counters;
int counter_width;
int n_event_categories;
bool per_counter_caps_supported;
unsigned long supported_event_categories;
unsigned long supported_filters;
int n_filters;
struct hlist_node cpuhp_node;
};
#define IDXD_MAX_PRIORITY 0xf
enum idxd_wq_state {
IDXD_WQ_DISABLED = 0,
IDXD_WQ_ENABLED,
};
enum idxd_wq_flag {
WQ_FLAG_DEDICATED = 0,
WQ_FLAG_BLOCK_ON_FAULT,
};
enum idxd_wq_type {
IDXD_WQT_NONE = 0,
IDXD_WQT_KERNEL,
IDXD_WQT_USER,
};
struct idxd_cdev {
struct idxd_wq *wq;
struct cdev cdev;
struct idxd_dev idxd_dev;
int minor;
};
#define IDXD_ALLOCATED_BATCH_SIZE 128U
#define WQ_NAME_SIZE 1024
#define WQ_TYPE_SIZE 10
enum idxd_op_type {
IDXD_OP_BLOCK = 0,
IDXD_OP_NONBLOCK = 1,
};
enum idxd_complete_type {
IDXD_COMPLETE_NORMAL = 0,
IDXD_COMPLETE_ABORT,
IDXD_COMPLETE_DEV_FAIL,
};
struct idxd_dma_chan {
struct dma_chan chan;
struct idxd_wq *wq;
};
struct idxd_wq {
void __iomem *portal;
u32 portal_offset;
struct percpu_ref wq_active;
struct completion wq_dead;
struct idxd_dev idxd_dev;
struct idxd_cdev *idxd_cdev;
struct wait_queue_head err_queue;
struct idxd_device *idxd;
int id;
enum idxd_wq_type type;
struct idxd_group *group;
int client_count;
struct mutex wq_lock; /* mutex for workqueue */
u32 size;
u32 threshold;
u32 priority;
enum idxd_wq_state state;
unsigned long flags;
union wqcfg *wqcfg;
struct dsa_hw_desc **hw_descs;
int num_descs;
union {
struct dsa_completion_record *compls;
struct iax_completion_record *iax_compls;
};
dma_addr_t compls_addr;
int compls_size;
struct idxd_desc **descs;
struct sbitmap_queue sbq;
struct idxd_dma_chan *idxd_chan;
char name[WQ_NAME_SIZE + 1];
u64 max_xfer_bytes;
u32 max_batch_size;
bool ats_dis;
};
struct idxd_engine {
struct idxd_dev idxd_dev;
int id;
struct idxd_group *group;
struct idxd_device *idxd;
};
/* shadow registers */
struct idxd_hw {
u32 version;
union gen_cap_reg gen_cap;
union wq_cap_reg wq_cap;
union group_cap_reg group_cap;
union engine_cap_reg engine_cap;
struct opcap opcap;
u32 cmd_cap;
};
enum idxd_device_state {
IDXD_DEV_HALTED = -1,
IDXD_DEV_DISABLED = 0,
IDXD_DEV_ENABLED,
};
enum idxd_device_flag {
IDXD_FLAG_CONFIGURABLE = 0,
IDXD_FLAG_CMD_RUNNING,
IDXD_FLAG_PASID_ENABLED,
};
struct idxd_dma_dev {
struct idxd_device *idxd;
struct dma_device dma;
};
struct idxd_driver_data {
const char *name_prefix;
enum idxd_type type;
struct device_type *dev_type;
int compl_size;
int align;
};
struct idxd_device {
struct idxd_dev idxd_dev;
struct idxd_driver_data *data;
struct list_head list;
struct idxd_hw hw;
enum idxd_device_state state;
unsigned long flags;
int id;
int major;
u32 cmd_status;
struct pci_dev *pdev;
void __iomem *reg_base;
spinlock_t dev_lock; /* spinlock for device */
spinlock_t cmd_lock; /* spinlock for device commands */
struct completion *cmd_done;
struct idxd_group **groups;
struct idxd_wq **wqs;
struct idxd_engine **engines;
struct iommu_sva *sva;
unsigned int pasid;
int num_groups;
u32 msix_perm_offset;
u32 wqcfg_offset;
u32 grpcfg_offset;
u32 perfmon_offset;
u64 max_xfer_bytes;
u32 max_batch_size;
int max_groups;
int max_engines;
int max_tokens;
int max_wqs;
int max_wq_size;
int token_limit;
int nr_tokens; /* non-reserved tokens */
unsigned int wqcfg_size;
union sw_err_reg sw_err;
wait_queue_head_t cmd_waitq;
int num_wq_irqs;
struct idxd_irq_entry *irq_entries;
struct idxd_dma_dev *idxd_dma;
struct workqueue_struct *wq;
struct work_struct work;
int *int_handles;
dmaengine: idxd: Add IDXD performance monitor support Implement the IDXD performance monitor capability (named 'perfmon' in the DSA (Data Streaming Accelerator) spec [1]), which supports the collection of information about key events occurring during DSA and IAX (Intel Analytics Accelerator) device execution, to assist in performance tuning and debugging. The idxd perfmon support is implemented as part of the IDXD driver and interfaces with the Linux perf framework. It has several features in common with the existing uncore pmu support: - it does not support sampling - does not support per-thread counting However it also has some unique features not present in the core and uncore support: - all general-purpose counters are identical, thus no event constraints - operation is always system-wide While the core perf subsystem assumes that all counters are by default per-cpu, the uncore pmus are socket-scoped and use a cpu mask to restrict counting to one cpu from each socket. IDXD counters use a similar strategy but expand the scope even further; since IDXD counters are system-wide and can be read from any cpu, the IDXD perf driver picks a single cpu to do the work (with cpu hotplug notifiers to choose a different cpu if the chosen one is taken off-line). More specifically, the perf userspace tool by default opens a counter for each cpu for an event. However, if it finds a cpumask file associated with the pmu under sysfs, as is the case with the uncore pmus, it will open counters only on the cpus specified by the cpumask. Since perfmon only needs to open a single counter per event for a given IDXD device, the perfmon driver will create a sysfs cpumask file for the device and insert the first cpu of the system into it. When a user uses perf to open an event, perf will open a single counter on the cpu specified by the cpu mask. This amounts to the default system-wide rather than per-cpu counting mentioned previously for perfmon pmu events. In order to keep the cpu mask up-to-date, the driver implements cpu hotplug support for multiple devices, as IDXD usually enumerates and registers more than one idxd device. The perfmon driver implements basic perfmon hardware capability discovery and configuration, and is initialized by the IDXD driver's probe function. During initialization, the driver retrieves the total number of supported performance counters, the pmu ID, and the device type from idxd device, and registers itself under the Linux perf framework. The perf userspace tool can be used to monitor single or multiple events depending on the given configuration, as well as event groups, which are also supported by the perfmon driver. The user configures events using the perf tool command-line interface by specifying the event and corresponding event category, along with an optional set of filters that can be used to restrict counting to specific work queues, traffic classes, page and transfer sizes, and engines (See [1] for specifics). With the configuration specified by the user, the perf tool issues a system call passing that information to the kernel, which uses it to initialize the specified event(s). The event(s) are opened and started, and following termination of the perf command, they're stopped. At that point, the perfmon driver will read the latest count for the event(s), calculate the difference between the latest counter values and previously tracked counter values, and display the final incremental count as the event count for the cycle. An overflow handler registered on the IDXD irq path is used to account for counter overflows, which are signaled by an overflow interrupt. Below are a couple of examples of perf usage for monitoring DSA events. The following monitors all events in the 'engine' category. Becuuse no filters are specified, this captures all engine events for the workload, which in this case is 19 iterations of the work generated by the kernel dmatest module. Details describing the events can be found in Appendix D of [1], Performance Monitoring Events, but briefly they are: event 0x1: total input data processed, in 32-byte units event 0x2: total data written, in 32-byte units event 0x4: number of work descriptors that read the source event 0x8: number of work descriptors that write the destination event 0x10: number of work descriptors dispatched from batch descriptors event 0x20: number of work descriptors dispatched from work queues # perf stat -e dsa0/event=0x1,event_category=0x1/, dsa0/event=0x2,event_category=0x1/, dsa0/event=0x4,event_category=0x1/, dsa0/event=0x8,event_category=0x1/, dsa0/event=0x10,event_category=0x1/, dsa0/event=0x20,event_category=0x1/ modprobe dmatest channel=dma0chan0 timeout=2000 iterations=19 run=1 wait=1 Performance counter stats for 'system wide': 5,332 dsa0/event=0x1,event_category=0x1/ 5,327 dsa0/event=0x2,event_category=0x1/ 19 dsa0/event=0x4,event_category=0x1/ 19 dsa0/event=0x8,event_category=0x1/ 0 dsa0/event=0x10,event_category=0x1/ 19 dsa0/event=0x20,event_category=0x1/ 21.977436186 seconds time elapsed The command below illustrates filter usage with a simple example. It specifies that MEM_MOVE operations should be counted for the DSA device dsa0 (event 0x8 corresponds to the EV_MEM_MOVE event - Number of Memory Move Descriptors, which is part of event category 0x3 - Operations. The detailed category and event IDs are available in Appendix D, Performance Monitoring Events, of [1]). In addition to the event and event category, a number of filters are also specified (the detailed filter values are available in Chapter 6.4 (Filter Support) of [1]), which will restrict counting to only those events that meet all of the filter criteria. In this case, the filters specify that only MEM_MOVE operations that are serviced by work queue wq0 and specifically engine number engine0 and traffic class tc0 having sizes between 0 and 4k and page size of between 0 and 1G result in a counter hit; anything else will be filtered out and not appear in the final count. Note that filters are optional - any filter not specified is assumed to be all ones and will pass anything. # perf stat -e dsa0/filter_wq=0x1,filter_tc=0x1,filter_sz=0x7, filter_eng=0x1,event=0x8,event_category=0x3/ modprobe dmatest channel=dma0chan0 timeout=2000 iterations=19 run=1 wait=1 Performance counter stats for 'system wide': 19 dsa0/filter_wq=0x1,filter_tc=0x1,filter_sz=0x7, filter_eng=0x1,event=0x8,event_category=0x3/ 21.865914091 seconds time elapsed The output above reflects that the unspecified workload resulted in the counting of 19 MEM_MOVE operation events that met the filter criteria. [1]: https://software.intel.com/content/www/us/en/develop/download/intel-data-streaming-accelerator-preliminary-architecture-specification.html [ Based on work originally by Jing Lin. ] Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Link: https://lore.kernel.org/r/0c5080a7d541904c4ad42b848c76a1ce056ddac7.1619276133.git.zanussi@kernel.org Signed-off-by: Vinod Koul <vkoul@kernel.org>
2021-04-24 23:04:15 +08:00
struct idxd_pmu *idxd_pmu;
};
/* IDXD software descriptor */
struct idxd_desc {
union {
struct dsa_hw_desc *hw;
struct iax_hw_desc *iax_hw;
};
dma_addr_t desc_dma;
union {
struct dsa_completion_record *completion;
struct iax_completion_record *iax_completion;
};
dma_addr_t compl_dma;
struct dma_async_tx_descriptor txd;
struct llist_node llnode;
struct list_head list;
int id;
int cpu;
struct idxd_wq *wq;
};
dmaengine: idxd: fix submission race window Konstantin observed that when descriptors are submitted, the descriptor is added to the pending list after the submission. This creates a race window with the slight possibility that the descriptor can complete before it gets added to the pending list and this window would cause the completion handler to miss processing the descriptor. To address the issue, the addition of the descriptor to the pending list must be done before it gets submitted to the hardware. However, submitting to swq with ENQCMDS instruction can cause a failure with the condition of either wq is full or wq is not "active". With the descriptor allocation being the gate to the wq capacity, it is not possible to hit a retry with ENQCMDS submission to the swq. The only possible failure can happen is when wq is no longer "active" due to hw error and therefore we are moving towards taking down the portal. Given this is a rare condition and there's no longer concern over I/O performance, the driver can walk the completion lists in order to retrieve and abort the descriptor. The error path will set the descriptor to aborted status. It will take the work list lock to prevent further processing of worklist. It will do a delete_all on the pending llist to retrieve all descriptors on the pending llist. The delete_all action does not require a lock. It will walk through the acquired llist to find the aborted descriptor while add all remaining descriptors to the work list since it holds the lock. If it does not find the aborted descriptor on the llist, it will walk through the work list. And if it still does not find the descriptor, then it means the interrupt handler has removed the desc from the llist but is pending on the work list lock and will process it once the error path releases the lock. Fixes: eb15e7154fbf ("dmaengine: idxd: add interrupt handle request and release support") Reported-by: Konstantin Ananyev <konstantin.ananyev@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Link: https://lore.kernel.org/r/162628855747.360485.10101925573082466530.stgit@djiang5-desk3.ch.intel.com Signed-off-by: Vinod Koul <vkoul@kernel.org>
2021-07-15 02:50:06 +08:00
/*
* This is software defined error for the completion status. We overload the error code
* that will never appear in completion status and only SWERR register.
*/
enum idxd_completion_status {
IDXD_COMP_DESC_ABORT = 0xff,
};
#define idxd_confdev(idxd) &idxd->idxd_dev.conf_dev
#define wq_confdev(wq) &wq->idxd_dev.conf_dev
#define engine_confdev(engine) &engine->idxd_dev.conf_dev
#define group_confdev(group) &group->idxd_dev.conf_dev
#define cdev_dev(cdev) &cdev->idxd_dev.conf_dev
#define confdev_to_idxd_dev(dev) container_of(dev, struct idxd_dev, conf_dev)
#define idxd_dev_to_idxd(idxd_dev) container_of(idxd_dev, struct idxd_device, idxd_dev)
#define idxd_dev_to_wq(idxd_dev) container_of(idxd_dev, struct idxd_wq, idxd_dev)
static inline struct idxd_device *confdev_to_idxd(struct device *dev)
{
struct idxd_dev *idxd_dev = confdev_to_idxd_dev(dev);
return idxd_dev_to_idxd(idxd_dev);
}
static inline struct idxd_wq *confdev_to_wq(struct device *dev)
{
struct idxd_dev *idxd_dev = confdev_to_idxd_dev(dev);
return idxd_dev_to_wq(idxd_dev);
}
static inline struct idxd_engine *confdev_to_engine(struct device *dev)
{
struct idxd_dev *idxd_dev = confdev_to_idxd_dev(dev);
return container_of(idxd_dev, struct idxd_engine, idxd_dev);
}
static inline struct idxd_group *confdev_to_group(struct device *dev)
{
struct idxd_dev *idxd_dev = confdev_to_idxd_dev(dev);
return container_of(idxd_dev, struct idxd_group, idxd_dev);
}
static inline struct idxd_cdev *dev_to_cdev(struct device *dev)
{
struct idxd_dev *idxd_dev = confdev_to_idxd_dev(dev);
return container_of(idxd_dev, struct idxd_cdev, idxd_dev);
}
static inline void idxd_dev_set_type(struct idxd_dev *idev, int type)
{
if (type >= IDXD_DEV_MAX_TYPE) {
idev->type = IDXD_DEV_NONE;
return;
}
idev->type = type;
}
extern struct bus_type dsa_bus_type;
extern bool support_enqcmd;
extern struct ida idxd_ida;
extern struct device_type dsa_device_type;
extern struct device_type iax_device_type;
extern struct device_type idxd_wq_device_type;
extern struct device_type idxd_engine_device_type;
extern struct device_type idxd_group_device_type;
static inline bool is_dsa_dev(struct idxd_dev *idxd_dev)
{
return idxd_dev->type == IDXD_DEV_DSA;
}
static inline bool is_iax_dev(struct idxd_dev *idxd_dev)
{
return idxd_dev->type == IDXD_DEV_IAX;
}
static inline bool is_idxd_dev(struct idxd_dev *idxd_dev)
{
return is_dsa_dev(idxd_dev) || is_iax_dev(idxd_dev);
}
static inline bool is_idxd_wq_dev(struct idxd_dev *idxd_dev)
{
return idxd_dev->type == IDXD_DEV_WQ;
}
static inline bool is_idxd_wq_dmaengine(struct idxd_wq *wq)
{
if (wq->type == IDXD_WQT_KERNEL && strcmp(wq->name, "dmaengine") == 0)
return true;
return false;
}
static inline bool is_idxd_wq_user(struct idxd_wq *wq)
{
return wq->type == IDXD_WQT_USER;
}
static inline bool is_idxd_wq_kernel(struct idxd_wq *wq)
{
return wq->type == IDXD_WQT_KERNEL;
}
static inline bool wq_dedicated(struct idxd_wq *wq)
{
return test_bit(WQ_FLAG_DEDICATED, &wq->flags);
}
static inline bool wq_shared(struct idxd_wq *wq)
{
return !test_bit(WQ_FLAG_DEDICATED, &wq->flags);
}
static inline bool device_pasid_enabled(struct idxd_device *idxd)
{
return test_bit(IDXD_FLAG_PASID_ENABLED, &idxd->flags);
}
static inline bool device_swq_supported(struct idxd_device *idxd)
{
return (support_enqcmd && device_pasid_enabled(idxd));
}
enum idxd_portal_prot {
IDXD_PORTAL_UNLIMITED = 0,
IDXD_PORTAL_LIMITED,
};
enum idxd_interrupt_type {
IDXD_IRQ_MSIX = 0,
IDXD_IRQ_IMS,
};
static inline int idxd_get_wq_portal_offset(enum idxd_portal_prot prot)
{
return prot * 0x1000;
}
static inline int idxd_get_wq_portal_full_offset(int wq_id,
enum idxd_portal_prot prot)
{
return ((wq_id * 4) << PAGE_SHIFT) + idxd_get_wq_portal_offset(prot);
}
#define IDXD_PORTAL_MASK (PAGE_SIZE - 1)
/*
* Even though this function can be accessed by multiple threads, it is safe to use.
* At worst the address gets used more than once before it gets incremented. We don't
* hit a threshold until iops becomes many million times a second. So the occasional
* reuse of the same address is tolerable compare to using an atomic variable. This is
* safe on a system that has atomic load/store for 32bit integers. Given that this is an
* Intel iEP device, that should not be a problem.
*/
static inline void __iomem *idxd_wq_portal_addr(struct idxd_wq *wq)
{
int ofs = wq->portal_offset;
wq->portal_offset = (ofs + sizeof(struct dsa_raw_desc)) & IDXD_PORTAL_MASK;
return wq->portal + ofs;
}
static inline void idxd_wq_get(struct idxd_wq *wq)
{
wq->client_count++;
}
static inline void idxd_wq_put(struct idxd_wq *wq)
{
wq->client_count--;
}
static inline int idxd_wq_refcount(struct idxd_wq *wq)
{
return wq->client_count;
};
int __must_check __idxd_driver_register(struct idxd_device_driver *idxd_drv,
struct module *module, const char *mod_name);
#define idxd_driver_register(driver) \
__idxd_driver_register(driver, THIS_MODULE, KBUILD_MODNAME)
void idxd_driver_unregister(struct idxd_device_driver *idxd_drv);
#define module_idxd_driver(__idxd_driver) \
module_driver(__idxd_driver, idxd_driver_register, idxd_driver_unregister)
int idxd_register_bus_type(void);
void idxd_unregister_bus_type(void);
int idxd_register_devices(struct idxd_device *idxd);
void idxd_unregister_devices(struct idxd_device *idxd);
int idxd_register_driver(void);
void idxd_unregister_driver(void);
void idxd_wqs_quiesce(struct idxd_device *idxd);
/* device interrupt control */
void idxd_msix_perm_setup(struct idxd_device *idxd);
void idxd_msix_perm_clear(struct idxd_device *idxd);
irqreturn_t idxd_misc_thread(int vec, void *data);
irqreturn_t idxd_wq_thread(int irq, void *data);
void idxd_mask_error_interrupts(struct idxd_device *idxd);
void idxd_unmask_error_interrupts(struct idxd_device *idxd);
void idxd_mask_msix_vectors(struct idxd_device *idxd);
void idxd_mask_msix_vector(struct idxd_device *idxd, int vec_id);
void idxd_unmask_msix_vector(struct idxd_device *idxd, int vec_id);
/* device control */
int idxd_register_idxd_drv(void);
void idxd_unregister_idxd_drv(void);
int idxd_device_drv_probe(struct idxd_dev *idxd_dev);
void idxd_device_drv_remove(struct idxd_dev *idxd_dev);
int drv_enable_wq(struct idxd_wq *wq);
dmaengine: idxd: create dmaengine driver for wq 'device' The original architecture of /sys/bus/dsa invented a scheme whereby a single entry in the list of bus drivers, /sys/bus/drivers/dsa, handled all device types and internally routed them to different drivers. Those internal drivers were invisible to userspace. Now, as /sys/bus/dsa wants to grow support for alternate drivers for a given device, for example vfio-mdev instead of kernel-internal-dmaengine, a proper bus device-driver model is needed. The first step in that process is separating the existing omnibus/implicit "dsa" driver into proper individual drivers registered on /sys/bus/dsa. Establish the idxd_dmaengine_drv driver that controls the enabling and disabling of the wq and also register and unregister the dma channel. idxd_wq_alloc_resources() and idxd_wq_free_resources() also get moved to the dmaengine driver. The resources (dma descriptors allocation and setup) are only used by the dmaengine driver and should only happen when it loads. The char dev driver (cdev) related bits are left in the __drv_enable_wq() and __drv_disable_wq() calls to be moved when we split out the char dev driver just like how the dmaengine driver is split out. WQ autoload support is not expected currently. With the amount of configuration needed for the device, the wq is always expected to be enabled by a tool (or via sysfs) rather than auto enabled at driver load. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Link: https://lore.kernel.org/r/162637467033.744545.12330636655625405394.stgit@djiang5-desk3.ch.intel.com Signed-off-by: Vinod Koul <vkoul@kernel.org>
2021-07-16 02:44:30 +08:00
int __drv_enable_wq(struct idxd_wq *wq);
void drv_disable_wq(struct idxd_wq *wq);
dmaengine: idxd: create dmaengine driver for wq 'device' The original architecture of /sys/bus/dsa invented a scheme whereby a single entry in the list of bus drivers, /sys/bus/drivers/dsa, handled all device types and internally routed them to different drivers. Those internal drivers were invisible to userspace. Now, as /sys/bus/dsa wants to grow support for alternate drivers for a given device, for example vfio-mdev instead of kernel-internal-dmaengine, a proper bus device-driver model is needed. The first step in that process is separating the existing omnibus/implicit "dsa" driver into proper individual drivers registered on /sys/bus/dsa. Establish the idxd_dmaengine_drv driver that controls the enabling and disabling of the wq and also register and unregister the dma channel. idxd_wq_alloc_resources() and idxd_wq_free_resources() also get moved to the dmaengine driver. The resources (dma descriptors allocation and setup) are only used by the dmaengine driver and should only happen when it loads. The char dev driver (cdev) related bits are left in the __drv_enable_wq() and __drv_disable_wq() calls to be moved when we split out the char dev driver just like how the dmaengine driver is split out. WQ autoload support is not expected currently. With the amount of configuration needed for the device, the wq is always expected to be enabled by a tool (or via sysfs) rather than auto enabled at driver load. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Link: https://lore.kernel.org/r/162637467033.744545.12330636655625405394.stgit@djiang5-desk3.ch.intel.com Signed-off-by: Vinod Koul <vkoul@kernel.org>
2021-07-16 02:44:30 +08:00
void __drv_disable_wq(struct idxd_wq *wq);
int idxd_device_init_reset(struct idxd_device *idxd);
int idxd_device_enable(struct idxd_device *idxd);
int idxd_device_disable(struct idxd_device *idxd);
void idxd_device_reset(struct idxd_device *idxd);
void idxd_device_clear_state(struct idxd_device *idxd);
int idxd_device_config(struct idxd_device *idxd);
void idxd_device_drain_pasid(struct idxd_device *idxd, int pasid);
int idxd_device_load_config(struct idxd_device *idxd);
int idxd_device_request_int_handle(struct idxd_device *idxd, int idx, int *handle,
enum idxd_interrupt_type irq_type);
int idxd_device_release_int_handle(struct idxd_device *idxd, int handle,
enum idxd_interrupt_type irq_type);
/* work queue control */
void idxd_wqs_unmap_portal(struct idxd_device *idxd);
int idxd_wq_alloc_resources(struct idxd_wq *wq);
void idxd_wq_free_resources(struct idxd_wq *wq);
int idxd_wq_enable(struct idxd_wq *wq);
int idxd_wq_disable(struct idxd_wq *wq, bool reset_config);
void idxd_wq_drain(struct idxd_wq *wq);
void idxd_wq_reset(struct idxd_wq *wq);
int idxd_wq_map_portal(struct idxd_wq *wq);
void idxd_wq_unmap_portal(struct idxd_wq *wq);
int idxd_wq_set_pasid(struct idxd_wq *wq, int pasid);
int idxd_wq_disable_pasid(struct idxd_wq *wq);
void idxd_wq_quiesce(struct idxd_wq *wq);
int idxd_wq_init_percpu_ref(struct idxd_wq *wq);
/* submission */
int idxd_submit_desc(struct idxd_wq *wq, struct idxd_desc *desc);
struct idxd_desc *idxd_alloc_desc(struct idxd_wq *wq, enum idxd_op_type optype);
void idxd_free_desc(struct idxd_wq *wq, struct idxd_desc *desc);
/* dmaengine */
int idxd_register_dma_device(struct idxd_device *idxd);
void idxd_unregister_dma_device(struct idxd_device *idxd);
int idxd_register_dma_channel(struct idxd_wq *wq);
void idxd_unregister_dma_channel(struct idxd_wq *wq);
void idxd_parse_completion_status(u8 status, enum dmaengine_tx_result *res);
void idxd_dma_complete_txd(struct idxd_desc *desc,
enum idxd_complete_type comp_type);
/* cdev */
int idxd_cdev_register(void);
void idxd_cdev_remove(void);
int idxd_cdev_get_major(struct idxd_device *idxd);
int idxd_wq_add_cdev(struct idxd_wq *wq);
void idxd_wq_del_cdev(struct idxd_wq *wq);
dmaengine: idxd: Add IDXD performance monitor support Implement the IDXD performance monitor capability (named 'perfmon' in the DSA (Data Streaming Accelerator) spec [1]), which supports the collection of information about key events occurring during DSA and IAX (Intel Analytics Accelerator) device execution, to assist in performance tuning and debugging. The idxd perfmon support is implemented as part of the IDXD driver and interfaces with the Linux perf framework. It has several features in common with the existing uncore pmu support: - it does not support sampling - does not support per-thread counting However it also has some unique features not present in the core and uncore support: - all general-purpose counters are identical, thus no event constraints - operation is always system-wide While the core perf subsystem assumes that all counters are by default per-cpu, the uncore pmus are socket-scoped and use a cpu mask to restrict counting to one cpu from each socket. IDXD counters use a similar strategy but expand the scope even further; since IDXD counters are system-wide and can be read from any cpu, the IDXD perf driver picks a single cpu to do the work (with cpu hotplug notifiers to choose a different cpu if the chosen one is taken off-line). More specifically, the perf userspace tool by default opens a counter for each cpu for an event. However, if it finds a cpumask file associated with the pmu under sysfs, as is the case with the uncore pmus, it will open counters only on the cpus specified by the cpumask. Since perfmon only needs to open a single counter per event for a given IDXD device, the perfmon driver will create a sysfs cpumask file for the device and insert the first cpu of the system into it. When a user uses perf to open an event, perf will open a single counter on the cpu specified by the cpu mask. This amounts to the default system-wide rather than per-cpu counting mentioned previously for perfmon pmu events. In order to keep the cpu mask up-to-date, the driver implements cpu hotplug support for multiple devices, as IDXD usually enumerates and registers more than one idxd device. The perfmon driver implements basic perfmon hardware capability discovery and configuration, and is initialized by the IDXD driver's probe function. During initialization, the driver retrieves the total number of supported performance counters, the pmu ID, and the device type from idxd device, and registers itself under the Linux perf framework. The perf userspace tool can be used to monitor single or multiple events depending on the given configuration, as well as event groups, which are also supported by the perfmon driver. The user configures events using the perf tool command-line interface by specifying the event and corresponding event category, along with an optional set of filters that can be used to restrict counting to specific work queues, traffic classes, page and transfer sizes, and engines (See [1] for specifics). With the configuration specified by the user, the perf tool issues a system call passing that information to the kernel, which uses it to initialize the specified event(s). The event(s) are opened and started, and following termination of the perf command, they're stopped. At that point, the perfmon driver will read the latest count for the event(s), calculate the difference between the latest counter values and previously tracked counter values, and display the final incremental count as the event count for the cycle. An overflow handler registered on the IDXD irq path is used to account for counter overflows, which are signaled by an overflow interrupt. Below are a couple of examples of perf usage for monitoring DSA events. The following monitors all events in the 'engine' category. Becuuse no filters are specified, this captures all engine events for the workload, which in this case is 19 iterations of the work generated by the kernel dmatest module. Details describing the events can be found in Appendix D of [1], Performance Monitoring Events, but briefly they are: event 0x1: total input data processed, in 32-byte units event 0x2: total data written, in 32-byte units event 0x4: number of work descriptors that read the source event 0x8: number of work descriptors that write the destination event 0x10: number of work descriptors dispatched from batch descriptors event 0x20: number of work descriptors dispatched from work queues # perf stat -e dsa0/event=0x1,event_category=0x1/, dsa0/event=0x2,event_category=0x1/, dsa0/event=0x4,event_category=0x1/, dsa0/event=0x8,event_category=0x1/, dsa0/event=0x10,event_category=0x1/, dsa0/event=0x20,event_category=0x1/ modprobe dmatest channel=dma0chan0 timeout=2000 iterations=19 run=1 wait=1 Performance counter stats for 'system wide': 5,332 dsa0/event=0x1,event_category=0x1/ 5,327 dsa0/event=0x2,event_category=0x1/ 19 dsa0/event=0x4,event_category=0x1/ 19 dsa0/event=0x8,event_category=0x1/ 0 dsa0/event=0x10,event_category=0x1/ 19 dsa0/event=0x20,event_category=0x1/ 21.977436186 seconds time elapsed The command below illustrates filter usage with a simple example. It specifies that MEM_MOVE operations should be counted for the DSA device dsa0 (event 0x8 corresponds to the EV_MEM_MOVE event - Number of Memory Move Descriptors, which is part of event category 0x3 - Operations. The detailed category and event IDs are available in Appendix D, Performance Monitoring Events, of [1]). In addition to the event and event category, a number of filters are also specified (the detailed filter values are available in Chapter 6.4 (Filter Support) of [1]), which will restrict counting to only those events that meet all of the filter criteria. In this case, the filters specify that only MEM_MOVE operations that are serviced by work queue wq0 and specifically engine number engine0 and traffic class tc0 having sizes between 0 and 4k and page size of between 0 and 1G result in a counter hit; anything else will be filtered out and not appear in the final count. Note that filters are optional - any filter not specified is assumed to be all ones and will pass anything. # perf stat -e dsa0/filter_wq=0x1,filter_tc=0x1,filter_sz=0x7, filter_eng=0x1,event=0x8,event_category=0x3/ modprobe dmatest channel=dma0chan0 timeout=2000 iterations=19 run=1 wait=1 Performance counter stats for 'system wide': 19 dsa0/filter_wq=0x1,filter_tc=0x1,filter_sz=0x7, filter_eng=0x1,event=0x8,event_category=0x3/ 21.865914091 seconds time elapsed The output above reflects that the unspecified workload resulted in the counting of 19 MEM_MOVE operation events that met the filter criteria. [1]: https://software.intel.com/content/www/us/en/develop/download/intel-data-streaming-accelerator-preliminary-architecture-specification.html [ Based on work originally by Jing Lin. ] Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Link: https://lore.kernel.org/r/0c5080a7d541904c4ad42b848c76a1ce056ddac7.1619276133.git.zanussi@kernel.org Signed-off-by: Vinod Koul <vkoul@kernel.org>
2021-04-24 23:04:15 +08:00
/* perfmon */
#if IS_ENABLED(CONFIG_INTEL_IDXD_PERFMON)
int perfmon_pmu_init(struct idxd_device *idxd);
void perfmon_pmu_remove(struct idxd_device *idxd);
void perfmon_counter_overflow(struct idxd_device *idxd);
void perfmon_init(void);
void perfmon_exit(void);
#else
static inline int perfmon_pmu_init(struct idxd_device *idxd) { return 0; }
static inline void perfmon_pmu_remove(struct idxd_device *idxd) {}
static inline void perfmon_counter_overflow(struct idxd_device *idxd) {}
static inline void perfmon_init(void) {}
static inline void perfmon_exit(void) {}
#endif
dmaengine: idxd: fix submission race window Konstantin observed that when descriptors are submitted, the descriptor is added to the pending list after the submission. This creates a race window with the slight possibility that the descriptor can complete before it gets added to the pending list and this window would cause the completion handler to miss processing the descriptor. To address the issue, the addition of the descriptor to the pending list must be done before it gets submitted to the hardware. However, submitting to swq with ENQCMDS instruction can cause a failure with the condition of either wq is full or wq is not "active". With the descriptor allocation being the gate to the wq capacity, it is not possible to hit a retry with ENQCMDS submission to the swq. The only possible failure can happen is when wq is no longer "active" due to hw error and therefore we are moving towards taking down the portal. Given this is a rare condition and there's no longer concern over I/O performance, the driver can walk the completion lists in order to retrieve and abort the descriptor. The error path will set the descriptor to aborted status. It will take the work list lock to prevent further processing of worklist. It will do a delete_all on the pending llist to retrieve all descriptors on the pending llist. The delete_all action does not require a lock. It will walk through the acquired llist to find the aborted descriptor while add all remaining descriptors to the work list since it holds the lock. If it does not find the aborted descriptor on the llist, it will walk through the work list. And if it still does not find the descriptor, then it means the interrupt handler has removed the desc from the llist but is pending on the work list lock and will process it once the error path releases the lock. Fixes: eb15e7154fbf ("dmaengine: idxd: add interrupt handle request and release support") Reported-by: Konstantin Ananyev <konstantin.ananyev@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Link: https://lore.kernel.org/r/162628855747.360485.10101925573082466530.stgit@djiang5-desk3.ch.intel.com Signed-off-by: Vinod Koul <vkoul@kernel.org>
2021-07-15 02:50:06 +08:00
static inline void complete_desc(struct idxd_desc *desc, enum idxd_complete_type reason)
{
idxd_dma_complete_txd(desc, reason);
idxd_free_desc(desc->wq, desc);
}
#endif