OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Michal Kubecek	b73229e4de	ethtool: provide link mode information with LINKMODES_GET request Implement LINKMODES_GET netlink request to get link modes related information provided by ETHTOOL_GLINKSETTINGS and ETHTOOL_GSET ioctl commands. This request provides supported, advertised and peer advertised link modes, autonegotiation flag, speed and duplex. LINKMODES_GET request can be used with NLM_F_DUMP (without device identification) to request the information for all devices in current network namespace providing the data. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:17:09 +08:00
Michal Kubecek	98b94eccdc	ethtool: add LINKINFO_NTF notification Send ETHTOOL_MSG_LINKINFO_NTF notification message whenever device link settings are modified using ETHTOOL_MSG_LINKINFO_SET netlink message or ETHTOOL_SLINKSETTINGS or ETHTOOL_SSET ioctl commands. The notification message has the same format as reply to LINKINFO_GET request. ETHTOOL_MSG_LINKINFO_SET netlink request only triggers the notification if there is a change but the ioctl command handlers do not check if there is an actual change and trigger the notification whenever the commands are executed. As all work is done by ethnl_default_notify() handler and callback functions introduced to handle LINKINFO_GET requests, all that remains is adding entries for ETHTOOL_MSG_LINKINFO_NTF into ethnl_notify_handlers and ethnl_default_notify_ops lookup tables and calls to ethtool_notify() where needed. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:17:08 +08:00
Michal Kubecek	3389645601	ethtool: add default notification handler The ethtool netlink notifications have the same format as related GET replies so that if generic GET handling framework is used to process GET requests, its callbacks and instance of struct get_request_ops can be also used to compose corresponding notification message. Provide function ethnl_std_notify() to be used as notification handler in ethnl_notify_handlers table. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:17:08 +08:00
Michal Kubecek	b1be8ee127	ethtool: set link settings with LINKINFO_SET request Implement LINKINFO_SET netlink request to set link settings queried by LINKINFO_GET message. Only physical port, phy MDIO address and MDI(-X) control can be set, attempt to modify MDI(-X) status and transceiver is rejected. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:17:07 +08:00
Michal Kubecek	28ecca7733	ethtool: provide link settings with LINKINFO_GET request Implement LINKINFO_GET netlink request to get basic link settings provided by ETHTOOL_GLINKSETTINGS and ETHTOOL_GSET ioctl commands. This request provides settings not directly related to autonegotiation and link mode selection: physical port, phy MDIO address, MDI(-X) status, MDI(-X) control and transceiver. LINKINFO_GET request can be used with NLM_F_DUMP (without device identification) to request the information for all devices in current network namespace providing the data. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:17:07 +08:00
Michal Kubecek	95f6a95c03	ethtool: provide string sets with STRSET_GET request Requests a contents of one or more string sets, i.e. indexed arrays of strings; this information is provided by ETHTOOL_GSSET_INFO and ETHTOOL_GSTRINGS commands of ioctl interface. Unlike ioctl interface, all information can be retrieved with one request and mulitple string sets can be requested at once. There are three types of requests: - no NLM_F_DUMP, no device: get "global" stringsets - no NLM_F_DUMP, with device: get string sets related to the device - NLM_F_DUMP, no device: get device related string sets for all devices Client can request either all string sets of given type (global or device related) or only specific sets. With ETHTOOL_A_STRSET_COUNTS flag set, only set sizes (numbers of strings) are returned. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:17:07 +08:00
Michal Kubecek	ba33d06ec7	ethtool: default handlers for GET requests Significant part of GET request processing is common for most request types but unfortunately it cannot be easily separated from type specific code as we need to alternate between common actions (parsing common request header, allocating message and filling netlink/genetlink headers etc.) and specific actions (querying the device, composing the reply). The processing also happens in three different situations: "do" request, "dump" request and notification, each doing things in slightly different way. The request specific code is implemented in four or five callbacks defined in an instance of struct get_request_ops: parse_request() - parse incoming message prepare_data() - retrieve data from driver or NIC reply_size() - estimate reply message size fill_reply() - compose reply message cleanup_data() - (optional) clean up additional data Other members of struct get_request_ops describe the data structure holding information from client request and data used to compose the message. The default handlers ethnl_default_doit(), ethnl_default_dumpit(), ethnl_default_start() and ethnl_default_done() can be then used in genl_ops handler. Notification handler will be introduced in a later patch. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:17:06 +08:00
Michal Kubecek	11b92236e5	ethtool: support for netlink notifications Add infrastructure for ethtool netlink notifications. There is only one multicast group "monitor" which is used to notify userspace about changes and actions performed. Notification messages (types using suffix _NTF) share the format with replies to GET requests. Notifications are supposed to be broadcasted on every configuration change, whether it is done using the netlink interface or ioctl one. Netlink SET requests only trigger a notification if some data is actually changed. To trigger an ethtool notification, both ethtool netlink and external code use ethtool_notify() helper. This helper requires RTNL to be held and may sleep. Handlers sending messages for specific notification message types are registered in ethnl_notify_handlers array. As notifications can be triggered from other code, ethnl_ok flag is used to prevent an attempt to send notification before genetlink family is registered. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:17:06 +08:00
Michal Kubecek	0d8df13691	ethtool: netlink bitset handling The ethtool netlink code uses common framework for passing arbitrary length bit sets to allow future extensions. A bitset can be a list (only one bitmap) or can consist of value and mask pair (used e.g. when client want to modify only some bits). A bitset can use one of two formats: verbose (bit by bit) or compact. Verbose format consists of bitset size (number of bits), list flag and an array of bit nests, telling which bits are part of the list or which bits are in the mask and which of them are to be set. In requests, bits can be identified by index (position) or by name. In replies, kernel provides both index and name. Verbose format is suitable for "one shot" applications like standard ethtool command as it avoids the need to either keep bit names (e.g. link modes) in sync with kernel or having to add an extra roundtrip for string set request (e.g. for private flags). Compact format uses one (list) or two (value/mask) arrays of 32-bit words to store the bitmap(s). It is more suitable for long running applications (ethtool in monitor mode or network management daemons) which can retrieve the names once and then pass only compact bitmaps to save space. Userspace requests can use either format; ETHTOOL_FLAG_COMPACT_BITSETS flag in request header tells kernel which format to use in reply. Notifications always use compact format. As some code uses arrays of unsigned long for internal representation and some arrays of u32 (or even a single u32), two sets of parse/compose helpers are introduced. To avoid code duplication, helpers for unsigned long arrays are implemented as wrappers around helpers for u32 arrays. There are two reasons for this choice: (1) u32 arrays are more frequent in ethtool code and (2) unsigned long array can be always interpreted as an u32 array on little endian 64-bit and all 32-bit architectures while we would need special handling for odd number of u32 words in the opposite direction. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:17:06 +08:00
Michal Kubecek	e9567a771a	ethtool: helper functions for netlink interface Add common request/reply header definition and helpers to parse request header and fill reply header. Provide ethnl_update_* helpers to update structure members from request attributes (to be used for *_SET requests). Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com> add extra define for ALTIFNAMSIZ in net/ethtool/netlink.c	2024-06-12 13:17:05 +08:00
Michal Kubecek	94a483279b	ethtool: introduce ethtool netlink interface Basic genetlink and init infrastructure for the netlink interface, register genetlink family "ethtool". Add CONFIG_ETHTOOL_NETLINK Kconfig option to make the build optional. Add initial overall interface description into Documentation/networking/ethtool-netlink.rst, further patches will add more detailed information. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:17:05 +08:00
Richard Cochran	7b71c74c53	net: Introduce peer to peer one step PTP time stamping. The 1588 standard defines one step operation for both Sync and PDelay_Resp messages. Up until now, hardware with P2P one step has been rare, and kernel support was lacking. This patch adds support of the mode in anticipation of new hardware developments. Signed-off-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:17:04 +08:00
Richard Cochran	c519d55103	net: ethtool: Use the PHY time stamping interface. The ethtool layer tests fields of the phy_device in order to determine whether to invoke the PHY's tsinfo ethtool callback. This patch replaces the open coded logic with an invocation of the proper methods. Signed-off-by: Richard Cochran <richardcochran@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:17:04 +08:00
Richard Cochran	6dd1986bfb	net: phy: Introduce helper functions for time stamping support. Some parts of the networking stack and at least one driver test fields within the 'struct phy_device' in order to query time stamping capabilities and to invoke time stamping methods. This patch adds a functional interface around the time stamping fields. This will allow insulating the callers from future changes to the details of the time stamping implemenation. Signed-off-by: Richard Cochran <richardcochran@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:17:04 +08:00
Michal Kubecek	10df70c759	ethtool: provide link mode names as a string set Unlike e.g. netdev features, the ethtool ioctl interface requires link mode table to be in sync between kernel and userspace for userspace to be able to display and set all link modes supported by kernel. The way arbitrary length bitsets are implemented in netlink interface, this will be no longer needed. To allow userspace to access all link modes running kernel supports, add table of ethernet link mode names and make it available as a string set to userspace GET_STRSET requests. Add build time check to make sure names are defined for all modes declared in enum ethtool_link_mode_bit_indices. Once the string set is available, make it also accessible via ioctl. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:17:03 +08:00
Michal Kubecek	7613b224e2	ethtool: move string arrays into common file Introduce file net/ethtool/common.c for code shared by ioctl and netlink ethtool interface. Move name tables of features, RSS hash functions, tunables and PHY tunables into this file. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:17:03 +08:00
Michal Kubecek	a2c22c94f8	ethtool: move to its own directory The ethtool netlink interface is going to be split into multiple files so that it will be more convenient to put all of them in a separate directory net/ethtool. Start by moving current ethtool.c with ioctl interface into this directory and renaming it to ioctl.c. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:17:03 +08:00
Michal Kubecek	124e8de362	netlink: rename nl80211_validate_nested() to nla_validate_nested() Function nl80211_validate_nested() is not specific to nl80211, it's a counterpart to nla_validate_nested_deprecated() with strict validation. For consistency with other validation and parse functions, rename it to nla_validate_nested(). Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:17:02 +08:00
Jiri Pirko	e8aa12ed17	ethtool: Add support for 400Gbps (50Gbps per lane) link modes Add support for 400Gbps speed, link modes of 50Gbps per lane Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:17:02 +08:00
Yunsheng Lin	f0d06f9ac4	page_pool: use relaxed atomic for release side accounting There is no need to synchronize the account updating, so use the relaxed atomic to avoid some memory barrier in the data path. Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:17:01 +08:00
Yunsheng Lin	09bb2f044b	page_pool: add frag page recycling support in page pool Currently page pool only support page recycling when there is only one user of the page, and the split page reusing implemented in the most driver can not use the page pool as bing-pong way of reusing requires the multi user support in page pool. Those reusing or recycling has below limitations: 1. page from page pool can only be used be one user in order for the page recycling to happen. 2. Bing-pong way of reusing in most driver does not support multi desc using different part of the same page in order to save memory. So add multi-users support and frag page recycling in page pool to overcome the above limitation. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:17:01 +08:00
Yunsheng Lin	3f58718971	page_pool: add interface to manipulate frag count in page pool For 32 bit systems with 64 bit dma, dma_addr[1] is used to store the upper 32 bit dma addr, those system should be rare those days. For normal system, the dma_addr[1] in 'struct page' is not used, so we can reuse dma_addr[1] for storing frag count, which means how many frags this page might be splited to. In order to simplify the page frag support in the page pool, the PAGE_POOL_DMA_USE_PP_FRAG_COUNT macro is added to indicate the 32 bit systems with 64 bit dma, and the page frag support in page pool is disabled for such system. The newly added page_pool_set_frag_count() is called to reserve the maximum frag count before any page frag is passed to the user. The page_pool_atomic_sub_frag_count_return() is called when user is done with the page frag. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:17:01 +08:00
Yunsheng Lin	8c39e5e810	page_pool: keep pp info as long as page pool owns the page Currently, page->pp is cleared and set everytime the page is recycled, which is unnecessary. So only set the page->pp when the page is added to the page pool and only clear it when the page is released from the page pool. This is also a preparation to support allocating frag page in page pool. Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:17:00 +08:00
Yunsheng Lin	cf632b0d15	page_pool: mask the page->signature before the checking As mentioned in commit `c07aea3ef4` ("mm: add a signature in struct page"): "The page->signature field is aliased to page->lru.next and page->compound_head." And as the comment in page_is_pfmemalloc(): "lru.next has bit 1 set if the page is allocated from the pfmemalloc reserves. Callers may simply overwrite it if they do not need to preserve that information." The page->signature is OR’ed with PP_SIGNATURE when a page is allocated in page pool, see __page_pool_alloc_pages_slow(), and page->signature is checked directly with PP_SIGNATURE in page_pool_return_skb_page(), which might cause resoure leaking problem for a page from page pool if bit 1 of lru.next is set for a pfmemalloc page. What happens here is that the original pp->signature is OR'ed with PP_SIGNATURE after the allocation in order to preserve any existing bits(such as the bit 1, used to indicate a pfmemalloc page), so when those bits are present, those page is not considered to be from page pool and the DMA mapping of those pages will be left stale. As bit 0 is for page->compound_head, So mask both bit 0/1 before the checking in page_pool_return_skb_page(). And we will return those pfmemalloc pages back to the page allocator after cleaning up the DMA mapping. Fixes: `6a5bcd84e8` ("page_pool: Allow drivers to hint on SKB recycling") Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:17:00 +08:00
Ilias Apalodimas	157f28c9cc	skbuff: Fix a potential race while recycling page_pool packets mainline inclusion from mainline-v5.14-rc3 commit `2cc3aeb5ec` category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4CVS3 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2cc3aeb5eccc ---------------------------------------------------------------------- As Alexander points out, when we are trying to recycle a cloned/expanded SKB we might trigger a race. The recycling code relies on the pp_recycle bit to trigger, which we carry over to cloned SKBs. If that cloned SKB gets expanded or if we get references to the frags, call skb_release_data() and overwrite skb->head, we are creating separate instances accessing the same page frags. Since the skb_release_data() will first try to recycle the frags, there's a potential race between the original and cloned SKB, since both will have the pp_recycle bit set. Fix this by explicitly those SKBs not recyclable. The atomic_sub_return effectively limits us to a single release case, and when we are calling skb_release_data we are also releasing the option to perform the recycling, or releasing the pages from the page pool. Fixes: `6a5bcd84e8` ("page_pool: Allow drivers to hint on SKB recycling") Reported-by: Alexander Duyck <alexanderduyck@fb.com> Suggested-by: Alexander Duyck <alexanderduyck@fb.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net> Reviewed-by: Yongxin Li <liyongxin1@huawei.com> Signed-off-by: Junxin Chen <chenjunxin1@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:17:00 +08:00
Ilias Apalodimas	22a87d5c30	page_pool: Allow drivers to hint on SKB recycling Up to now several high speed NICs have custom mechanisms of recycling the allocated memory they use for their payloads. Our page_pool API already has recycling capabilities that are always used when we are running in 'XDP mode'. So let's tweak the API and the kernel network stack slightly and allow the recycling to happen even during the standard operation. The API doesn't take into account 'split page' policies used by those drivers currently, but can be extended once we have users for that. The idea is to be able to intercept the packet on skb_release_data(). If it's a buffer coming from our page_pool API recycle it back to the pool for further usage or just release the packet entirely. To achieve that we introduce a bit in struct sk_buff (pp_recycle:1) and a field in struct page (page->pp) to store the page_pool pointer. Storing the information in page->pp allows us to recycle both SKBs and their fragments. We could have skipped the skb bit entirely, since identical information can bederived from struct page. However, in an effort to affect the free path as less as possible, reading a single bit in the skb which is already in cache, is better that trying to derive identical information for the page stored data. The driver or page_pool has to take care of the sync operations on it's own during the buffer recycling since the buffer is, after opting-in to the recycling, never unmapped. Since the gain on the drivers depends on the architecture, we are not enabling recycling by default if the page_pool API is used on a driver. In order to enable recycling the driver must call skb_mark_for_recycle() to store the information we need for recycling in page->pp and enabling the recycling bit, or page_pool_store_mem_info() for a fragment. Co-developed-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Co-developed-by: Matteo Croce <mcroce@microsoft.com> Signed-off-by: Matteo Croce <mcroce@microsoft.com> Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com> Conflicts: include/linux/skbuff.h	2024-06-12 13:16:59 +08:00
Matteo Croce	198e415825	skbuff: add a parameter to __skb_frag_unref This is a prerequisite patch, the next one is enabling recycling of skbs and fragments. Add an extra argument on __skb_frag_unref() to handle recycling, and update the current users of the function with that. Signed-off-by: Matteo Croce <mcroce@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:16:59 +08:00
Matteo Croce	9483d0fd8c	mm: add a signature in struct page This is needed by the page_pool to avoid recycling a page not allocated via page_pool. The page->signature field is aliased to page->lru.next and page->compound_head, but it can't be set by mistake because the signature value is a bad pointer, and can't trigger a false positive in PageTail() because the last bit is 0. Co-developed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Matteo Croce <mcroce@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:16:58 +08:00
Alexander Lobakin	85268db1ff	net: page_pool: simplify page recycling condition tests pool_page_reusable() is a leftover from pre-NUMA-aware times. For now, this function is just a redundant wrapper over page_is_pfmemalloc(), so inline it into its sole call site. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:16:58 +08:00
Jonathan Lemon	f80ce27887	skbuff: Call skb_zcopy_clear() before unref'ing fragments RX zerocopy fragment pages which are not allocated from the system page pool require special handling. Give the callback in skb_zcopy_clear() a chance to process them first. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:16:58 +08:00
Lorenzo Bianconi	fa801e94e9	net: page_pool: Add bulk support for ptr_ring Introduce the capability to batch page_pool ptr_ring refill since it is usually run inside the driver NAPI tx completion loop. Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com> Co-developed-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Link: https://lore.kernel.org/bpf/08dd249c9522c001313f520796faa777c4089e1c.1605267335.git.lorenzo@kernel.org Reviewed-by: Yongxin Li <liyongxin1@huawei.com> (fix conflicts: delete modifications to net/core/xdp.c) Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:16:57 +08:00
Jesper Dangaard Brouer	c76490bffc	net: page_pool: use alloc_pages_bulk in refill code path There are cases where the page_pool need to refill with pages from the page allocator. Some workloads cause the page_pool to release pages instead of recycling these pages. For these workload it can improve performance to bulk alloc pages from the page-allocator to refill the alloc cache. For XDP-redirect workload with 100G mlx5 driver (that use page_pool) redirecting xdp_frame packets into a veth, that does XDP_PASS to create an SKB from the xdp_frame, which then cannot return the page to the page_pool. Performance results under GitHub xdp-project[1]: [1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org Mel: The patch "net: page_pool: convert to use alloc_pages_bulk_array variant" was squashed with this patch. From the test page, the array variant was superior with one of the test results as follows. Kernel XDP stats CPU pps Delta Baseline XDP-RX CPU total 3,771,046 n/a List XDP-RX CPU total 3,940,242 +4.49% Array XDP-RX CPU total 4,249,224 +12.68% Link: https://lkml.kernel.org/r/20210325114228.27719-10-mgorman@techsingularity.net Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Alexander Lobakin <alobakin@pm.me> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: David Miller <davem@davemloft.net> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:16:57 +08:00
Jesper Dangaard Brouer	2b07aa1000	net: page_pool: refactor dma_map into own function page_pool_dma_map In preparation for next patch, move the dma mapping into its own function, as this will make it easier to follow the changes. [ilias.apalodimas: make page_pool_dma_map return boolean] Link: https://lkml.kernel.org/r/20210325114228.27719-9-mgorman@techsingularity.net Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Reviewed-by: Alexander Lobakin <alobakin@pm.me> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: David Miller <davem@davemloft.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:16:56 +08:00
Jesper Dangaard Brouer	25a20ba135	mm/page_alloc: inline __rmqueue_pcplist When __alloc_pages_bulk() got introduced two callers of __rmqueue_pcplist exist and the compiler chooses to not inline this function. ./scripts/bloat-o-meter vmlinux-before vmlinux-inline__rmqueue_pcplist add/remove: 0/1 grow/shrink: 2/0 up/down: 164/-125 (39) Function old new delta rmqueue 2197 2296 +99 __alloc_pages_bulk 1921 1986 +65 __rmqueue_pcplist 125 - -125 Total: Before=19374127, After=19374166, chg +0.00% modprobe page_bench04_bulk loops=$((10**7)) Type:time_bulk_page_alloc_free_array - Per elem: 106 cycles(tsc) 29.595 ns (step:64) - (measurement period time:0.295955434 sec time_interval:295955434) - (invoke count:10000000 tsc_interval:1065447105) Before: - Per elem: 110 cycles(tsc) 30.633 ns (step:64) Link: https://lkml.kernel.org/r/20210325114228.27719-6-mgorman@techsingularity.net Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Alexander Lobakin <alobakin@pm.me> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: David Miller <davem@davemloft.net> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:16:56 +08:00
Jesper Dangaard Brouer	2fc47ed311	mm/page_alloc: optimize code layout for __alloc_pages_bulk Looking at perf-report and ASM-code for __alloc_pages_bulk() it is clear that the code activated is suboptimal. The compiler guesses wrong and places unlikely code at the beginning. Due to the use of WARN_ON_ONCE() macro the UD2 asm instruction is added to the code, which confuse the I-cache prefetcher in the CPU. [mgorman@techsingularity.net: minor changes and rebasing] Link: https://lkml.kernel.org/r/20210325114228.27719-5-mgorman@techsingularity.net Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Alexander Lobakin <alobakin@pm.me> Acked-By: Vlastimil Babka <vbabka@suse.cz> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: David Miller <davem@davemloft.net> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:16:56 +08:00
Mel Gorman	9426bf0890	mm/page_alloc: correct return value of populated elements if bulk array is populated Dave Jones reported the following This made it into 5.13 final, and completely breaks NFSD for me (Serving tcp v3 mounts). Existing mounts on clients hang, as do new mounts from new clients. Rebooting the server back to rc7 everything recovers. The commit `b3b64ebd38` ("mm/page_alloc: do bulk array bounds check after checking populated elements") returns the wrong value if the array is already populated which is interpreted as an allocation failure. Dave reported this fixes his problem and it also passed a test running dbench over NFS. Link: https://lkml.kernel.org/r/20210628150219.GC3840@techsingularity.net Fixes: `b3b64ebd38` ("mm/page_alloc: do bulk array bounds check after checking populated elements") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reported-by: Dave Jones <davej@codemonkey.org.uk> Tested-by: Dave Jones <davej@codemonkey.org.uk> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> [5.13+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:16:55 +08:00
Mel Gorman	36f7be8e44	mm/page_alloc: do bulk array bounds check after checking populated elements Dan Carpenter reported the following The patch 0f87d9d30f21: "mm/page_alloc: add an array-based interface to the bulk page allocator" from Apr 29, 2021, leads to the following static checker warning: mm/page_alloc.c:5338 __alloc_pages_bulk() warn: potentially one past the end of array 'page_array[nr_populated]' The problem can occur if an array is passed in that is fully populated. That potentially ends up allocating a single page and storing it past the end of the array. This patch returns 0 if the array is fully populated. Link: https://lkml.kernel.org/r/20210618125102.GU30378@techsingularity.net Fixes: `0f87d9d30f` ("mm/page_alloc: add an array-based interface to the bulk page allocator") Signed-off-by: Mel Gorman <mgorman@techsinguliarity.net> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:16:55 +08:00
Rasmus Villemoes	f8ecd6520c	mm/page_alloc: __alloc_pages_bulk(): do bounds check before accessing array In the event that somebody would call this with an already fully populated page_array, the last loop iteration would do an access beyond the end of page_array. It's of course extremely unlikely that would ever be done, but this triggers my internal static analyzer. Also, if it really is not supposed to be invoked this way (i.e., with no NULL entries in page_array), the nr_populated<nr_pages check could simply be removed instead. Link: https://lkml.kernel.org/r/20210507064504.1712559-1-linux@rasmusvillemoes.dk Fixes: `0f87d9d30f` ("mm/page_alloc: add an array-based interface to the bulk page allocator") Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:16:55 +08:00
Mel Gorman	cfaec7f64a	mm/page_alloc: add an array-based interface to the bulk page allocator The proposed callers for the bulk allocator store pages from the bulk allocator in an array. This patch adds an array-based interface to the API to avoid multiple list iterations. The page list interface is preserved to avoid requiring all users of the bulk API to allocate and manage enough storage to store the pages. [akpm@linux-foundation.org: remove now unused local `allocated'] Link: https://lkml.kernel.org/r/20210325114228.27719-4-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Alexander Lobakin <alobakin@pm.me> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: David Miller <davem@davemloft.net> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:16:54 +08:00
Mel Gorman	4c6160f834	mm/page_alloc: add a bulk page allocator This patch adds a new page allocator interface via alloc_pages_bulk, and __alloc_pages_bulk_nodemask. A caller requests a number of pages to be allocated and added to a list. The API is not guaranteed to return the requested number of pages and may fail if the preferred allocation zone has limited free memory, the cpuset changes during the allocation or page debugging decides to fail an allocation. It's up to the caller to request more pages in batch if necessary. Note that this implementation is not very efficient and could be improved but it would require refactoring. The intent is to make it available early to determine what semantics are required by different callers. Once the full semantics are nailed down, it can be refactored. [mgorman@techsingularity.net: fix alloc_pages_bulk() return type, per Matthew] Link: https://lkml.kernel.org/r/20210325123713.GQ3697@techsingularity.net [mgorman@techsingularity.net: fix uninit var warning] Link: https://lkml.kernel.org/r/20210330114847.GX3697@techsingularity.net [mgorman@techsingularity.net: fix comment, per Vlastimil] Link: https://lkml.kernel.org/r/20210412110255.GV3697@techsingularity.net Link: https://lkml.kernel.org/r/20210325114228.27719-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Alexander Lobakin <alobakin@pm.me> Tested-by: Colin Ian King <colin.king@canonical.com> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: David Miller <davem@davemloft.net> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:16:54 +08:00
Mel Gorman	fc5ad41524	mm/page_alloc: rename alloced to allocated Patch series "Introduce a bulk order-0 page allocator with two in-tree users", v6. This series introduces a bulk order-0 page allocator with sunrpc and the network page pool being the first users. The implementation is not efficient as semantics needed to be ironed out first. If no other semantic changes are needed, it can be made more efficient. Despite that, this is a performance-related for users that require multiple pages for an operation without multiple round-trips to the page allocator. Quoting the last patch for the high-speed networking use-case Kernel XDP stats CPU pps Delta Baseline XDP-RX CPU total 3,771,046 n/a List XDP-RX CPU total 3,940,242 +4.49% Array XDP-RX CPU total 4,249,224 +12.68% Via the SUNRPC traces of svc_alloc_arg() Single page: 25.007 us per call over 532,571 calls Bulk list: 6.258 us per call over 517,034 calls Bulk array: 4.590 us per call over 517,442 calls Both potential users in this series are corner cases (NFS and high-speed networks) so it is unlikely that most users will see any benefit in the short term. Other potential other users are batch allocations for page cache readahead, fault around and SLUB allocations when high-order pages are unavailable. It's unknown how much benefit would be seen by converting multiple page allocation calls to a single batch or what difference it may make to headline performance. Light testing of my own running dbench over NFS passed. Chuck and Jesper conducted their own tests and details are included in the changelogs. Patch 1 renames a variable name that is particularly unpopular Patch 2 adds a bulk page allocator Patch 3 adds an array-based version of the bulk allocator Patches 4-5 adds micro-optimisations to the implementation Patches 6-7 SUNRPC user Patches 8-9 Network page_pool user This patch (of 9): Review feedback of the bulk allocator twice found problems with "alloced" being a counter for pages allocated. The naming was based on the API name "alloc" and was based on the idea that verbal communication about malloc tends to use the fake word "malloced" instead of the fake word mallocated. To be consistent, this preparation patch renames alloced to allocated in rmqueue_bulk so the bulk allocator and per-cpu allocator use similar names when the bulk allocator is introduced. Link: https://lkml.kernel.org/r/20210325114228.27719-1-mgorman@techsingularity.net Link: https://lkml.kernel.org/r/20210325114228.27719-2-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Alexander Lobakin <alobakin@pm.me> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:16:54 +08:00
Mateusz Nosek	f78440ef9d	mm/page_alloc.c: clean code by merging two functions finalise_ac() is just 'epilogue' for 'prepare_alloc_pages'. Therefore there is no need to keep them both so 'finalise_ac' content can be merged into prepare_alloc_pages() code. It would make __alloc_pages_nodemask() cleaner when it comes to readability. Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport <rppt@kernel.org> Link: https://lkml.kernel.org/r/20200916110118.6537-1-mateusznosek0@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com> Conflicts: mm/page_alloc.c	2024-06-12 13:16:53 +08:00
Ilias Apalodimas	ae75cd7b60	net: page_pool: API cleanup and comments Functions starting with __ usually indicate those which are exported, but should not be called directly. Update some of those declared in the API and make it more readable. page_pool_unmap_page() and page_pool_release_page() were doing exactly the same thing calling __page_pool_clean_page(). Let's rename __page_pool_clean_page() to page_pool_release_page() and export it in order to show up on perf logs and get rid of page_pool_unmap_page(). Finally rename __page_pool_put_page() to page_pool_put_page() since we can now directly call it from drivers and rename the existing page_pool_put_page() to page_pool_put_full_page() since they do the same thing but the latter is trying to sync the full DMA area. This patch also updates netsec, mvneta and stmmac drivers which use those functions. Suggested-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com> Conflicts: drivers/net/ethernet/marvell/mvneta.c drivers/net/ethernet/socionext/netsec.c net/core/page_pool.c Abort the change in mvneta.c and netsec.c to fix the conflicts.	2024-06-12 13:16:53 +08:00
Li RongQing	9e33173b6b	page_pool: refill page when alloc.count of pool is zero "do {} while" in page_pool_refill_alloc_cache will always refill page once whether refill is true or false, and whether alloc.count of pool is less than PP_ALLOC_CACHE_REFILL or not this is wrong, and will cause overflow of pool->alloc.cache the caller of __page_pool_get_cached should provide guarantee that pool->alloc.cache is safe to access, so in_serving_softirq should be removed as suggested by Jesper Dangaard Brouer in https://patchwork.ozlabs.org/patch/1233713/ so fix this issue by calling page_pool_refill_alloc_cache() only when pool->alloc.count is zero Fixes: `44768decb7` ("page_pool: handle page recycle for NUMA_NO_NODE condition") Signed-off-by: Li RongQing <lirongqing@baidu.com> Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:16:52 +08:00
Jesper Dangaard Brouer	37ca71f4f7	page_pool: help compiler remove code in case CONFIG_NUMA=n When kernel is compiled without NUMA support, then page_pool NUMA config setting (pool->p.nid) doesn't make any practical sense. The compiler cannot see that it can remove the code paths. This patch avoids reading pool->p.nid setting in case of !CONFIG_NUMA, in allocation and numa check code, which helps compiler to see the optimisation potential. It leaves update code intact to keep API the same. $ ./scripts/bloat-o-meter net/core/page_pool.o-numa-enabled \ net/core/page_pool.o-numa-disabled add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-113 (-113) Function old new delta page_pool_create 401 398 -3 __page_pool_alloc_pages_slow 439 426 -13 page_pool_refill_alloc_cache 425 328 -97 Total: Before=3611, After=3498, chg -3.13% Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:16:52 +08:00
Jesper Dangaard Brouer	f79fb94561	page_pool: handle page recycle for NUMA_NO_NODE condition The check in pool_page_reusable (page_to_nid(page) == pool->p.nid) is not valid if page_pool was configured with pool->p.nid = NUMA_NO_NODE. The goal of the NUMA changes in commit `d5394610b1` ("page_pool: Don't recycle non-reusable pages"), were to have RX-pages that belongs to the same NUMA node as the CPU processing RX-packet during softirq/NAPI. As illustrated by the performance measurements. This patch moves the NAPI checks out of fast-path, and at the same time solves the NUMA_NO_NODE issue. First realize that alloc_pages_node() with pool->p.nid = NUMA_NO_NODE will lookup current CPU nid (Numa ID) via numa_mem_id(), which is used as the the preferred nid. It is only in rare situations, where e.g. NUMA zone runs dry, that page gets doesn't get allocated from preferred nid. The page_pool API allows drivers to control the nid themselves via controlling pool->p.nid. This patch moves the NAPI check to when alloc cache is refilled, via dequeuing/consuming pages from the ptr_ring. Thus, we can allow placing pages from remote NUMA into the ptr_ring, as the dequeue/consume step will check the NUMA node. All current drivers using page_pool will alloc/refill RX-ring from same CPU running softirq/NAPI process. Drivers that control the nid explicitly, also use page_pool_update_nid when changing nid runtime. To speed up transision to new nid the alloc cache is now flushed on nid changes. This force pages to come from ptr_ring, which does the appropate nid check. For the NUMA_NO_NODE case, when a NIC IRQ is moved to another NUMA node, we accept that transitioning the alloc cache doesn't happen immediately. The preferred nid change runtime via consulting numa_mem_id() based on the CPU processing RX-packets. Notice, to avoid stressing the page buddy allocator and avoid doing too much work under softirq with preempt disabled, the NUMA check at ptr_ring dequeue will break the refill cycle, when detecting a NUMA mismatch. This will cause a slower transition, but its done on purpose. Fixes: `d5394610b1` ("page_pool: Don't recycle non-reusable pages") Reported-by: Li RongQing <lirongqing@baidu.com> Reported-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:16:52 +08:00
Lorenzo Bianconi	926f968742	net: page_pool: add the possibility to sync DMA memory for device Introduce the following parameters in order to add the possibility to sync DMA memory for device before putting allocated pages in the page_pool caches: - PP_FLAG_DMA_SYNC_DEV: if set in page_pool_params flags, all pages that the driver gets from page_pool will be DMA-synced-for-device according to the length provided by the device driver. Please note DMA-sync-for-CPU is still device driver responsibility - offset: DMA address offset where the DMA engine starts copying rx data - max_len: maximum DMA memory size page_pool is allowed to flush. This is currently used in __page_pool_alloc_pages_slow routine when pages are allocated from page allocator These parameters are supposed to be set by device drivers. This optimization reduces the length of the DMA-sync-for-device. The optimization is valid because pages are initially DMA-synced-for-device as defined via max_len. At RX time, the driver will perform a DMA-sync-for-CPU on the memory for the packet length. What is important is the memory occupied by packet payload, because this is the area CPU is allowed to read and modify. As we don't track cache-lines written into by the CPU, simply use the packet payload length as dma_sync_size at page_pool recycle time. This also take into account any tail-extend. Tested-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:16:51 +08:00
Saeed Mahameed	0d1a0cfd7f	page_pool: Don't recycle non-reusable pages A page is NOT reusable when at least one of the following is true: 1) allocated when system was under some pressure. (page_is_pfmemalloc) 2) belongs to a different NUMA node than pool->p.nid. To update pool->p.nid users should call page_pool_update_nid(). Holding on to such pages in the pool will hurt the consumer performance when the pool migrates to a different numa node. Performance testing: XDP drop/tx rate and TCP single/multi stream, on mlx5 driver while migrating rx ring irq from close to far numa: mlx5 internal page cache was locally disabled to get pure page pool results. CPU: Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz NIC: Mellanox Technologies MT27700 Family [ConnectX-4] (100G) XDP Drop/TX single core: NUMA \| XDP \| Before \| After --------------------------------------- Close \| Drop \| 11 Mpps \| 10.9 Mpps Far \| Drop \| 4.4 Mpps \| 5.8 Mpps Close \| TX \| 6.5 Mpps \| 6.5 Mpps Far \| TX \| 3.5 Mpps \| 4 Mpps Improvement is about 30% drop packet rate, 15% tx packet rate for numa far test. No degradation for numa close tests. TCP single/multi cpu/stream: NUMA \| #cpu \| Before \| After -------------------------------------- Close \| 1 \| 18 Gbps \| 18 Gbps Far \| 1 \| 15 Gbps \| 18 Gbps Close \| 12 \| 80 Gbps \| 80 Gbps Far \| 12 \| 68 Gbps \| 80 Gbps In all test cases we see improvement for the far numa case, and no impact on the close numa case. The impact of adding a check per page is very negligible, and shows no performance degradation whatsoever, also functionality wise it seems more correct and more robust for page pool to verify when pages should be recycled, since page pool can't guarantee where pages are coming from. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:16:51 +08:00
Saeed Mahameed	75af77c4bd	page_pool: Add API to update numa node Add page_pool_update_nid() to be called by page pool consumers when they detect numa node changes. It will update the page pool nid value to start allocating from the new effective numa node. This is to mitigate page pool allocating pages from a wrong numa node, where the pool was originally allocated, and holding on to pages that belong to a different numa node, which causes performance degradation. For pages that are already being consumed and could be returned to the pool by the consumer, in next patch we will add a check per page to avoid recycling them back to the pool and return them to the page allocator. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com>	2024-06-12 13:16:51 +08:00
Xiongfeng Wang	3370a82d1d	asm-generic: introduce io_stop_wc() and add implementation for ARM64 For memory accesses with write-combining attributes (e.g. those returned by ioremap_wc()), the CPU may wait for prior accesses to be merged with subsequent ones. But in some situation, such wait is bad for the performance. We introduce io_stop_wc() to prevent the merging of write-combining memory accesses before this macro with those after it. We add implementation for ARM64 using DGH instruction and provide NOP implementation for other architectures. Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com> Suggested-by: Will Deacon <will@kernel.org> Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20211221035556.60346-1-wangxiongfeng2@huawei.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: hongrongxuan <hongrongxuan@huawei.com> Conflicts: arch/arm64/include/asm/barrier.h include/asm-generic/barrier.h	2024-06-12 13:16:50 +08:00

... 5 6 7 8 9 ...

873795 Commits All Branches Search

873795 Commits

All Branches