The temperature registers appear to report values in degrees Celsius
while the hwmon API mandates values to be exposed in millidegrees
Celsius. Do the conversion so that the values reported by "sensors"
are correct.
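A minimal sketch of the conversion, assuming the register yields an unsigned whole-degree Celsius value. The helper name is hypothetical and this is not the tg3 code itself; it only shows the scaling that the hwmon sysfs ABI (temp*_input in millidegrees Celsius) requires.

```c
/*
 * Hypothetical helper: hwmon reports temperatures in millidegrees
 * Celsius, so a raw whole-degree reading is scaled by 1000.
 */
static long toy_tg3_temp_to_hwmon(unsigned int raw_celsius)
{
	return (long)raw_celsius * 1000;
}
```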
Fixes: aed93e0bf4 ("tg3: Add hwmon support for temperature")
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Cc: Prashant Sreedharan <prashant@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
Cc: stable@vger.kernel.org [v3.6+]
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 8b63ec1837 ("phylib: Make PHYs children of their MDIO bus, not
the bus' parent.") uncovered a problem in mdiobus_unregister() which
leads to this warning when I reboot an APM Mustang (arm64) platform:
WARNING: CPU: 7 PID: 4239 at fs/sysfs/group.c:224 sysfs_remove_group+0xa0/0xa4()
sysfs group fffffe0000e07a10 not found for kobject 'xgene-mii-eth0:03'
...
CPU: 7 PID: 4239 Comm: reboot Tainted: G E 4.2.0-0.18.el7.test15.aarch64 #1
Hardware name: AppliedMicro Mustang/Mustang, BIOS 1.1.0 Aug 26 2015
Call Trace:
[<fffffe000009739c>] dump_backtrace+0x0/0x170
[<fffffe000009752c>] show_stack+0x20/0x2c
[<fffffe00007436f0>] dump_stack+0x78/0x9c
[<fffffe00000c2cb4>] warn_slowpath_common+0xa0/0xd8
[<fffffe00000c2d60>] warn_slowpath_fmt+0x74/0x88
[<fffffe0000293d3c>] sysfs_remove_group+0x9c/0xa4
[<fffffe00004a8bac>] dpm_sysfs_remove+0x5c/0x70
[<fffffe000049b388>] device_del+0x44/0x208
[<fffffe000049b578>] device_unregister+0x2c/0x7c
[<fffffe000050dc68>] mdiobus_unregister+0x48/0x94
[<fffffe000052afd0>] xgene_enet_mdio_remove+0x28/0x44
[<fffffe000052d3f0>] xgene_enet_remove+0xd0/0xd8
[<fffffe000052d424>] xgene_enet_shutdown+0x2c/0x3c
[<fffffe00004a204c>] platform_drv_shutdown+0x24/0x40
[<fffffe000049d4f4>] device_shutdown+0xf0/0x1b4
[<fffffe00000e31ec>] kernel_restart_prepare+0x40/0x4c
[<fffffe00000e32f8>] kernel_restart+0x1c/0x80
[<fffffe00000e3670>] SyS_reboot+0x17c/0x250
The problem is that mdiobus_unregister() deletes the bus device before
unregistering the phy devices on the bus. This wasn't a problem before
because the phys were not children of the bus:
/sys/devices/platform/APMC0D05:00/net/eth0/xgene-mii-eth0:03
/sys/devices/platform/APMC0D05:00/net/eth0/xgene-mii-eth0
But now that they are:
/sys/devices/platform/APMC0D05:00/net/eth0/xgene-mii-eth0/xgene-mii-eth0:03
when mdiobus_unregister deletes the bus device, the phy subdirs are
removed from sysfs also. So when the phys are unregistered afterward,
we get the warning. This patch changes the order so that phys are
unregistered before the bus device is deleted.
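The ordering problem can be modelled with a toy parent/child device tree. This is a userspace sketch under stated assumptions: all names are hypothetical, deleting the bus is assumed to remove its children's sysfs entries, and removing an already-removed entry counts as one warning.

```c
#include <stdbool.h>

#define MAX_PHYS 4

/* Hypothetical device: only tracks whether it still has a sysfs entry. */
struct toy_dev {
	bool in_sysfs;
};

/* Hypothetical MDIO bus whose PHYs live in the bus's sysfs directory. */
struct toy_bus {
	struct toy_dev dev;
	struct toy_dev phys[MAX_PHYS];
	int nphys;
};

static int warnings; /* counts "sysfs group not found" style warnings */

static void toy_bus_init(struct toy_bus *bus, int nphys)
{
	bus->dev.in_sysfs = true;
	bus->nphys = nphys;
	for (int i = 0; i < nphys; i++)
		bus->phys[i].in_sysfs = true;
}

/* Removing a device whose sysfs entry is already gone warns. */
static void toy_device_del(struct toy_dev *d)
{
	if (!d->in_sysfs) {
		warnings++;
		return;
	}
	d->in_sysfs = false;
}

/* Deleting the parent also removes the child directories from sysfs. */
static void toy_bus_del(struct toy_bus *bus)
{
	for (int i = 0; i < bus->nphys; i++)
		bus->phys[i].in_sysfs = false;
	toy_device_del(&bus->dev);
}

/* Old order: bus first, then PHYs, giving one warning per PHY. */
static void toy_unregister_old(struct toy_bus *bus)
{
	toy_bus_del(bus);
	for (int i = 0; i < bus->nphys; i++)
		toy_device_del(&bus->phys[i]);
}

/* Fixed order: PHYs first, then the bus device. */
static void toy_unregister_fixed(struct toy_bus *bus)
{
	for (int i = 0; i < bus->nphys; i++)
		toy_device_del(&bus->phys[i]);
	toy_bus_del(bus);
}
```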
Fixes: 8b63ec1837 ("phylib: Make PHYs children of their MDIO bus, not the bus' parent.")
Signed-off-by: Mark Salter <msalter@redhat.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Mark Langsdorf <mlangsdo@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A number of VRF patches used 'int' for table id. It should be u32 to be
consistent with the rest of the stack.
Fixes:
4e3c89920c ("net: Introduce VRF related flags and helpers")
15be405eb2 ("net: Add inet_addr lookup by table")
30bbaa1950 ("net: Fix up inet_addr_type checks")
021dd3b8a1 ("net: Add routes to the table associated with the device")
dc028da54e ("inet: Move VRF table lookup to inlined function")
f6d3c19274 ("net: FIB tracepoints")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
opts_size is only written and never read. The following patch
removes this unused variable.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As David pointed out, spinlocks are no longer needed
to protect the per-cpu queues used in the gro cells infrastructure.
Also use the new napi_complete_done() API so that gro_flush_timeout
tweaks have an effect.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Other Sierra Wireless MC73xx devices exist, with different USB IDs.
Cc: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David Ward <david.ward@ll.mit.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
The kernel applies the RA managed/otherconf flags silently and
forgets to send an ifinfo notification about their change when the
router provides a zero reachable_time and retrans_timer, as dnsmasq
and many routers do. A zero value just means "unspecified by this
router", and the host should continue using whatever value it is
already using.
Userspace may monitor the ifinfo notifications to activate DHCPv6.
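A minimal sketch of the intended decision, assuming a simplified per-interface state (the struct and function names are hypothetical stand-ins, not the ndisc code): whether a notification is needed depends only on whether the M/O flags changed, independent of whether the timer fields in the RA were zero.

```c
#include <stdbool.h>

/* Hypothetical per-interface RA state (a stand-in for inet6_dev). */
struct toy_idev {
	bool managed;   /* RA "M" flag */
	bool otherconf; /* RA "O" flag */
};

/*
 * Apply the M/O flags from a received RA and report whether an ifinfo
 * notification should be sent. Zero reachable_time/retrans_timer values
 * play no part in this decision.
 */
static bool toy_apply_ra_flags(struct toy_idev *idev, bool managed,
			       bool otherconf)
{
	bool changed = idev->managed != managed ||
		       idev->otherconf != otherconf;

	idev->managed = managed;
	idev->otherconf = otherconf;
	return changed;
}
```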
Signed-off-by: Marius Tomaschewski <mt@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn says:
====================
DSA port configuration and status
This patchset allows various switch port settings to be configured and
port status to be sampled. Some of these patches have been posted
before.
The first three patches provide infrastructure for configuring a
switch port's link speed and duplex from a fixed_link phy.
Patch four then uses this infrastructure to allow the CPU and DSA
ports of a switch to be configured using a fixed-link property in the
device tree.
Patches five and six allow a phy-mode property to be specified in the
device tree, and allow this to be used for configuring RGMII delays.
Patches seven through nine allow link status, for example that of an
SFP module, to be read from a gpio.
Changes since v1:
Rewrite 9/9 so that it hopefully does not regress
868a4215be ("net: phy: fixed_phy: handle link-down case")
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
What features a PHY supports is masked in genphy_config_init() by
looking at the PHY's BMSR register.
If the link is down, fixed_phy_update_regs() will only set the
auto-negotiation capable bit in BMSR. Thus genphy_config_init() comes to
the conclusion that the PHY can only perform 10/Half, and masks out the
higher speed features. If, however, the link is up, BMSR is set to
indicate the speed the PHY is capable of auto-negotiating, and
genphy_config_init() does not mask out the high speed features.
To fix this, when the link is down, have fixed_phy_update_regs() leave
the link status, auto-negotiation complete, and link partner
capabilities unset, but set all the local capabilities depending on
the fixed phy speed.
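The register logic described above can be sketched as follows. The BMSR bit values are those of `<linux/mii.h>`; the function name is hypothetical, and gigabit abilities (which the real driver reports via the extended status register) are reduced here to setting BMSR_ESTATEN.

```c
#include <stdbool.h>
#include <stdint.h>

/* BMSR bit values as defined in <linux/mii.h>. */
#define BMSR_LSTATUS      0x0004
#define BMSR_ANEGCAPABLE  0x0008
#define BMSR_ANEGCOMPLETE 0x0020
#define BMSR_ESTATEN      0x0100
#define BMSR_10HALF       0x0800
#define BMSR_10FULL       0x1000
#define BMSR_100HALF      0x2000
#define BMSR_100FULL      0x4000

/* Hypothetical helper: emulated BMSR for a fixed PHY. */
static uint16_t toy_fixed_phy_bmsr(int speed, bool link_up)
{
	uint16_t bmsr = BMSR_ANEGCAPABLE;

	/* Local capabilities depend only on the configured speed. */
	switch (speed) {
	case 1000:
		bmsr |= BMSR_ESTATEN;
		/* fall through */
	case 100:
		bmsr |= BMSR_100FULL | BMSR_100HALF;
		/* fall through */
	case 10:
		bmsr |= BMSR_10FULL | BMSR_10HALF;
		break;
	}

	/* Link-dependent bits are set only while the link is up. */
	if (link_up)
		bmsr |= BMSR_LSTATUS | BMSR_ANEGCOMPLETE;

	return bmsr;
}
```

With this split, genphy_config_init() would see 100 Mbit/s capabilities even while a 100 Mbit/s fixed link is down.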
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
An SFP module may have a link up/down status pin which can be
connected to a GPIO line of the host. Add support for reading such a
GPIO in the fixed_phy driver.
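A sketch of the link-status selection, assuming a simulated GPIO level array in place of real GPIO reads. All names are hypothetical: when a link GPIO is configured its level decides the link state, otherwise the statically configured state is used.

```c
#include <stdbool.h>

#define TOY_NR_GPIOS 8

/* Hypothetical simulated GPIO input levels (stand-in for real GPIO reads). */
static int toy_gpio_level[TOY_NR_GPIOS];

/* Hypothetical fixed-link state; link_gpio < 0 means no GPIO is wired up. */
struct toy_fixed_link {
	int link_gpio;
	bool link; /* static link state used when there is no GPIO */
};

static bool toy_fixed_link_up(const struct toy_fixed_link *fl)
{
	if (fl->link_gpio >= 0 && fl->link_gpio < TOY_NR_GPIOS)
		return toy_gpio_level[fl->link_gpio] != 0;
	return fl->link;
}
```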
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When polling for link status, don't consider ports which have a forced
link. Such ports don't monitor their phy or may not even have a phy.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some Marvell switches allow the RGMII Rx and Tx clock to be delayed
when the port is using RGMII. Have the adjust_link function look at
the phy interface type and enable this delay as requested.
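A sketch of mapping the interface type to clock delays. The enum mirrors the kernel's RGMII, RGMII_ID, RGMII_RXID and RGMII_TXID interface modes, but the names and the helper are hypothetical, and the actual Marvell register programming is omitted.

```c
#include <stdbool.h>

/* Hypothetical subset of interface modes, mirroring the RGMII variants. */
enum toy_phy_mode {
	TOY_MODE_RGMII,      /* no delay added by the switch */
	TOY_MODE_RGMII_ID,   /* delay both RX and TX clocks */
	TOY_MODE_RGMII_RXID, /* delay RX clock only */
	TOY_MODE_RGMII_TXID, /* delay TX clock only */
};

struct toy_rgmii_delay {
	bool rx;
	bool tx;
};

static struct toy_rgmii_delay toy_rgmii_delays(enum toy_phy_mode mode)
{
	struct toy_rgmii_delay d = { false, false };

	switch (mode) {
	case TOY_MODE_RGMII_ID:
		d.rx = true;
		d.tx = true;
		break;
	case TOY_MODE_RGMII_RXID:
		d.rx = true;
		break;
	case TOY_MODE_RGMII_TXID:
		d.tx = true;
		break;
	case TOY_MODE_RGMII:
		break;
	}
	return d;
}
```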
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It can be useful for DSA and CPU ports to have a phy-mode property, in
particular to specify RGMII delays. Parse the property and set it in
the fixed-link phydev.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
By default, DSA and CPU ports are configured to the maximum speed the
switch supports. However, there can be use cases where the peer device's
port is slower. Allow a fixed-link property to be used with the DSA
and CPU port in the device tree, and use this information to configure
the port.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Set the supported field of the phydev to indicate the speed features
of the phy. If the phy is never attached to a netdev, but used in an
adjust_link() function, the speed will be incorrectly evaluated to
10/half rather than the correct speed/duplex.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current code sets user ports to perform auto negotiation using the
phy. CPU and DSA ports are configured to full duplex and maximum speed
the switch supports.
There are, however, use cases where the CPU has a slower port, and where
user ports have SFP modules with a fixed speed. In these cases, allow
port settings to be read from a fixed_phy device. The switch driver then
needs to implement the adjust_link op so that the port settings can be
set.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some Ethernet MAC drivers using the PHY library require the hardcoding
of link parameters when interfaced to a switch device, SFP module,
switch to switch port, etc. This has typically led to various ad-hoc
implementations looking like this:
- using a "fixed PHY" emulated device, which will provide link
indication towards the Ethernet MAC driver and hardware
- pretending there is no PHY and hardcoding link parameters, à la mv643xx_eth
Based on that, it is desirable to have the PHY drivers advertise the
correct link parameters, just like regular Ethernet PHYs do towards their
CPU Ethernet MAC drivers. However, Ethernet MAC drivers should be able
to tell whether this link should be monitored or not. In the context
of an Ethernet switch, SFP module, or switch to switch link, we do not
need to monitor this link since it should always be up.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a memory leak in the mpls netns init function in case of failure. If
register_net_sysctl fails then we need to free the ctl_table.
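A userspace sketch of the error path, assuming hypothetical stand-ins for the ctl_table copy and for register_net_sysctl() returning NULL on failure. The live-allocation counter exists only to make the leak observable.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for struct ctl_table and the sysctl header. */
struct toy_ctl_table { const char *procname; };
struct toy_sysctl_header { struct toy_ctl_table *table; };

struct toy_net {
	struct toy_ctl_table *table;
	struct toy_sysctl_header *hdr;
};

static const struct toy_ctl_table toy_template[] = {
	{ "platform_labels" },
	{ NULL },
};

static int toy_live_tables; /* table copies not yet freed */

/* Simulated registration; returns NULL when asked to fail. */
static struct toy_sysctl_header *toy_register(struct toy_ctl_table *t, bool fail)
{
	struct toy_sysctl_header *h;

	if (fail)
		return NULL;
	h = malloc(sizeof(*h));
	if (h)
		h->table = t;
	return h;
}

static int toy_mpls_net_init(struct toy_net *net, bool fail_register)
{
	struct toy_ctl_table *table = malloc(sizeof(toy_template));

	if (!table)
		return -ENOMEM;
	memcpy(table, toy_template, sizeof(toy_template));
	toy_live_tables++;

	net->hdr = toy_register(table, fail_register);
	if (!net->hdr)
		goto out_free; /* without this, the table copy leaks */

	net->table = table;
	return 0;

out_free:
	free(table);
	toy_live_tables--;
	return -ENOMEM;
}

static void toy_mpls_net_exit(struct toy_net *net)
{
	free(net->hdr);
	free(net->table);
	toy_live_tables--;
}
```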
Fixes: 7720c01f3f ("mpls: Add a sysctl to control the size of the mpls label table")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TOS is another key aspect of the lookup passed to fib_validate_source.
Add it to the tracepoint.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To fix build errors:
kernel/built-in.o: In function `bpf_trace_printk':
bpf_trace.c:(.text+0x11a254): undefined reference to `strncpy_from_unsafe'
kernel/built-in.o: In function `fetch_memory_string':
trace_kprobe.c:(.text+0x11acf8): undefined reference to `strncpy_from_unsafe'
Move strncpy_from_unsafe() next to probe_kernel_read/write(),
which use the same memory access style.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Fixes: 1a6877b9c0 ("lib: introduce strncpy_from_unsafe()")
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
tcp: receive-side per route dctcp handling
Original cover letter:
Currently, the following case doesn't use DCTCP, even if it should:
- responder has e.g. cubic as system wide default
- 'ip route congctl dctcp $src' was set
Then, DCTCP is NOT used if a DCTCP sender attempts to connect from a
host in the $src range: ECT(0) is set, but listen_sk is not dctcp, so
we fail the INET_ECN_is_not_ect sanity check.
We also have to examine the dst used for the SYN/ACK reply to make
this case work.
In order to minimize additional cost, store the 'ecn is must have'
information in the dst_features field.
The set targets -next instead of -net since this doesn't seem to be a
serious bug, and to give the change more soak time until it hits Linus'
tree.
v1 -> v2:
- Addressed Dave's feedback, not exposing any bits to user space
- Added patch 3 to reject incorrect configurations
- Rest as is, rebased and retested
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the following case doesn't use DCTCP, even if it should:
A responder has e.g. Cubic as system wide default, but for a specific
route to the initiating host, DCTCP is being set in RTAX_CC_ALGO. The
initiating host then uses DCTCP as congestion control, but since the
initiator sets ECT(0), tcp_ecn_create_request() doesn't set ecn_ok,
and we have to fall back to Reno after 3WHS completes.
We were thinking about how to solve this in a minimal, non-intrusive
way without bloating tcp_ecn_create_request() needlessly: let's cache
the CA ecn option flag in RTAX_FEATURES. In other words, when ECT(0)
is set on the SYN packet, set ecn_ok=1 iff route RTAX_FEATURES
contains the unexposed (internal-only) DST_FEATURE_ECN_CA. This allows
us to do only a single metric feature lookup inside tcp_ecn_create_request().
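A simplified sketch of the decision for a SYN that carries ECT(0), covering only the case described above. The bit value and function name are hypothetical, and the other ECN negotiation paths in tcp_ecn_create_request() are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit value standing in for the internal DST_FEATURE_ECN_CA. */
#define TOY_DST_FEATURE_ECN_CA (1u << 31)

/*
 * ecn_ok for a SYN with ECT(0) set: accept if the listener's own
 * congestion control needs ECN, or if the route used for the SYN/ACK
 * has the cached "CA needs ECN" bit. The latter is one feature-bit test.
 */
static bool toy_ecn_ok_for_ect_syn(bool listener_ca_needs_ecn,
				   uint32_t dst_features)
{
	if (listener_ca_needs_ecn)
		return true;
	return (dst_features & TOY_DST_FEATURE_ECN_CA) != 0;
}
```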
Joint work with Florian Westphal.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Feature bits that are invalid should not be accepted by the kernel:
only the lower 4 bits may be configured, not the remaining ones.
Even of these 4, 2 are unused.
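A sketch of the validation, assuming the four configurable bits are the RTAX_FEATURE_* flags from `<linux/rtnetlink.h>` (ECN, SACK, TIMESTAMP, ALLFRAG in bits 0 to 3). The TOY_ names and the helper are hypothetical copies for a self-contained example.

```c
#include <errno.h>
#include <stdint.h>

/* Copies of the RTAX_FEATURE_* bit layout from <linux/rtnetlink.h>. */
#define TOY_RTAX_FEATURE_ECN       (1u << 0)
#define TOY_RTAX_FEATURE_SACK      (1u << 1)
#define TOY_RTAX_FEATURE_TIMESTAMP (1u << 2)
#define TOY_RTAX_FEATURE_ALLFRAG   (1u << 3)
#define TOY_RTAX_FEATURE_MASK                                  \
	(TOY_RTAX_FEATURE_ECN | TOY_RTAX_FEATURE_SACK |        \
	 TOY_RTAX_FEATURE_TIMESTAMP | TOY_RTAX_FEATURE_ALLFRAG)

/* Reject any bit outside the user-configurable mask. */
static int toy_validate_features(uint32_t val)
{
	if (val & ~TOY_RTAX_FEATURE_MASK)
		return -EINVAL;
	return 0;
}
```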
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reduce the indentation a bit; there's no need to artificially have
it increased.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
fib_create_info() is already quite large, so before adding more
code to the metrics section move that to a helper, similar to
ip6_convert_metrics.
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Document the addition of a new sysctl variable which controls the
generation of IGMP reports for link local multicast groups in the
224.0.0.X range.
IGMP reports for local multicast groups can now be optionally
inhibited by setting the value to zero e.g.:
echo 0 > /proc/sys/net/ipv4/igmp_link_local_mcast_reports
For backwards compatibility, the previous behaviour (reports enabled)
is the default on system boot, and it can be restored by setting the
value back to non-zero.
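A sketch of the report filter, assuming host-byte-order addresses and a hypothetical helper that takes the sysctl value as a parameter. It ignores any other special-casing in the real IGMP code.

```c
#include <stdbool.h>
#include <stdint.h>

/* True for groups in 224.0.0.0/24 (host byte order). */
static bool toy_is_link_local_mcast(uint32_t group)
{
	return (group & 0xffffff00u) == 0xe0000000u;
}

/* Hypothetical filter driven by igmp_link_local_mcast_reports. */
static bool toy_should_send_report(uint32_t group,
				   int igmp_link_local_mcast_reports)
{
	if (toy_is_link_local_mcast(group) && !igmp_link_local_mcast_reports)
		return false;
	return true;
}
```

For example, 224.0.0.251 (0xE00000FB) is suppressed when the sysctl is 0, while 239.1.1.1 is still reported.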
Signed-off-by: Philip Downey <pdowney@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the tun-info options pointer is used in a few cases to
pass options around. But tunnel options can be accessed using the
ip_tunnel_info_opts() API without using the pointer. The following
patch removes the redundant pointer and consistently makes use
of the API.
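A sketch of why the pointer is redundant, assuming the options are stored immediately after the tunnel info struct (which is how we understand ip_tunnel_info_opts() to locate them). The struct layout and helper names are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical tunnel info; options_len bytes follow the struct in memory. */
struct toy_tunnel_info {
	uint64_t tun_id;
	uint8_t options_len;
};

/* The options start right after the struct, so no pointer is stored. */
static void *toy_tunnel_info_opts(struct toy_tunnel_info *info)
{
	return info + 1;
}

static struct toy_tunnel_info *toy_tunnel_info_alloc(uint64_t id,
						     const void *opts,
						     uint8_t len)
{
	struct toy_tunnel_info *info = malloc(sizeof(*info) + len);

	if (!info)
		return NULL;
	info->tun_id = id;
	info->options_len = len;
	memcpy(toy_tunnel_info_opts(info), opts, len);
	return info;
}
```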
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Reviewed-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/ipv4/af_inet.c: In function 'snmp_get_cpu_field64':
>> net/ipv4/af_inet.c:1486:26: error: 'offt' undeclared (first use in this function)
v = *(((u64 *)bhptr) + offt);
^
net/ipv4/af_inet.c:1486:26: note: each undeclared identifier is reported only once for each function it appears in
net/ipv4/af_inet.c: In function 'snmp_fold_field64':
>> net/ipv4/af_inet.c:1499:39: error: 'offct' undeclared (first use in this function)
res += snmp_get_cpu_field(mib, cpu, offct, syncp_offset);
^
>> net/ipv4/af_inet.c:1499:10: error: too many arguments to function 'snmp_get_cpu_field'
res += snmp_get_cpu_field(mib, cpu, offct, syncp_offset);
^
net/ipv4/af_inet.c:1455:5: note: declared here
u64 snmp_get_cpu_field(void __percpu *mib, int cpu, int offt)
^
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
poll() returns immediately after user space sets the kernel's current
frame (ring->head) to SKIP, even though there is no new frame. And when
all frames are VALID and a user space program unintentionally sets
(only) the kernel's current frame to UNUSED and then calls poll(), it
will not return immediately even though there are VALID frames.
To avoid situations like these, I think we need to scan all frames
to find VALID frames at poll(), just as netlink_alloc_skb() and
netlink_forward_ring() scan for an UNUSED frame at skb allocation.
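A sketch of the proposed scan, assuming a simplified subset of the mmap frame states and a hypothetical helper. Instead of inspecting only the frame at the head, the whole ring is checked for a VALID frame.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified subset of the ring frame states. */
enum toy_frame_status { TOY_UNUSED, TOY_RESERVED, TOY_VALID, TOY_SKIP };

/* Return true if any frame in the ring is VALID, starting at head. */
static bool toy_ring_has_valid(const enum toy_frame_status *frames,
			       size_t nframes, size_t head)
{
	for (size_t i = 0; i < nframes; i++) {
		if (frames[(head + i) % nframes] == TOY_VALID)
			return true;
	}
	return false;
}
```

Both cases from the description are covered: a SKIP head with no VALID frames reports nothing, and an UNUSED head with VALID frames elsewhere still reports data.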
Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Aleksey Makarov says:
====================
net: thunderx: New features and fixes
v2:
- The unused affinity_mask field of the structure cmp_queue
has been deleted. (thanks to David Miller)
- The unneeded initializers have been dropped. (thanks to Alexey Klimov)
- The commit message "net: thunderx: Rework interrupt handling"
has been fixed. (thanks to Alexey Klimov)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Support for setting a VF's corresponding BGX LMAC in internal
loopback mode. This mode can be used for verifying basic HW
functionality such as packet I/O, RX checksum validation,
CQ/RBDR interrupts, stats, etc. Useful when the DUT has no external
network connectivity.
'loopback' mode can be enabled or disabled via ethtool.
Note: This feature is not supported when the number of enabled VFs is
more than the number of physical interfaces, i.e. active BGX LMACs.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: Aleksey Makarov <aleksey.makarov@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for handling multiple qsets assigned to a
single VF, thereby increasing the number of queues from the earlier 8
to the maximum number of CPUs in the system, i.e. 48 queues on a
single-node and 96 on a dual-node system. The user doesn't have the
option to choose which Qsets/VFs are merged. Upon request from a VF,
the PF assigns the next free Qsets as secondary qsets. To maintain the
current behavior, the number of queues is kept at 8 by default; this
can be increased via ethtool.
If the user wants to unbind the NICVF driver from a secondary Qset, it
should be done after tearing down the primary VF's interface.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: Aleksey Makarov <aleksey.makarov@caviumnetworks.com>
Signed-off-by: Robert Richter <rrichter@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rework interrupt handler to avoid checking IRQ affinity of
CQ interrupts. Now separate handlers are registered for each IRQ
including RBDR. Register interrupt handlers for only those
which are being used. Add nicvf_dump_intr_status() and use it
in irq handlers.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: Aleksey Makarov <aleksey.makarov@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch configures HW to strip the 802.1Q header if found in a
received packet. The stripped VLAN ID and TCI information is
passed on to software via CQE_RX. Also sets netdev's 'vlan_features'
so that other HW offload features can be used for tagged packets.
This offload feature can be enabled or disabled via ethtool.
The network stack normally ignores RPS for 802.1Q packets, resulting in
low throughput. With this offload enabled, throughput for tagged packets
will be almost the same as for untagged packets.
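For reference, a software equivalent of the tag stripping that the hardware performs might look like the sketch below. The helper and macro names are hypothetical; it assumes a plain Ethernet frame with a single 802.1Q tag (TPID 0x8100) directly after the MAC addresses.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAC_ADDRS_LEN 12     /* destination + source MAC addresses */
#define TOY_VLAN_HLEN     4      /* TPID (2 bytes) + TCI (2 bytes) */
#define TOY_ETH_P_8021Q   0x8100

/*
 * Remove an 802.1Q tag in place. Returns the new frame length and stores
 * the TCI (PCP/DEI/VID) in *tci; returns len unchanged if not tagged.
 */
static size_t toy_vlan_strip(uint8_t *frame, size_t len, uint16_t *tci)
{
	uint16_t tpid;

	if (len < TOY_MAC_ADDRS_LEN + TOY_VLAN_HLEN + 2)
		return len;
	tpid = (uint16_t)((frame[12] << 8) | frame[13]);
	if (tpid != TOY_ETH_P_8021Q)
		return len;

	*tci = (uint16_t)((frame[14] << 8) | frame[15]);
	/* Shift the inner EtherType and payload over the 4-byte tag. */
	memmove(frame + TOY_MAC_ADDRS_LEN,
		frame + TOY_MAC_ADDRS_LEN + TOY_VLAN_HLEN,
		len - TOY_MAC_ADDRS_LEN - TOY_VLAN_HLEN);
	return len - TOY_VLAN_HLEN;
}
```

The VLAN ID is the low 12 bits of the returned TCI (`tci & 0x0fff`).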
Note: This patch doesn't enable HW VLAN insertion for transmit packets.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: Aleksey Makarov <aleksey.makarov@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adding support for receive hashing HW offload by using RSS_ALG
and RSS_TAG fields of CQE_RX descriptor. Also removed dependency
on minimum receive queue count to configure RSS so that hash is
always generated.
This hash is used by RPS logic to distribute flows across multiple
CPUs. Offload can be disabled via ethtool.
Signed-off-by: Robert Richter <rrichter@cavium.com>
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: Aleksey Makarov <aleksey.makarov@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the nicvf_send_msg_to_pf() function in the mailbox code.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: Robert Richter <rrichter@cavium.com>
Signed-off-by: Aleksey Makarov <aleksey.makarov@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added ethtool support to dump receive packet error statistics reported
in the CQE. Also made some small fixes.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: Aleksey Makarov <aleksey.makarov@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The liquidio and thunder drivers have different maintainers.
Signed-off-by: Aleksey Makarov <aleksey.makarov@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Raghavendra K T says:
====================
Optimize the snmp stat aggregation for large cpus
While creating 1000 containers, perf shows a lot of time spent in
snmp_fold_field on a large cpu system.
This patch tries to improve this by reordering the statistics gathering.
Please note that similar overhead was also reported while creating
veth pairs https://lkml.org/lkml/2013/3/19/556
Changes in V4:
- remove 'item' variable and use IPSTATS_MIB_MAX to avoid sparse
warning (Eric) also remove 'item' parameter (Joe)
- add missing memset of padding.
Changes in V3:
- use memset to initialize temp buffer in leaf function. (David)
- use memcpy to copy the buffer data to stat instead of unalign_pu (Joe)
- Move buffer definition to leaf function __snmp6_fill_stats64() (Eric)
Changes in V2:
- Allocate the stat calculation buffer in stack. (Eric)
Setup:
160 cpu (20 core) baremetal powerpc system with 1TB memory
1000 docker containers were created with the command
docker run -itd ubuntu:15.04 /bin/bash in loop
observation:
Docker container creation time increased linearly from around 1.6 sec to 7.5 sec
(at 1000 containers). Perf data showed that creating veth interfaces, which
results in the code path below, was taking more time.
rtnl_fill_ifinfo
-> inet6_fill_link_af
-> inet6_fill_ifla6_attrs
-> snmp_fold_field
proposed idea:
currently __snmp6_fill_stats64 calls snmp_fold_field, which walks
through the per cpu data of one item at a time (iteratively for around 36 items).
The patch aggregates the statistics by going through
all the items of each cpu sequentially, which reduces cache
misses.
Performance of docker creation improved by more than 2x
after the patch.
before the patch:
================
3f45ba571a42e925c4ec4aaee0e48d7610a9ed82a4c931f83324d41822cf6617
real 0m6.836s
user 0m0.095s
sys 0m0.011s
perf record -a docker run -itd ubuntu:15.04 /bin/bash
=======================================================
50.73% docker [kernel.kallsyms] [k] snmp_fold_field
9.07% swapper [kernel.kallsyms] [k] snooze_loop
3.49% docker [kernel.kallsyms] [k] veth_stats_one
2.85% swapper [kernel.kallsyms] [k] _raw_spin_lock
1.37% docker docker [.] backtrace_qsort
1.31% docker docker [.] strings.FieldsFunc
cache-misses: 2.7%
after the patch:
=============
9178273e9df399c8290b6c196e4aef9273be2876225f63b14a60cf97eacfafb5
real 0m3.249s
user 0m0.088s
sys 0m0.020s
perf record -a docker run -itd ubuntu:15.04 /bin/bash
=======================================================
10.57% docker docker [.] scanblock
8.37% swapper [kernel.kallsyms] [k] snooze_loop
6.91% docker [kernel.kallsyms] [k] snmp_get_cpu_field
6.67% docker [kernel.kallsyms] [k] veth_stats_one
3.96% docker docker [.] runtime_MSpan_Sweep
2.47% docker docker [.] strings.FieldsFunc
cache-misses: 1.41 %
Please let me know if you have suggestions/comments.
Thanks Eric, Joe and David for the comments.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Docker container creation time increased linearly from around 1.6 sec to 7.5 sec
(at 1000 containers), and perf data showed 50% overhead in snmp_fold_field.
reason: currently __snmp6_fill_stats64 calls snmp_fold_field, which walks
through the per cpu data of one item at a time (iteratively for around 36 items).
idea: This patch aggregates the statistics by going through
all the items of each cpu sequentially, which reduces cache
misses.
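The loop interchange can be sketched as follows. A flat array `stats[cpu * TOY_NR_ITEMS + item]` stands in for the per-cpu areas, and the u64_stats synchronisation of the real code is omitted; all names and sizes are hypothetical. Both functions compute the same sums; the second touches each cpu's counters sequentially.

```c
#include <stddef.h>
#include <string.h>

#define TOY_NR_CPUS  4
#define TOY_NR_ITEMS 36

/* Old pattern: for each item, visit every cpu (strided accesses). */
static void toy_fold_item_major(const unsigned long long *stats,
				unsigned long long *out)
{
	for (size_t item = 0; item < TOY_NR_ITEMS; item++) {
		out[item] = 0;
		for (size_t cpu = 0; cpu < TOY_NR_CPUS; cpu++)
			out[item] += stats[cpu * TOY_NR_ITEMS + item];
	}
}

/* New pattern: for each cpu, add all of its items (sequential accesses). */
static void toy_fold_cpu_major(const unsigned long long *stats,
			       unsigned long long *out)
{
	memset(out, 0, TOY_NR_ITEMS * sizeof(*out));
	for (size_t cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		for (size_t item = 0; item < TOY_NR_ITEMS; item++)
			out[item] += stats[cpu * TOY_NR_ITEMS + item];
}
```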
Docker creation got faster by more than 2x after the patch.
Result:
                      Before    After
Docker creation time  6.836s    3.25s
cache miss            2.7%      1.41%
perf before:
50.73% docker [kernel.kallsyms] [k] snmp_fold_field
9.07% swapper [kernel.kallsyms] [k] snooze_loop
3.49% docker [kernel.kallsyms] [k] veth_stats_one
2.85% swapper [kernel.kallsyms] [k] _raw_spin_lock
perf after:
10.57% docker docker [.] scanblock
8.37% swapper [kernel.kallsyms] [k] snooze_loop
6.91% docker [kernel.kallsyms] [k] snmp_get_cpu_field
6.67% docker [kernel.kallsyms] [k] veth_stats_one
changes/ideas suggested:
Using buffer in stack (Eric), Usage of memset (David), Using memcpy in
place of unaligned_put (Joe).
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pravin B Shelar says:
====================
openvswitch: Cleanup post vport conversion.
After converting all vports to netdev implementations, there
is no need for some of the vport functionality.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Since all vport types are now backed by netdev, we can directly
use netdev stats. The following patch removes the redundant stats
from vport.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tun info is passed using the skb dst pointer. Now that all vports
have been converted to netdev-based implementations, we can remove
the redundant pointer to tun-info from OVS_CB.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove unused get_name() function pointer from vport ops.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Geneve can benefit from GRO at the device level in a manner similar
to other tunnels, especially as hardware offloads are still emerging.
After this patch, aggregated frames are seen on the tunnel interface.
Single stream throughput nearly doubles in ideal circumstances (on
old hardware).
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>