We can potentially run into a couple of issues with the XDP
bpf_redirect_map() helper. The ri->map in the per CPU storage
can become stale in several ways, mostly due to misuse, where
we can then trigger a use after free on the map:
i) prog A is calling bpf_redirect_map(), returning XDP_REDIRECT
and running on a driver not supporting XDP_REDIRECT yet. The
ri->map on that CPU becomes stale when the XDP program is unloaded
on the driver, and a prog B loaded on a different driver which
supports XDP_REDIRECT return code. prog B would have to omit
calling to bpf_redirect_map() and just return XDP_REDIRECT, which
would then access the freed map in xdp_do_redirect() since not
cleared for that CPU.
ii) prog A is calling bpf_redirect_map(), returning a code other
than XDP_REDIRECT. prog A is then detached, which triggers release
of the map. prog B is attached which, similarly as in i), would
just return XDP_REDIRECT without having called bpf_redirect_map()
and thus be accessing the freed map in xdp_do_redirect() since
not cleared for that CPU.
iii) prog A is attached to generic XDP, calling the bpf_redirect_map()
helper and returning XDP_REDIRECT. xdp_do_generic_redirect() is
currently not handling ri->map (will be fixed by Jesper), so it's
not being reset. Later loading a e.g. native prog B which would,
say, call bpf_xdp_redirect() and then returns XDP_REDIRECT would
find in xdp_do_redirect() that a map was set and uses that causing
use after free on map access.
Fix thus needs to avoid accessing stale ri->map pointers, naive
way would be to call a BPF function from drivers that just resets
it to NULL for all XDP return codes but XDP_REDIRECT and including
XDP_REDIRECT for drivers not supporting it yet (and let ri->map
being handled in xdp_do_generic_redirect()). There is a less
intrusive way w/o letting drivers call a reset for each BPF run.
The verifier knows we're calling into bpf_xdp_redirect_map()
helper, so it can do a small insn rewrite transparent to the prog
itself in the sense that it fills R4 with a pointer to the own
bpf_prog. We have that pointer at verification time anyway and
R4 is allowed to be used as per calling convention we scratch
R0 to R5 anyway, so they become inaccessible and program cannot
read them prior to a write. Then, the helper would store the prog
pointer in the current CPUs struct redirect_info. Later in
xdp_do_*_redirect() we check whether the redirect_info's prog
pointer is the same as passed xdp_prog pointer, and if that's
the case then all good, since the prog holds a ref on the map
anyway, so it is always valid at that point in time and must
have a reference count of at least 1. If in the unlikely case
they are not equal, it means we got a stale pointer, so we clear
and bail out right there. Also do reset map and the owning prog
in bpf_xdp_redirect(), so that bpf_xdp_redirect_map() and
bpf_xdp_redirect() won't get mixed up, only the last call should
take precedence. A tc bpf_redirect() doesn't use map anywhere
yet, so no need to clear it there since never accessed in that
layer.
Note that in case the prog is released, and thus the map as
well we're still under RCU read critical section at that time
and have preemption disabled as well. Once we commit with the
__dev_map_insert_ctx() from xdp_do_redirect_map() and set the
map to ri->map_to_flush, we still wait for a xdp_do_flush_map()
to finish in devmap dismantle time once flush_needed bit is set,
so that is fine.
Fixes: 97f91a7cf0 ("bpf: add bpf_redirect_map helper routine")
Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
1) Support ipv6 checksum offload in sunvnet driver, from Shannon
Nelson.
2) Move to RB-tree instead of custom AVL code in inetpeer, from Eric
Dumazet.
3) Allow generic XDP to work on virtual devices, from John Fastabend.
4) Add bpf device maps and XDP_REDIRECT, which can be used to build
arbitrary switching frameworks using XDP. From John Fastabend.
5) Remove UFO offloads from the tree, gave us little other than bugs.
6) Remove the IPSEC flow cache, from Florian Westphal.
7) Support ipv6 route offload in mlxsw driver.
8) Support VF representors in bnxt_en, from Sathya Perla.
9) Add support for forward error correction modes to ethtool, from
Vidya Sagar Ravipati.
10) Add time filter for packet scheduler action dumping, from Jamal Hadi
Salim.
11) Extend the zerocopy sendmsg() used by virtio and tap to regular
sockets via MSG_ZEROCOPY. From Willem de Bruijn.
12) Significantly rework value tracking in the BPF verifier, from Edward
Cree.
13) Add new jump instructions to eBPF, from Daniel Borkmann.
14) Rework rtnetlink plumbing so that operations can be run without
taking the RTNL semaphore. From Florian Westphal.
15) Support XDP in tap driver, from Jason Wang.
16) Add 32-bit eBPF JIT for ARM, from Shubham Bansal.
17) Add Huawei hinic ethernet driver.
18) Allow to report MD5 keys in TCP inet_diag dumps, from Ivan
Delalande.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1780 commits)
i40e: point wb_desc at the nvm_wb_desc during i40e_read_nvm_aq
i40e: avoid NVM acquire deadlock during NVM update
drivers: net: xgene: Remove return statement from void function
drivers: net: xgene: Configure tx/rx delay for ACPI
drivers: net: xgene: Read tx/rx delay for ACPI
rocker: fix kcalloc parameter order
rds: Fix non-atomic operation on shared flag variable
net: sched: don't use GFP_KERNEL under spin lock
vhost_net: correctly check tx avail during rx busy polling
net: mdio-mux: add mdio_mux parameter to mdio_mux_init()
rxrpc: Make service connection lookup always check for retry
net: stmmac: Delete dead code for MDIO registration
gianfar: Fix Tx flow control deactivation
cxgb4: Ignore MPS_TX_INT_CAUSE[Bubble] for T6
cxgb4: Fix pause frame count in t4_get_port_stats
cxgb4: fix memory leak
tun: rename generic_xdp to skb_xdp
tun: reserve extra headroom only when XDP is set
net: dsa: bcm_sf2: Configure IMP port TC2QOS mapping
net: dsa: bcm_sf2: Advertise number of egress queues
...
- Introduce fwnode operations for all of the separate types of
"firmware nodes" that can be handled by the device properties
framework and drop the type field from struct fwnode_handle
(Sakari Ailus, Arnd Bergmann).
- Make the device properties framework use const fwnode arguments
where possible (Sakari Ailus).
- Add a helper for the consolidated handling of node references
to the device properties framework (Sakari Ailus).
- Switch over the ACPI part of the device properties framework
to the new UUID API (Andy Shevchenko).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJZrcHoAAoJEILEb/54YlRxVH4P/i7MVmWxZW1qosqt8NbI+kqu
rjxBiQ1YaPuwWiZk5LMRQWIr4Y52v+8uwoVAoQbpfkpQpxpUtIApqFGGHkOK091S
6wcwdAJv78m7dQGJZ96nQkBdw+qCUG+s9L3KMfXYiipwyG7bg4BVcs5jZcIqcZ4F
2xecG6DMn4ESwFbZyVULWyQh50tSBztaHEG6AU2T/07yXU3RNJmwAVVZzpHdtA80
mDbWcCFjcmhrpPa0Aq6MrSMjKso1zd8Es+xwYhXsIQpD1l0HhLLQ0X4veSPcPG4B
aSNEYuribpvZ2FIRti7H7gi/F+Arm9vPdc9WHbOPLOIF1z+GJKiqjBuxUrfXKPqG
v1W3f1bcApe9DfmC5z1wZBi2d7thQOzRFfc8WRrMybQ6z1MAqqe5PfAlgpMFmL22
8ZCzzXIBUsfUjVlwYBvgkKvpLioEl88otWGdhewWY6F+DZ8+vPyvrpi15P36Xgos
ijX89cvyfze3m5GW08hQ6DTOVvaFoMyucYfSo6/MBamw9fbUgiEgBfUAsQyb3sRU
8g1KrwkAX8KFmoocX/AVjvwVBaKNdYeJ9Gy6EItAPxNl+F1q6vjkO0r/VeSrO1KW
3GRqw5MZP35DD9IRo4DTAjwtNVkgIUjpG/hfB7l3PFdDxWfeiM5tf2zMExhT0nIR
h8s8mn61KZp0gpsE02FS
=0rnk
-----END PGP SIGNATURE-----
Merge tag 'devprop-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull device properties framework updates from Rafael Wysocki:
"These introduce fwnode operations for all of the separate types of
'firmware nodes' that can be handled by the device properties
framework, make the framework use const fwnode arguments all over, add
a helper for the consolidated handling of node references and switch
over the framework to the new UUID API.
Specifics:
- Introduce fwnode operations for all of the separate types of
'firmware nodes' that can be handled by the device properties
framework and drop the type field from struct fwnode_handle (Sakari
Ailus, Arnd Bergmann).
- Make the device properties framework use const fwnode arguments
where possible (Sakari Ailus).
- Add a helper for the consolidated handling of node references to
the device properties framework (Sakari Ailus).
- Switch over the ACPI part of the device properties framework to the
new UUID API (Andy Shevchenko)"
* tag 'devprop-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: device property: Switch to use new generic UUID API
device property: export irqchip_fwnode_ops
device property: Introduce fwnode_property_get_reference_args
device property: Constify fwnode property API
device property: Constify argument to pset fwnode backend
ACPI: Constify internal fwnode arguments
ACPI: Constify acpi_bus helper functions, switch to macros
ACPI: Prepare for constifying acpi_get_next_subnode() fwnode argument
device property: Get rid of struct fwnode_handle type field
ACPI: Use IS_ERR_OR_NULL() instead of non-NULL check in is_acpi_data_node()
- Drop the P-state selection algorithm based on a PID controller
from intel_pstate and make it use the same P-state selection
method (based on the CPU load) for all types of systems in the
active mode (Rafael Wysocki, Srinivas Pandruvada).
- Rework the cpufreq core and governors to make it possible to
take cross-CPU utilization updates into account and modify the
schedutil governor to actually do so (Viresh Kumar).
- Clean up the handling of transition latency information in the
cpufreq core and untangle it from the information on which drivers
cannot do dynamic frequency switching (Viresh Kumar).
- Add support for new SoCs (MT2701/MT7623 and MT7622) to the
mediatek cpufreq driver and update its DT bindings (Sean Wang).
- Modify the cpufreq dt-platdev driver to autimatically create
cpufreq devices for the new (v2) Operating Performance Points
(OPP) DT bindings and update its whitelist of supported systems
(Viresh Kumar, Shubhrajyoti Datta, Marc Gonzalez, Khiem Nguyen,
Finley Xiao).
- Add support for Ux500 to the cpufreq-dt driver and drop the
obsolete dbx500 cpufreq driver (Linus Walleij, Arnd Bergmann).
- Add new SoC (R8A7795) support to the cpufreq rcar driver (Khiem
Nguyen).
- Fix and clean up assorted issues in the cpufreq drivers and core
(Arvind Yadav, Christophe Jaillet, Colin Ian King, Gustavo Silva,
Julia Lawall, Leonard Crestez, Rob Herring, Sudeep Holla).
- Update the IO-wait boost handling in the schedutil governor to
make it less aggressive (Joel Fernandes).
- Rework system suspend diagnostics to make it print fewer messages
to the kernel log by default, add a sysfs knob to allow more
suspend-related messages to be printed and add Low Power S0 Idle
constraints checks to the ACPI suspend-to-idle code (Rafael
Wysocki, Srinivas Pandruvada).
- Prefer suspend-to-idle over S3 on ACPI-based systems with the
ACPI_FADT_LOW_POWER_S0 flag set and the Low Power Idle S0 _DSM
interface present in the ACPI tables (Rafael Wysocki).
- Update documentation related to system sleep and rename a number
of items in the code to make it cleare that they are related to
suspend-to-idle (Rafael Wysocki).
- Export a variable allowing device drivers to check the target
system sleep state from the core system suspend code (Florian
Fainelli).
- Clean up the cpuidle subsystem to handle the polling state on
x86 in a more straightforward way and to use %pOF instead of
full_name (Rafael Wysocki, Rob Herring).
- Update the devfreq framework to fix and clean up a few minor
issues (Chanwoo Choi, Rob Herring).
- Extend diagnostics in the generic power domains (genpd) framework
and clean it up slightly (Thara Gopinath, Rob Herring).
- Fix and clean up a couple of issues in the operating performance
points (OPP) framework (Viresh Kumar, Waldemar Rymarkiewicz).
- Add support for RV1108 to the rockchip-io Adaptive Voltage Scaling
(AVS) driver (David Wu).
- Fix the usage of notifiers in CPU power management on some
platforms (Alex Shi).
- Update the pm-graph system suspend/hibernation and boot profiling
utility (Todd Brandt).
- Make it possible to run the cpupower utility without CPU0 (Prarit
Bhargava).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJZrcDJAAoJEILEb/54YlRx9FUQAIUKvWBAARc61ZIZXjbqZF1v
aEMOBuksFns0CMekdptSic6n4wc81E/XYMS8yDhOOMpyDzfAZsTWjmu+gKwN7w3l
E/yf/NVlhob9JZ7MqGgqD4EUFfFIaKBXPlWFdDi2rdCUXE2L8xJ7rla8i7zyZlc5
pYHfAppBbF4qUcEY4OoOVOOGRZCfMdiLXj0iZOhMX8Y6yLBRk/AjnVADYsF33hoj
gBEfomU+H0K5V8nQEp0ZFKDArPwL+oElHQj6i+nxBpGfPM5evvLXhHOyR6AsldJ5
J4YI1kMuQNSCmvHMqOTxTYyJf8Jcf3Fj4wcjwaVMVGceY1lz6McAKknnFnCqCvz+
mskn84gFCBCM8EoJDqRf0b9MQHcuRyQKM+yw4tjnR9r8yd32erb85ZWFHcPWYhCT
fZatNOwFFv2MU+2vo5J3yeUNSWIKT+uBjy+tKPbrDkUwpKZVRj3Oj+hP3Mq9NE8U
YBqltsj7tmrdA634zI8C7jfS6wF221S0fId/iPszwmPJaVn/lq8Ror7pWL5YI8U7
SCJFjiqDiGmAcQEkuWwFAQnscZkyHpO+Y3A+jfXl/izoaZETaI5+ceIHBaocm3+5
XrOOpHS3ik8EHf9ji0KFCKZ/pYDwllday3cBQPWo3sMIzpQ2lrjbqdnE1cVnBrld
OtHZAeD/jLUXuY6XW2jN
=mAiV
-----END PGP SIGNATURE-----
Merge tag 'pm-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"This time (again) cpufreq gets the majority of changes which mostly
are driver updates (including a major consolidation of intel_pstate),
some schedutil governor modifications and core cleanups.
There also are some changes in the system suspend area, mostly related
to diagnostics and debug messages plus some renames of things related
to suspend-to-idle. One major change here is that suspend-to-idle is
now going to be preferred over S3 on systems where the ACPI tables
indicate to do so and provide requsite support (the Low Power Idle S0
_DSM in particular). The system sleep documentation and the tools
related to it are updated too.
The rest is a few cpuidle changes (nothing major), devfreq updates,
generic power domains (genpd) framework updates and a few assorted
modifications elsewhere.
Specifics:
- Drop the P-state selection algorithm based on a PID controller from
intel_pstate and make it use the same P-state selection method
(based on the CPU load) for all types of systems in the active mode
(Rafael Wysocki, Srinivas Pandruvada).
- Rework the cpufreq core and governors to make it possible to take
cross-CPU utilization updates into account and modify the schedutil
governor to actually do so (Viresh Kumar).
- Clean up the handling of transition latency information in the
cpufreq core and untangle it from the information on which drivers
cannot do dynamic frequency switching (Viresh Kumar).
- Add support for new SoCs (MT2701/MT7623 and MT7622) to the mediatek
cpufreq driver and update its DT bindings (Sean Wang).
- Modify the cpufreq dt-platdev driver to autimatically create
cpufreq devices for the new (v2) Operating Performance Points (OPP)
DT bindings and update its whitelist of supported systems (Viresh
Kumar, Shubhrajyoti Datta, Marc Gonzalez, Khiem Nguyen, Finley
Xiao).
- Add support for Ux500 to the cpufreq-dt driver and drop the
obsolete dbx500 cpufreq driver (Linus Walleij, Arnd Bergmann).
- Add new SoC (R8A7795) support to the cpufreq rcar driver (Khiem
Nguyen).
- Fix and clean up assorted issues in the cpufreq drivers and core
(Arvind Yadav, Christophe Jaillet, Colin Ian King, Gustavo Silva,
Julia Lawall, Leonard Crestez, Rob Herring, Sudeep Holla).
- Update the IO-wait boost handling in the schedutil governor to make
it less aggressive (Joel Fernandes).
- Rework system suspend diagnostics to make it print fewer messages
to the kernel log by default, add a sysfs knob to allow more
suspend-related messages to be printed and add Low Power S0 Idle
constraints checks to the ACPI suspend-to-idle code (Rafael
Wysocki, Srinivas Pandruvada).
- Prefer suspend-to-idle over S3 on ACPI-based systems with the
ACPI_FADT_LOW_POWER_S0 flag set and the Low Power Idle S0 _DSM
interface present in the ACPI tables (Rafael Wysocki).
- Update documentation related to system sleep and rename a number of
items in the code to make it cleare that they are related to
suspend-to-idle (Rafael Wysocki).
- Export a variable allowing device drivers to check the target
system sleep state from the core system suspend code (Florian
Fainelli).
- Clean up the cpuidle subsystem to handle the polling state on x86
in a more straightforward way and to use %pOF instead of full_name
(Rafael Wysocki, Rob Herring).
- Update the devfreq framework to fix and clean up a few minor issues
(Chanwoo Choi, Rob Herring).
- Extend diagnostics in the generic power domains (genpd) framework
and clean it up slightly (Thara Gopinath, Rob Herring).
- Fix and clean up a couple of issues in the operating performance
points (OPP) framework (Viresh Kumar, Waldemar Rymarkiewicz).
- Add support for RV1108 to the rockchip-io Adaptive Voltage Scaling
(AVS) driver (David Wu).
- Fix the usage of notifiers in CPU power management on some
platforms (Alex Shi).
- Update the pm-graph system suspend/hibernation and boot profiling
utility (Todd Brandt).
- Make it possible to run the cpupower utility without CPU0 (Prarit
Bhargava)"
* tag 'pm-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (87 commits)
cpuidle: Make drivers initialize polling state
cpuidle: Move polling state initialization code to separate file
cpuidle: Eliminate the CPUIDLE_DRIVER_STATE_START symbol
cpufreq: imx6q: Fix imx6sx low frequency support
cpufreq: speedstep-lib: make several arrays static, makes code smaller
PM: docs: Delete the obsolete states.txt document
PM: docs: Describe high-level PM strategies and sleep states
PM / devfreq: Fix memory leak when fail to register device
PM / devfreq: Add dependency on PM_OPP
PM / devfreq: Move private devfreq_update_stats() into devfreq
PM / devfreq: Convert to using %pOF instead of full_name
PM / AVS: rockchip-io: add io selectors and supplies for RV1108
cpufreq: ti: Fix 'of_node_put' being called twice in error handling path
cpufreq: dt-platdev: Drop few entries from whitelist
cpufreq: dt-platdev: Automatically create cpufreq device with OPP v2
ARM: ux500: don't select CPUFREQ_DT
cpuidle: Convert to using %pOF instead of full_name
cpufreq: Convert to using %pOF instead of full_name
PM / Domains: Convert to using %pOF instead of full_name
cpufreq: Cap the default transition delay value to 10 ms
...
Here is the big char/misc driver update for 4.14-rc1.
Lots of different stuff in here, it's been an active development cycle
for some reason. Highlights are:
- updated binder driver, this brings binder up to date with what
shipped in the Android O release, plus some more changes that
happened since then that are in the Android development trees.
- coresight updates and fixes
- mux driver file renames to be a bit "nicer"
- intel_th driver updates
- normal set of hyper-v updates and changes
- small fpga subsystem and driver updates
- lots of const code changes all over the driver trees
- extcon driver updates
- fmc driver subsystem upadates
- w1 subsystem minor reworks and new features and drivers added
- spmi driver updates
Plus a smattering of other minor driver updates and fixes.
All of these have been in linux-next with no reported issues for a
while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWa1+Ew8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+yl26wCgquufNylfhxr65NbJrovduJYzRnUAniCivXg8
bePIh/JI5WxWoHK+wEbY
=hYWx
-----END PGP SIGNATURE-----
Merge tag 'char-misc-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver updates from Greg KH:
"Here is the big char/misc driver update for 4.14-rc1.
Lots of different stuff in here, it's been an active development cycle
for some reason. Highlights are:
- updated binder driver, this brings binder up to date with what
shipped in the Android O release, plus some more changes that
happened since then that are in the Android development trees.
- coresight updates and fixes
- mux driver file renames to be a bit "nicer"
- intel_th driver updates
- normal set of hyper-v updates and changes
- small fpga subsystem and driver updates
- lots of const code changes all over the driver trees
- extcon driver updates
- fmc driver subsystem upadates
- w1 subsystem minor reworks and new features and drivers added
- spmi driver updates
Plus a smattering of other minor driver updates and fixes.
All of these have been in linux-next with no reported issues for a
while"
* tag 'char-misc-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (244 commits)
ANDROID: binder: don't queue async transactions to thread.
ANDROID: binder: don't enqueue death notifications to thread todo.
ANDROID: binder: Don't BUG_ON(!spin_is_locked()).
ANDROID: binder: Add BINDER_GET_NODE_DEBUG_INFO ioctl
ANDROID: binder: push new transactions to waiting threads.
ANDROID: binder: remove proc waitqueue
android: binder: Add page usage in binder stats
android: binder: fixup crash introduced by moving buffer hdr
drivers: w1: add hwmon temp support for w1_therm
drivers: w1: refactor w1_slave_show to make the temp reading functionality separate
drivers: w1: add hwmon support structures
eeprom: idt_89hpesx: Support both ACPI and OF probing
mcb: Fix an error handling path in 'chameleon_parse_cells()'
MCB: add support for SC31 to mcb-lpc
mux: make device_type const
char: virtio: constify attribute_group structures.
Documentation/ABI: document the nvmem sysfs files
lkdtm: fix spelling mistake: "incremeted" -> "incremented"
perf: cs-etm: Fix ETMv4 CONFIGR entry in perf.data file
nvmem: include linux/err.h from header
...
- VMAP_STACK support, allowing the kernel stacks to be allocated in
the vmalloc space with a guard page for trapping stack overflows. One
of the patches introduces THREAD_ALIGN and changes the generic
alloc_thread_stack_node() to use this instead of THREAD_SIZE (no
functional change for other architectures)
- Contiguous PTE hugetlb support re-enabled (after being reverted a
couple of times). We now have the semantics agreed in the generic mm
layer together with API improvements so that the architecture code can
detect between contiguous and non-contiguous huge PTEs
- Initial support for persistent memory on ARM: DC CVAP instruction
exposed to user space (HWCAP) and the in-kernel pmem API implemented
- raid6 improvements for arm64: faster algorithm for the delta syndrome
and implementation of the recovery routines using Neon
- FP/SIMD refactoring and removal of support for Neon in interrupt
context. This is in preparation for full SVE support
- PTE accessors converted from inline asm to cmpxchg so that we can
use LSE atomics if available (ARMv8.1)
- Perf support for Cortex-A35 and A73
- Non-urgent fixes and cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAlmuunYACgkQa9axLQDI
XvEH9BAAo8V94GOMkX6HkT+2hjkl7DQ9krjumzmfzLV5AdgHMMzBNozmWKOCzgh0
yaxRcTUju3EyNeKhADr7yLiKDH8fnRPmYEJiVrwfgo7MaPApaCorr7LLIXfPGuxe
DTBHw+oxRMjlmaHeATX4PBWfQxAx+vjjhHqv3Qpmvdm4nYqR+0hZomH2BNsu64fk
AkSeUCxfCEyzSFIKuQM04M4zhSSZHz1tDxWI0b0RcK73qqEOuYZNkn6qxSKP5J4X
b2Y2U8nmxJ5C2fXpDYZaK9shiJ4Vu7X3Ocf/M7hsJzGY5z4dhnmUmxpHROaNiSvo
hCx7POYKyAPovps7zMSqcdsujkqOIQO8RHp4zGXx/pIr1RumjIiCY+RGpUYGibvU
N4Px5hZNneuHaPZZ+sWjOOdNB28xyzeUp2UK9Bb6uHB+/3xssMAD8Fd/b2ZLnS6a
YW3wrZmqA+ckfETsSRibabTs/ayqYHs2SDVwnlDJGtn+4Pw8oQpwGrwokxLQuuw3
uF2sNEPhJz+dcy21q3udYAQE1qOJBlLqTptgP96CHoVqh8X6nYSi5obT7y30ln3n
dhpZGOdi6R8YOouxgXS3Wg07pxn444L/VzDw5ku/5DkdryPOZCSRbk/2t8If6oDM
2VD6PCbTx3hsGc7SZ7FdSwIysD2j446u40OMGdH2iLB5jWBwyOM=
=vd0/
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
- VMAP_STACK support, allowing the kernel stacks to be allocated in the
vmalloc space with a guard page for trapping stack overflows. One of
the patches introduces THREAD_ALIGN and changes the generic
alloc_thread_stack_node() to use this instead of THREAD_SIZE (no
functional change for other architectures)
- Contiguous PTE hugetlb support re-enabled (after being reverted a
couple of times). We now have the semantics agreed in the generic mm
layer together with API improvements so that the architecture code
can detect between contiguous and non-contiguous huge PTEs
- Initial support for persistent memory on ARM: DC CVAP instruction
exposed to user space (HWCAP) and the in-kernel pmem API implemented
- raid6 improvements for arm64: faster algorithm for the delta syndrome
and implementation of the recovery routines using Neon
- FP/SIMD refactoring and removal of support for Neon in interrupt
context. This is in preparation for full SVE support
- PTE accessors converted from inline asm to cmpxchg so that we can use
LSE atomics if available (ARMv8.1)
- Perf support for Cortex-A35 and A73
- Non-urgent fixes and cleanups
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (75 commits)
arm64: cleanup {COMPAT_,}SET_PERSONALITY() macro
arm64: introduce separated bits for mm_context_t flags
arm64: hugetlb: Cleanup setup_hugepagesz
arm64: Re-enable support for contiguous hugepages
arm64: hugetlb: Override set_huge_swap_pte_at() to support contiguous hugepages
arm64: hugetlb: Override huge_pte_clear() to support contiguous hugepages
arm64: hugetlb: Handle swap entries in huge_pte_offset() for contiguous hugepages
arm64: hugetlb: Add break-before-make logic for contiguous entries
arm64: hugetlb: Spring clean huge pte accessors
arm64: hugetlb: Introduce pte_pgprot helper
arm64: hugetlb: set_huge_pte_at Add WARN_ON on !pte_present
arm64: kexec: have own crash_smp_send_stop() for crash dump for nonpanic cores
arm64: dma-mapping: Mark atomic_pool as __ro_after_init
arm64: dma-mapping: Do not pass data to gen_pool_set_algo()
arm64: Remove the !CONFIG_ARM64_HW_AFDBM alternative code paths
arm64: Ignore hardware dirty bit updates in ptep_set_wrprotect()
arm64: Move PTE_RDONLY bit handling out of set_pte_at()
kvm: arm64: Convert kvm_set_s2pte_readonly() from inline asm to cmpxchg()
arm64: Convert pte handling from inline asm to using (cmp)xchg
arm64: neon/efi: Make EFI fpsimd save/restore variables static
...
Pull x86 cache quality monitoring update from Thomas Gleixner:
"This update provides a complete rewrite of the Cache Quality
Monitoring (CQM) facility.
The existing CQM support was duct taped into perf with a lot of issues
and the attempts to fix those turned out to be incomplete and
horrible.
After lengthy discussions it was decided to integrate the CQM support
into the Resource Director Technology (RDT) facility, which is the
obvious choise as in hardware CQM is part of RDT. This allowed to add
Memory Bandwidth Monitoring support on top.
As a result the mechanisms for allocating cache/memory bandwidth and
the corresponding monitoring mechanisms are integrated into a single
management facility with a consistent user interface"
* 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
x86/intel_rdt: Turn off most RDT features on Skylake
x86/intel_rdt: Add command line options for resource director technology
x86/intel_rdt: Move special case code for Haswell to a quirk function
x86/intel_rdt: Remove redundant ternary operator on return
x86/intel_rdt/cqm: Improve limbo list processing
x86/intel_rdt/mbm: Fix MBM overflow handler during CPU hotplug
x86/intel_rdt: Modify the intel_pqr_state for better performance
x86/intel_rdt/cqm: Clear the default RMID during hotcpu
x86/intel_rdt: Show bitmask of shareable resource with other executing units
x86/intel_rdt/mbm: Handle counter overflow
x86/intel_rdt/mbm: Add mbm counter initialization
x86/intel_rdt/mbm: Basic counting of MBM events (total and local)
x86/intel_rdt/cqm: Add CPU hotplug support
x86/intel_rdt/cqm: Add sched_in support
x86/intel_rdt: Introduce rdt_enable_key for scheduling
x86/intel_rdt/cqm: Add mount,umount support
x86/intel_rdt/cqm: Add rmdir support
x86/intel_rdt: Separate the ctrl bits from rmdir
x86/intel_rdt/cqm: Add mon_data
x86/intel_rdt: Prepare for RDT monitor data support
...
Pull CPU hotplug fix from Thomas Gleixner:
"A single fix to handle the removal of the first dynamic CPU hotplug
state correctly"
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
smp/hotplug: Handle removal correctly in cpuhp_store_callbacks()
Pull irq updates from Thomas Gleixner:
"The interrupt subsystem delivers this time:
- Refactoring of the GIC-V3 driver to prepare for the GIC-V4 support
- Initial GIC-V4 support
- Consolidation of the FSL MSI support
- Utilize the effective affinity interface in various ARM irqchip
drivers
- Yet another interrupt chip driver (UniPhier AIDET)
- Bulk conversion of the irq chip driver to use %pOF
- The usual small fixes and improvements all over the place"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (77 commits)
irqchip/ls-scfg-msi: Add MSI affinity support
irqchip/ls-scfg-msi: Add LS1043a v1.1 MSI support
irqchip/ls-scfg-msi: Add LS1046a MSI support
arm64: dts: ls1046a: Add MSI dts node
arm64: dts: ls1043a: Share all MSIs
arm: dts: ls1021a: Share all MSIs
arm64: dts: ls1043a: Fix typo of MSI compatible string
arm: dts: ls1021a: Fix typo of MSI compatible string
irqchip/ls-scfg-msi: Fix typo of MSI compatible strings
irqchip/irq-bcm7120-l2: Use correct I/O accessors for irq_fwd_mask
irqchip/mmp: Make mmp_intc_conf const
irqchip/gic: Make irq_chip const
irqchip/gic-v3: Advertise GICv4 support to KVM
irqchip/gic-v4: Enable low-level GICv4 operations
irqchip/gic-v4: Add some basic documentation
irqchip/gic-v4: Add VLPI configuration interface
irqchip/gic-v4: Add VPE command interface
irqchip/gic-v4: Add per-VM VPE domain creation
irqchip/gic-v3-its: Set implementation defined bit to enable VLPIs
irqchip/gic-v3-its: Allow doorbell interrupts to be injected/cleared
...
Pull timer fixes from Thomas Gleixner:
"A rather small update for the time(r) subsystem:
- A new clocksource driver IMX-TPM
- Minor fixes to the alarmtimer facility
- Device tree cleanups for Renesas drivers
- A new kselftest and fixes for the timer related tests
- Conversion of the clocksource drivers to use %pOF
- Use the proper helpers to access rlimits in the posix-cpu-timer
code"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
alarmtimer: Ensure RTC module is not unloaded
clocksource: Convert to using %pOF instead of full_name
clocksource/drivers/bcm2835: Remove message for a memory allocation failure
devicetree: bindings: Remove deprecated properties
devicetree: bindings: Remove unused 32-bit CMT bindings
devicetree: bindings: Deprecate property, update example
devicetree: bindings: r8a73a4 and R-Car Gen2 CMT bindings
devicetree: bindings: R-Car Gen2 CMT0 and CMT1 bindings
devicetree: bindings: Remove sh7372 CMT binding
clocksource/drivers/imx-tpm: Add imx tpm timer support
dt-bindings: timer: Add nxp tpm timer binding doc
posix-cpu-timers: Use dedicated helper to access rlimit values
alarmtimer: Fix unavailable wake-up source in sysfs
timekeeping: Use proper timekeeper for debug code
kselftests: timers: set-timer-lat: Add one-shot timer test cases
kselftests: timers: set-timer-lat: Tweak reporting when timer fires early
kselftests: timers: freq-step: Fix build warning
kselftests: timers: freq-step: Define ADJ_SETOFFSET if device has older kernel headers
Pull x86 mm changes from Ingo Molnar:
"PCID support, 5-level paging support, Secure Memory Encryption support
The main changes in this cycle are support for three new, complex
hardware features of x86 CPUs:
- Add 5-level paging support, which is a new hardware feature on
upcoming Intel CPUs allowing up to 128 PB of virtual address space
and 4 PB of physical RAM space - a 512-fold increase over the old
limits. (Supercomputers of the future forecasting hurricanes on an
ever warming planet can certainly make good use of more RAM.)
Many of the necessary changes went upstream in previous cycles,
v4.14 is the first kernel that can enable 5-level paging.
This feature is activated via CONFIG_X86_5LEVEL=y - disabled by
default.
(By Kirill A. Shutemov)
- Add 'encrypted memory' support, which is a new hardware feature on
upcoming AMD CPUs ('Secure Memory Encryption', SME) allowing system
RAM to be encrypted and decrypted (mostly) transparently by the
CPU, with a little help from the kernel to transition to/from
encrypted RAM. Such RAM should be more secure against various
attacks like RAM access via the memory bus and should make the
radio signature of memory bus traffic harder to intercept (and
decrypt) as well.
This feature is activated via CONFIG_AMD_MEM_ENCRYPT=y - disabled
by default.
(By Tom Lendacky)
- Enable PCID optimized TLB flushing on newer Intel CPUs: PCID is a
hardware feature that attaches an address space tag to TLB entries
and thus allows to skip TLB flushing in many cases, even if we
switch mm's.
(By Andy Lutomirski)
All three of these features were in the works for a long time, and
it's coincidence of the three independent development paths that they
are all enabled in v4.14 at once"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (65 commits)
x86/mm: Enable RCU based page table freeing (CONFIG_HAVE_RCU_TABLE_FREE=y)
x86/mm: Use pr_cont() in dump_pagetable()
x86/mm: Fix SME encryption stack ptr handling
kvm/x86: Avoid clearing the C-bit in rsvd_bits()
x86/CPU: Align CR3 defines
x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
acpi, x86/mm: Remove encryption mask from ACPI page protection type
x86/mm, kexec: Fix memory corruption with SME on successive kexecs
x86/mm/pkeys: Fix typo in Documentation/x86/protection-keys.txt
x86/mm/dump_pagetables: Speed up page tables dump for CONFIG_KASAN=y
x86/mm: Implement PCID based optimization: try to preserve old TLB entries using PCID
x86: Enable 5-level paging support via CONFIG_X86_5LEVEL=y
x86/mm: Allow userspace have mappings above 47-bit
x86/mm: Prepare to expose larger address space to userspace
x86/mpx: Do not allow MPX if we have mappings above 47-bit
x86/mm: Rename tasksize_32bit/64bit to task_size_32bit/64bit()
x86/xen: Redefine XEN_ELFNOTE_INIT_P2M using PUD_SIZE * PTRS_PER_PUD
x86/mm/dump_pagetables: Fix printout of p4d level
x86/mm/dump_pagetables: Generalize address normalization
x86/boot: Fix memremap() related build failure
...
Pull locking updates from Ingo Molnar:
- Add 'cross-release' support to lockdep, which allows APIs like
completions, where it's not the 'owner' who releases the lock, to be
tracked. It's all activated automatically under
CONFIG_PROVE_LOCKING=y.
- Clean up (restructure) the x86 atomics op implementation to be more
readable, in preparation of KASAN annotations. (Dmitry Vyukov)
- Fix static keys (Paolo Bonzini)
- Add killable versions of down_read() et al (Kirill Tkhai)
- Rework and fix jump_label locking (Marc Zyngier, Paolo Bonzini)
- Rework (and fix) tlb_flush_pending() barriers (Peter Zijlstra)
- Remove smp_mb__before_spinlock() and convert its usages, introduce
smp_mb__after_spinlock() (Peter Zijlstra)
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (56 commits)
locking/lockdep/selftests: Fix mixed read-write ABBA tests
sched/completion: Avoid unnecessary stack allocation for COMPLETION_INITIALIZER_ONSTACK()
acpi/nfit: Fix COMPLETION_INITIALIZER_ONSTACK() abuse
locking/pvqspinlock: Relax cmpxchg's to improve performance on some architectures
smp: Avoid using two cache lines for struct call_single_data
locking/lockdep: Untangle xhlock history save/restore from task independence
locking/refcounts, x86/asm: Disable CONFIG_ARCH_HAS_REFCOUNT for the time being
futex: Remove duplicated code and fix undefined behaviour
Documentation/locking/atomic: Finish the document...
locking/lockdep: Fix workqueue crossrelease annotation
workqueue/lockdep: 'Fix' flush_work() annotation
locking/lockdep/selftests: Add mixed read-write ABBA tests
mm, locking/barriers: Clarify tlb_flush_pending() barriers
locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE and CONFIG_LOCKDEP_COMPLETIONS truly non-interactive
locking/lockdep: Explicitly initialize wq_barrier::done::map
locking/lockdep: Rename CONFIG_LOCKDEP_COMPLETE to CONFIG_LOCKDEP_COMPLETIONS
locking/lockdep: Reword title of LOCKDEP_CROSSRELEASE config
locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE part of CONFIG_PROVE_LOCKING
locking/refcounts, x86/asm: Implement fast refcount overflow protection
locking/lockdep: Fix the rollback and overwrite detection logic in crossrelease
...
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle were:
- fix affine wakeups (Peter Zijlstra)
- improve CPU onlining (and general bootup) scalability on systems
with ridiculous number (thousands) of CPUs (Peter Zijlstra)
- sched/numa updates (Rik van Riel)
- sched/deadline updates (Byungchul Park)
- sched/cpufreq enhancements and related cleanups (Viresh Kumar)
- sched/debug enhancements (Xie XiuQi)
- various fixes"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
sched/debug: Optimize sched_domain sysctl generation
sched/topology: Avoid pointless rebuild
sched/topology, cpuset: Avoid spurious/wrong domain rebuilds
sched/topology: Improve comments
sched/topology: Fix memory leak in __sdt_alloc()
sched/completion: Document that reinit_completion() must be called after complete_all()
sched/autogroup: Fix error reporting printk text in autogroup_create()
sched/fair: Fix wake_affine() for !NUMA_BALANCING
sched/debug: Intruduce task_state_to_char() helper function
sched/debug: Show task state in /proc/sched_debug
sched/debug: Use task_pid_nr_ns in /proc/$pid/sched
sched/core: Remove unnecessary initialization init_idle_bootup_task()
sched/deadline: Change return value of cpudl_find()
sched/deadline: Make find_later_rq() choose a closer CPU in topology
sched/numa: Scale scan period with tasks in group and shared/private
sched/numa: Slow down scan rate if shared faults dominate
sched/pelt: Fix false running accounting
sched: Mark pick_next_task_dl() and build_sched_domain() as static
sched/cpupri: Don't re-initialize 'struct cpupri'
sched/deadline: Don't re-initialize 'struct cpudl'
...
Pull perf updates from Ingo Molnar:
"Kernel side changes:
- Add branch type profiling/tracing support. (Jin Yao)
- Add the PERF_SAMPLE_PHYS_ADDR ABI to allow the tracing/profiling of
physical memory addresses, where the PMU supports it. (Kan Liang)
- Export some PMU capability details in the new
/sys/bus/event_source/devices/cpu/caps/ sysfs directory. (Andi
Kleen)
- Aux data fixes and updates (Will Deacon)
- kprobes fixes and updates (Masami Hiramatsu)
- AMD uncore PMU driver fixes and updates (Janakarajan Natarajan)
On the tooling side, here's a (limited!) list of highlights - there
were many other changes that I could not list, see the shortlog and
git history for details:
UI improvements:
- Implement a visual marker for fused x86 instructions in the
annotate TUI browser, available now in 'perf report', more work
needed to have it available as well in 'perf top' (Jin Yao)
Further explanation from one of Jin's patches:
│ ┌──cmpl $0x0,argp_program_version_hook
81.93 │ ├──je 20
│ │ lock cmpxchg %esi,0x38a9a4(%rip)
│ │↓ jne 29
│ │↓ jmp 43
11.47 │20:└─→cmpxch %esi,0x38a999(%rip)
That means the cmpl+je is a fused instruction pair and they should
be considered together.
- Record the branch type and then show statistics and info about in
callchain entries (Jin Yao)
Example from one of Jin's patches:
# perf record -g -j any,save_type
# perf report --branch-history --stdio --no-children
38.50% div.c:45 [.] main div
|
---main div.c:42 (RET CROSS_2M cycles:2)
compute_flag div.c:28 (cycles:2)
compute_flag div.c:27 (RET CROSS_2M cycles:1)
rand rand.c:28 (cycles:1)
rand rand.c:28 (RET CROSS_2M cycles:1)
__random random.c:298 (cycles:1)
__random random.c:297 (COND_BWD CROSS_2M cycles:1)
__random random.c:295 (cycles:1)
__random random.c:295 (COND_BWD CROSS_2M cycles:1)
__random random.c:295 (cycles:1)
__random random.c:295 (RET CROSS_2M cycles:9)
namespaces support:
- Add initial support for namespaces, using setns to access files in
namespaces, grabbing their build-ids, etc. (Krister Johansen)
perf trace enhancements:
- Beautify pkey_{alloc,free,mprotect} arguments in 'perf trace'
(Arnaldo Carvalho de Melo)
- Add initial 'clone' syscall args beautifier in 'perf trace'
(Arnaldo Carvalho de Melo)
- Ignore 'fd' and 'offset' args for MAP_ANONYMOUS in 'perf trace'
(Arnaldo Carvalho de Melo)
- Beautifiers for the 'cmd' arg of several ioctl types, including:
sound, DRM, KVM, vhost virtio and perf_events. (Arnaldo Carvalho de
Melo)
- Add PERF_SAMPLE_CALLCHAIN and PERF_RECORD_MMAP[2] to 'perf data'
CTF conversion, allowing CTF trace visualization tools to show
callchains and to resolve symbols (Geneviève Bastien)
- Beautify the fcntl syscall, which is an interesting one in the
sense that infrastructure had to be put in place to change the
formatters of some arguments according to the value in a previous
one, i.e. cmd dictates how arg and the syscall return will be
formatted. (Arnaldo Carvalho de Melo
perf stat enhancements:
- Use group read for event groups in 'perf stat', reducing overhead
when groups are defined in the event specification, i.e. when using
{} to enclose a list of events, asking them to be read at the same
time, e.g.: "perf stat -e '{cycles,instructions}'" (Jiri Olsa)
pipe mode improvements:
- Process tracing data in 'perf annotate' pipe mode (David
Carrillo-Cisneros)
- Add header record types to pipe-mode, now this command:
$ perf record -o - -e cycles sleep 1 | perf report --stdio --header
Will show the same as in non-pipe mode, i.e. involving a perf.data
file (David Carrillo-Cisneros)
Vendor specific hardware event support updates/enhancements:
- Update POWER9 vendor events tables (Sukadev Bhattiprolu)
- Add POWER9 PMU events Sukadev (Bhattiprolu)
- Support additional POWER8+ PVR in PMU mapfile (Shriya)
- Add Skylake server uncore JSON vendor events (Andi Kleen)
- Support exporting Intel PT data to sqlite3 with python perf
scripts, this is in addition to the postgresql support that was
already there (Adrian Hunter)"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (253 commits)
perf symbols: Fix plt entry calculation for ARM and AARCH64
perf probe: Fix kprobe blacklist checking condition
perf/x86: Fix caps/ for !Intel
perf/core, x86: Add PERF_SAMPLE_PHYS_ADDR
perf/core, pt, bts: Get rid of itrace_started
perf trace beauty: Beautify pkey_{alloc,free,mprotect} arguments
tools headers: Sync cpu features kernel ABI headers with tooling headers
perf tools: Pass full path of FEATURES_DUMP
perf tools: Robustify detection of clang binary
tools lib: Allow external definition of CC, AR and LD
perf tools: Allow external definition of flex and bison binary names
tools build tests: Don't hardcode gcc name
perf report: Group stat values on global event id
perf values: Zero value buffers
perf values: Fix allocation check
perf values: Fix thread index bug
perf report: Add dump_read function
perf record: Set read_format for inherit_stat
perf c2c: Fix remote HITM detection for Skylake
perf tools: Fix static build with newer toolchains
...
Pull RCU updates from Ingo Molnad:
"The main RCU related changes in this cycle were:
- Removal of spin_unlock_wait()
- SRCU updates
- RCU torture-test updates
- RCU Documentation updates
- Extend the sys_membarrier() ABI with the MEMBARRIER_CMD_PRIVATE_EXPEDITED variant
- Miscellaneous RCU fixes
- CPU-hotplug fixes"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (63 commits)
arch: Remove spin_unlock_wait() arch-specific definitions
locking: Remove spin_unlock_wait() generic definitions
drivers/ata: Replace spin_unlock_wait() with lock/unlock pair
ipc: Replace spin_unlock_wait() with lock/unlock pair
exit: Replace spin_unlock_wait() with lock/unlock pair
completion: Replace spin_unlock_wait() with lock/unlock pair
doc: Set down RCU's scheduling-clock-interrupt needs
doc: No longer allowed to use rcu_dereference on non-pointers
doc: Add RCU files to docbook-generation files
doc: Update memory-barriers.txt for read-to-write dependencies
doc: Update RCU documentation
membarrier: Provide expedited private command
rcu: Remove exports from rcu_idle_exit() and rcu_idle_enter()
rcu: Add warning to rcu_idle_enter() for irqs enabled
rcu: Make rcu_idle_enter() rely on callers disabling irqs
rcu: Add assertions verifying blocked-tasks list
rcu/tracing: Set disable_rcu_irq_enter on rcu_eqs_exit()
rcu: Add TPS() protection for _rcu_barrier_trace strings
rcu: Use idle versions of swait to make idle-hack clear
swait: Add idle variants which don't contribute to load average
...
* pm-sleep:
ACPI / PM: Check low power idle constraints for debug only
PM / s2idle: Rename platform operations structure
PM / s2idle: Rename ->enter_freeze to ->enter_s2idle
PM / s2idle: Rename freeze_state enum and related items
PM / s2idle: Rename PM_SUSPEND_FREEZE to PM_SUSPEND_TO_IDLE
ACPI / PM: Prefer suspend-to-idle over S3 on some systems
platform/x86: intel-hid: Wake up Dell Latitude 7275 from suspend-to-idle
PM / suspend: Define pr_fmt() in suspend.c
PM / suspend: Use mem_sleep_labels[] strings in messages
PM / sleep: Put pm_test under CONFIG_PM_SLEEP_DEBUG
PM / sleep: Check pm_wakeup_pending() in __device_suspend_noirq()
PM / core: Add error argument to dpm_show_time()
PM / core: Split dpm_suspend_noirq() and dpm_resume_noirq()
PM / s2idle: Rearrange the main suspend-to-idle loop
PM / timekeeping: Print debug messages when requested
PM / sleep: Mark suspend/hibernation start and finish
PM / sleep: Do not print debug messages by default
PM / suspend: Export pm_suspend_target_state
* pm-cpufreq-sched:
cpufreq: schedutil: Always process remote callback with slow switching
cpufreq: schedutil: Don't restrict kthread to related_cpus unnecessarily
cpufreq: Return 0 from ->fast_switch() on errors
cpufreq: Simplify cpufreq_can_do_remote_dvfs()
cpufreq: Process remote callbacks from any CPU if the platform permits
sched: cpufreq: Allow remote cpufreq callbacks
cpufreq: schedutil: Use unsigned int for iowait boost
cpufreq: schedutil: Make iowait boost more energy efficient
* pm-cpufreq: (33 commits)
cpufreq: imx6q: Fix imx6sx low frequency support
cpufreq: speedstep-lib: make several arrays static, makes code smaller
cpufreq: ti: Fix 'of_node_put' being called twice in error handling path
cpufreq: dt-platdev: Drop few entries from whitelist
cpufreq: dt-platdev: Automatically create cpufreq device with OPP v2
ARM: ux500: don't select CPUFREQ_DT
cpufreq: Convert to using %pOF instead of full_name
cpufreq: Cap the default transition delay value to 10 ms
cpufreq: dbx500: Delete obsolete driver
mfd: db8500-prcmu: Get rid of cpufreq dependency
cpufreq: enable the DT cpufreq driver on the Ux500
cpufreq: Loongson2: constify platform_device_id
cpufreq: dt: Add r8a7796 support to to use generic cpufreq driver
cpufreq: remove setting of policy->cpu in policy->cpus during init
cpufreq: mediatek: add support of cpufreq to MT7622 SoC
cpufreq: mediatek: add cleanups with the more generic naming
cpufreq: rcar: Add support for R8A7795 SoC
cpufreq: dt: Add rk3328 compatible to use generic cpufreq driver
cpufreq: s5pv210: add missing of_node_put()
cpufreq: Allow dynamic switching with CPUFREQ_ETERNAL latency
...
* pm-core:
PM / wakeup: Set power.can_wakeup if wakeup_sysfs_add() fails
* pm-opp:
PM / OPP: Fix get sharing CPUs when hotplug is used
PM / OPP: OF: Use pr_debug() instead of pr_err() while adding OPP table
* pm-domains:
PM / Domains: Convert to using %pOF instead of full_name
PM / Domains: Extend generic power domain debugfs
PM / Domains: Add time accounting to various genpd states
* pm-cpu:
PM / CPU: replace raw_notifier with atomic_notifier
* pm-avs:
PM / AVS: rockchip-io: add io selectors and supplies for RV1108
Pull timer fix from Thomas Gleixner:
"A single fix for a thinko in the raw timekeeper update which causes
clock MONOTONIC_RAW to run with erratically increased frequency"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time: Fix ktime_get_raw() incorrect base accumulation
Pull perf fixes from Thomas Gleixner:
- Prevent a potential inconistency in the perf user space access which
might lead to evading sanity checks.
- Prevent perf recording function trace entries twice
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/ftrace: Fix double traces of perf on ftrace:function
perf/core: Fix potential double-fetch bug
Instead of tracking wmem_queued and sk_mem_charge by incrementing
in the verdict SK_REDIRECT paths and decrementing in the tx work
path use skb_set_owner_w and sock_writeable helpers. This solves
a few issues with the current code. First, in SK_REDIRECT inc on
sk_wmem_queued and sk_mem_charge were being done without the peers
sock lock being held. Under stress this can result in accounting
errors when tx work and/or multiple verdict decisions are working
on the peer psock.
Additionally, this cleans up the code because we can rely on the
default destructor to decrement memory accounting on kfree_skb. Also
this will trigger sk_write_space when space becomes available on
kfree_skb() which wasn't happening before and prevent __sk_free
from being called until all in-flight packets are completed.
Fixes: 174a79ff95 ("bpf: sockmap with sk redirect support")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix handling of pinned BPF map nodes in hash of maps, from Daniel
Borkmann.
2) IPSEC ESP error paths leak memory, from Steffen Klassert.
3) We need an RCU grace period before freeing fib6_node objects, from
Wei Wang.
4) Must check skb_put_padto() return value in HSR driver, from FLorian
Fainelli.
5) Fix oops on PHY probe failure in ftgmac100 driver, from Andrew
Jeffery.
6) Fix infinite loop in UDP queue when using SO_PEEK_OFF, from Eric
Dumazet.
7) Use after free when tcf_chain_destroy() called multiple times, from
Jiri Pirko.
8) Fix KSZ DSA tag layer multiple free of SKBS, from Florian Fainelli.
9) Fix leak of uninitialized memory in sctp_get_sctp_info(),
inet_diag_msg_sctpladdrs_fill() and inet_diag_msg_sctpaddrs_fill().
From Stefano Brivio.
10) L2TP tunnel refcount fixes from Guillaume Nault.
11) Don't leak UDP secpath in udp_set_dev_scratch(), from Yossi
Kauperman.
12) Revert a PHY layer change wrt. handling of PHY_HALTED state in
phy_stop_machine(), it causes regressions for multiple people. From
Florian Fainelli.
13) When packets are sent out of br0 we have to clear the
offload_fwdq_mark value.
14) Several NULL pointer deref fixes in packet schedulers when their
->init() routine fails. From Nikolay Aleksandrov.
15) Aquantium devices cannot checksum offload correctly when the packet
is <= 60 bytes. From Pavel Belous.
16) Fix vnet header access past end of buffer in AF_PACKET, from
Benjamin Poirier.
17) Double free in probe error paths of nfp driver, from Dan Carpenter.
18) QOS capability not checked properly in DCB init paths of mlx5
driver, from Huy Nguyen.
19) Fix conflicts between firmware load failure and health_care timer in
mlx5, also from Huy Nguyen.
20) Fix dangling page pointer when DMA mapping errors occur in mlx5,
from Eran Ben ELisha.
21) ->ndo_setup_tc() in bnxt_en driver doesn't count rings properly,
from Michael Chan.
22) Missing MSIX vector free in bnxt_en, also from Michael Chan.
23) Refcount leak in xfrm layer when using sk_policy, from Lorenzo
Colitti.
24) Fix copy of uninitialized data in qlge driver, from Arnd Bergmann.
25) bpf_setsockopts() erroneously always returns -EINVAL even on
success. Fix from Yuchung Cheng.
26) tipc_rcv() needs to linearize the SKB before parsing the inner
headers, from Parthasarathy Bhuvaragan.
27) Fix deadlock between link status updates and link removal in netvsc
driver, from Stephen Hemminger.
28) Missed locking of page fragment handling in ESP output, from Steffen
Klassert.
29) Fix refcnt leak in ebpf congestion control code, from Sabrina
Dubroca.
30) sxgbe_probe_config_dt() doesn't check devm_kzalloc()'s return value,
from Christophe Jaillet.
31) Fix missing ipv6 rx_dst_cookie update when rx_dst is updated during
early demux, from Paolo Abeni.
32) Several info leaks in xfrm_user layer, from Mathias Krause.
33) Fix out of bounds read in cxgb4 driver, from Stefano Brivio.
34) Properly propagate obsolete state of route upwards in ipv6 so that
upper holders like xfrm can see it. From Xin Long.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (118 commits)
udp: fix secpath leak
bridge: switchdev: Clear forward mark when transmitting packet
mlxsw: spectrum: Forbid linking to devices that have uppers
wl1251: add a missing spin_lock_init()
Revert "net: phy: Correctly process PHY_HALTED in phy_stop_machine()"
net: dsa: bcm_sf2: Fix number of CFP entries for BCM7278
kcm: do not attach PF_KCM sockets to avoid deadlock
sch_tbf: fix two null pointer dereferences on init failure
sch_sfq: fix null pointer dereference on init failure
sch_netem: avoid null pointer deref on init failure
sch_fq_codel: avoid double free on init failure
sch_cbq: fix null pointer dereferences on init failure
sch_hfsc: fix null pointer deref and double free on init failure
sch_hhf: fix null pointer dereference on init failure
sch_multiq: fix double free on init failure
sch_htb: fix crash on init failure
net/mlx5e: Fix CQ moderation mode not set properly
net/mlx5e: Fix inline header size for small packets
net/mlx5: E-Switch, Unload the representors in the correct order
net/mlx5e: Properly resolve TC offloaded ipv6 vxlan tunnel source address
...
This patch writes 'node->ref = 1' only if node->ref is 0.
The number of lookups/s for a ~1M entries LRU map increased by
~30% (260097 to 343313).
Other writes on 'node->ref = 0' is not changed. In those cases, the
same cache line has to be changed anyway.
First column: Size of the LRU hash
Second column: Number of lookups/s
Before:
> echo "$((2**20+1)): $(./map_perf_test 1024 1 $((2**20+1)) 10000000 | awk '{print $3}')"
1048577: 260097
After:
> echo "$((2**20+1)): $(./map_perf_test 1024 1 $((2**20+1)) 10000000 | awk '{print $3}')"
1048577: 343313
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Inline the lru map lookup to save the cost in making calls to
bpf_map_lookup_elem() and htab_lru_map_lookup_elem().
Different LRU hash size is tested. The benefit diminishes when
the cache miss starts to dominate in the bigger LRU hash.
Considering the change is simple, it is still worth to optimize.
First column: Size of the LRU hash
Second column: Number of lookups/s
Before:
> for i in $(seq 9 20); do echo "$((2**i+1)): $(./map_perf_test 1024 1 $((2**i+1)) 10000000 | awk '{print $3}')"; done
513: 1132020
1025: 1056826
2049: 1007024
4097: 853298
8193: 742723
16385: 712600
32769: 688142
65537: 677028
131073: 619437
262145: 498770
524289: 316695
1048577: 260038
After:
> for i in $(seq 9 20); do echo "$((2**i+1)): $(./map_perf_test 1024 1 $((2**i+1)) 10000000 | awk '{print $3}')"; done
513: 1221851
1025: 1144695
2049: 1049902
4097: 884460
8193: 773731
16385: 729673
32769: 721989
65537: 715530
131073: 671665
262145: 516987
524289: 321125
1048577: 260048
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 7c05126793 ("mm, fork: make dup_mmap wait for mmap_sem for
write killable") made it possible to kill a forking task while it is
waiting to acquire its ->mmap_sem for write, in dup_mmap().
However, it was overlooked that this introduced an new error path before
the new mm_struct's ->uprobes_state.xol_area has been set to NULL after
being copied from the old mm_struct by the memcpy in dup_mm(). For a
task that has previously hit a uprobe tracepoint, this resulted in the
'struct xol_area' being freed multiple times if the task was killed at
just the right time while forking.
Fix it by setting ->uprobes_state.xol_area to NULL in mm_init() rather
than in uprobe_dup_mmap().
With CONFIG_UPROBE_EVENTS=y, the bug can be reproduced by the same C
program given by commit 2b7e8665b4 ("fork: fix incorrect fput of
->exe_file causing use-after-free"), provided that a uprobe tracepoint
has been set on the fork_thread() function. For example:
$ gcc reproducer.c -o reproducer -lpthread
$ nm reproducer | grep fork_thread
0000000000400719 t fork_thread
$ echo "p $PWD/reproducer:0x719" > /sys/kernel/debug/tracing/uprobe_events
$ echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable
$ ./reproducer
Here is the use-after-free reported by KASAN:
BUG: KASAN: use-after-free in uprobe_clear_state+0x1c4/0x200
Read of size 8 at addr ffff8800320a8b88 by task reproducer/198
CPU: 1 PID: 198 Comm: reproducer Not tainted 4.13.0-rc7-00015-g36fde05f3fb5 #255
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
Call Trace:
dump_stack+0xdb/0x185
print_address_description+0x7e/0x290
kasan_report+0x23b/0x350
__asan_report_load8_noabort+0x19/0x20
uprobe_clear_state+0x1c4/0x200
mmput+0xd6/0x360
do_exit+0x740/0x1670
do_group_exit+0x13f/0x380
get_signal+0x597/0x17d0
do_signal+0x99/0x1df0
exit_to_usermode_loop+0x166/0x1e0
syscall_return_slowpath+0x258/0x2c0
entry_SYSCALL_64_fastpath+0xbc/0xbe
...
Allocated by task 199:
save_stack_trace+0x1b/0x20
kasan_kmalloc+0xfc/0x180
kmem_cache_alloc_trace+0xf3/0x330
__create_xol_area+0x10f/0x780
uprobe_notify_resume+0x1674/0x2210
exit_to_usermode_loop+0x150/0x1e0
prepare_exit_to_usermode+0x14b/0x180
retint_user+0x8/0x20
Freed by task 199:
save_stack_trace+0x1b/0x20
kasan_slab_free+0xa8/0x1a0
kfree+0xba/0x210
uprobe_clear_state+0x151/0x200
mmput+0xd6/0x360
copy_process.part.8+0x605f/0x65d0
_do_fork+0x1a5/0xbd0
SyS_clone+0x19/0x20
do_syscall_64+0x22f/0x660
return_from_SYSCALL_64+0x0/0x7a
Note: without KASAN, you may instead see a "Bad page state" message, or
simply a general protection fault.
Link: http://lkml.kernel.org/r/20170830033303.17927-1-ebiggers3@gmail.com
Fixes: 7c05126793 ("mm, fork: make dup_mmap wait for mmap_sem for write killable")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org> [4.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the worker thread continues getting work, it will hog the cpu and rcu
stall complains. Make it a good citizen. This is triggered in a loop
block device test.
Link: http://lkml.kernel.org/r/5de0a179b3184e1a2183fc503448b0269f24d75b.1503697127.git.shli@fb.com
Signed-off-by: Shaohua Li <shli@fb.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When registering the rtc device to be used to handle alarm timers,
get_device is used to ensure the device doesn't go away but the module can
still be unloaded.
Call try_module_get to ensure the rtc driver will not go away.
Reported-and-tested-by: Michal Simek <monstr@monstr.eu>
Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Link: http://lkml.kernel.org/r/20170820220146.30969-1-alexandre.belloni@free-electrons.com
- irqchip-specific part of the monster GICv4 series
- new UniPhier AIDET irqchip driver
- new variants of some Freescale MSI widget
- blanket removal of of_node->full_name in printk
- random collection of fixes
-----BEGIN PGP SIGNATURE-----
iQJJBAABCAAzFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAlmoQxkVHG1hcmMuenlu
Z2llckBhcm0uY29tAAoJECPQ0LrRPXpDVUMQAIyE1q3fjSNZ+EkfK8+mbcWC80Wc
suklgcqVbHahu6FHuHALlR7rgJIPSaFYFpDIwybA9A0Pwia/5Jf2mOL3RGVF4f97
nyHlSS16kocZz8lKn+NtgcaUiFRma3y7GNek0pnsSlm+Vu+Syw3xssN+yYcGujTu
jWRocvIqIJlScpzHG/Ulx3tZTXYfipQFfIQ3+9gm/i+KYqTwGDH/MsdxI7uAbctx
YJGwLVtv4MGGmNHaq4iS64d55yrG/4Yqv+q92zFaaxj+V0di+Ds01+MDhdq8X7N/
fhLGY/Yh/I3FiIIdIO/O1sj1EPO6lLbg4DPYXIMdjzwhBdKhu8i66/ttH/Kx//Aa
1hhLZSN6rYiJM3lWcTxej45bs8MR/3MBm4gKpZxTgJ12YRIwgY8lRyoqXTlto5ls
w10yi5wFsJaAO1E/HdEs/dyndV1jpvGo9KIRnfh7E5+Hw7PCYs9kZa4MUtq9RYT8
Civyppi2sMfKYtGvwm+FS6sIigoFCh4DJ5MmUbM5CLh5imnggyYJlTsJdBuxVDZM
1RoDnX/YebpVceezIZ/oCKq60Utck0Oqge2pc+NjVQupAp/x/13R/7DQPnFCq/OL
Avx9kBtSzdYmYgE3EWt9n+h4LT23JpOym2OEUF3fhpPE96BKAJkMEPB/QlBi39fo
0cZEX8M7xq5KjRJy
=3ZS3
-----END PGP SIGNATURE-----
Merge tag 'irqchip-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core
Pull irqchip updates for 4.14 from Marc Zyngier:
- irqchip-specific part of the monster GICv4 series
- new UniPhier AIDET irqchip driver
- new variants of some Freescale MSI widget
- blanket removal of of_node->full_name in printk
- random collection of fixes
Pull cgroup fix from Tejun Heo:
"A late but obvious fix for cgroup.
I broke the 'cpuset.memory_pressure' file a long time ago (v4.4) by
accidentally deleting its file index, which made it a duplicate of the
'cpuset.memory_migrate' file. Spotted and fixed by Waiman"
* 'for-4.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cpuset: Fix incorrect memory_pressure control file mapping
All the locking related cmpxchg's in the following functions are
replaced with the _acquire variants:
- pv_queued_spin_steal_lock()
- trylock_clear_pending()
This change should help performance on architectures that use LL/SC.
The cmpxchg in pv_kick_node() is replaced with a relaxed version
with explicit memory barrier to make sure that it is fully ordered
in the writing of next->lock and the reading of pn->state whether
the cmpxchg is a success or failure without affecting performance in
non-LL/SC architectures.
On a 2-socket 12-core 96-thread Power8 system with pvqspinlock
explicitly enabled, the performance of a locking microbenchmark
with and without this patch on a 4.13-rc4 kernel with Xinhui's PPC
qspinlock patch were as follows:
# of thread w/o patch with patch % Change
----------- --------- ---------- --------
8 5054.8 Mop/s 5209.4 Mop/s +3.1%
16 3985.0 Mop/s 4015.0 Mop/s +0.8%
32 2378.2 Mop/s 2396.0 Mop/s +0.7%
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pan Xinhui <xinhui@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1502741222-24360-1-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
struct call_single_data is used in IPIs to transfer information between
CPUs. Its size is bigger than sizeof(unsigned long) and less than
cache line size. Currently it is not allocated with any explicit alignment
requirements. This makes it possible for allocated call_single_data to
cross two cache lines, which results in double the number of the cache lines
that need to be transferred among CPUs.
This can be fixed by requiring call_single_data to be aligned with the
size of call_single_data. Currently the size of call_single_data is the
power of 2. If we add new fields to call_single_data, we may need to
add padding to make sure the size of new definition is the power of 2
as well.
Fortunately, this is enforced by GCC, which will report bad sizes.
To set alignment requirements of call_single_data to the size of
call_single_data, a struct definition and a typedef is used.
To test the effect of the patch, I used the vm-scalability multiple
thread swap test case (swap-w-seq-mt). The test will create multiple
threads and each thread will eat memory until all RAM and part of swap
is used, so that huge number of IPIs are triggered when unmapping
memory. In the test, the throughput of memory writing improves ~5%
compared with misaligned call_single_data, because of faster IPIs.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Huang, Ying <ying.huang@intel.com>
[ Add call_single_data_t and align with size of call_single_data. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/87bmnqd6lz.fsf@yhuang-mobile.sh.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Where XHLOCK_{SOFT,HARD} are save/restore points in the xhlocks[] to
ensure the temporal IRQ events don't interact with task state, the
XHLOCK_PROC is a fundament different beast that just happens to share
the interface.
The purpose of XHLOCK_PROC is to annotate independent execution inside
one task. For example workqueues, each work should appear to run in its
own 'pristine' 'task'.
Remove XHLOCK_PROC in favour of its own interface to avoid confusion.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: david@fromorbit.com
Cc: johannes@sipsolutions.net
Cc: kernel-team@lge.com
Cc: oleg@redhat.com
Cc: tj@kernel.org
Link: http://lkml.kernel.org/r/20170829085939.ggmb6xiohw67micb@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For understanding how the workload maps to memory channels and hardware
behavior, it's very important to collect address maps with physical
addresses. For example, 3D XPoint access can only be found by filtering
the physical address.
Add a new sample type for physical address.
perf already has a facility to collect data virtual address. This patch
introduces a function to convert the virtual address to physical address.
The function is quite generic and can be extended to any architecture as
long as a virtual address is provided.
- For kernel direct mapping addresses, virt_to_phys is used to convert
the virtual addresses to physical address.
- For user virtual addresses, __get_user_pages_fast is used to walk the
pages tables for user physical address.
- This does not work for vmalloc addresses right now. These are not
resolved, but code to do that could be added.
The new sample type requires collecting the virtual address. The
virtual address will not be output unless SAMPLE_ADDR is applied.
For security, the physical address can only be exposed to root or
privileged user.
Tested-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Cc: mpe@ellerman.id.au
Link: http://lkml.kernel.org/r/1503967969-48278-1-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I just noticed that hw.itrace_started and hw.config are aliased to the
same location. Now, the PT driver happens to use both, which works out
fine by sheer luck:
- STORE(hw.itrace_start) is ordered before STORE(hw.config), in the
program order, although there are no compiler barriers to ensure that,
- to the perf_log_itrace_start() hw.itrace_start looks set at the same
time as when it is intended to be set because both stores happen in the
same path,
- hw.config is never reset to zero in the PT driver.
Now, the use of hw.config by the PT driver makes more sense (it being a
HW PMU) than messing around with itrace_started, which is an awkward API
to begin with.
This patch replaces hw.itrace_started with an attach_state bit and an
API call for the PMU drivers to use to communicate the condition.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20170330153956.25994-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When running perf on the ftrace:function tracepoint, there is a bug
which can be reproduced by:
perf record -e ftrace:function -a sleep 20 &
perf record -e ftrace:function ls
perf script
ls 10304 [005] 171.853235: ftrace:function:
perf_output_begin
ls 10304 [005] 171.853237: ftrace:function:
perf_output_begin
ls 10304 [005] 171.853239: ftrace:function:
task_tgid_nr_ns
ls 10304 [005] 171.853240: ftrace:function:
task_tgid_nr_ns
ls 10304 [005] 171.853242: ftrace:function:
__task_pid_nr_ns
ls 10304 [005] 171.853244: ftrace:function:
__task_pid_nr_ns
We can see that all the function traces are doubled.
The problem is caused by the inconsistency of the register
function perf_ftrace_event_register() with the probe function
perf_ftrace_function_call(). The former registers one probe
for every perf_event. And the latter handles all perf_events
on the current cpu. So when two perf_events on the current cpu,
the traces of them will be doubled.
So this patch adds an extra parameter "event" for perf_tp_event,
only send sample data to this event when it's not NULL.
Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: alexander.shishkin@linux.intel.com
Cc: huawei.libin@huawei.com
Link: http://lkml.kernel.org/r/1503668977-12526-1-git-send-email-zhouchengming1@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While examining the kernel source code, I found a dangerous operation that
could turn into a double-fetch situation (a race condition bug) where the same
userspace memory region are fetched twice into kernel with sanity checks after
the first fetch while missing checks after the second fetch.
1. The first fetch happens in line 9573 get_user(size, &uattr->size).
2. Subsequently the 'size' variable undergoes a few sanity checks and
transformations (line 9577 to 9584).
3. The second fetch happens in line 9610 copy_from_user(attr, uattr, size)
4. Given that 'uattr' can be fully controlled in userspace, an attacker can
race condition to override 'uattr->size' to arbitrary value (say, 0xFFFFFFFF)
after the first fetch but before the second fetch. The changed value will be
copied to 'attr->size'.
5. There is no further checks on 'attr->size' until the end of this function,
and once the function returns, we lose the context to verify that 'attr->size'
conforms to the sanity checks performed in step 2 (line 9577 to 9584).
6. My manual analysis shows that 'attr->size' is not used elsewhere later,
so, there is no working exploit against it right now. However, this could
easily turns to an exploitable one if careless developers start to use
'attr->size' later.
To fix this, override 'attr->size' from the second fetch to the one from the
first fetch, regardless of what is actually copied in.
In this way, it is assured that 'attr->size' is consistent with the checks
performed after the first fetch.
Signed-off-by: Meng Xu <mengxu.gatech@gmail.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: alexander.shishkin@linux.intel.com
Cc: meng.xu@gatech.edu
Cc: sanidhya@gatech.edu
Cc: taesoo@gatech.edu
Link: http://lkml.kernel.org/r/1503522470-35531-1-git-send-email-meng.xu@gatech.edu
Signed-off-by: Ingo Molnar <mingo@kernel.org>
"err" is set to zero if bpf_map_area_alloc() fails so it means we return
ERR_PTR(0) which is NULL. The caller, find_and_alloc_map(), is not
expecting NULL returns and will oops.
Fixes: 174a79ff95 ("bpf: sockmap with sk redirect support")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After userspace pushes sockets into a sockmap it may not be receiving
data (assuming stream_{parser|verdict} programs are attached). But, it
may still want to manage the socks. A common pattern is to poll/select
for a POLLRDHUP event so we can close the sock.
This patch adds the logic to wake up these listeners.
Also add TCP_SYN_SENT to the list of events to handle. We don't want
to break the connection just because we happen to be in this state.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When attaching a program to sockmap we need to check map type
is correct.
Fixes: 174a79ff95 ("bpf: sockmap with sk redirect support")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
References to psock must be done inside RCU critical section.
Fixes: 174a79ff95 ("bpf: sockmap with sk redirect support")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The addition of map_flags BPF_SOCKMAP_STRPARSER flags was to handle a
specific use case where we want to have BPF parse program disabled on
an entry in a sockmap.
However, Alexei found the API a bit cumbersome and I agreed. Lets
remove the STRPARSER flag and support the use case by allowing socks
to be in multiple maps. This allows users to create two maps one with
programs attached and one without. When socks are added to maps they
now inherit any programs attached to the map. This is a nice
generalization and IMO improves the API.
The API rules are less ambiguous and do not need a flag:
- When a sock is added to a sockmap we have two cases,
i. The sock map does not have any attached programs so
we can add sock to map without inheriting bpf programs.
The sock may exist in 0 or more other maps.
ii. The sock map has an attached BPF program. To avoid duplicate
bpf programs we only add the sock entry if it does not have
an existing strparser/verdict attached, returning -EBUSY if
a program is already attached. Otherwise attach the program
and inherit strparser/verdict programs from the sock map.
This allows for socks to be in a multiple maps for redirects and
inherit a BPF program from a single map.
Also this patch simplifies the logic around BPF_{EXIST|NOEXIST|ANY}
flags. In the original patch I tried to be extra clever and only
update map entries when necessary. Now I've decided the complexity
is not worth it. If users constantly update an entry with the same
sock for no reason (i.e. update an entry without actually changing
any parameters on map or sock) we still do an alloc/release. Using
this and allowing multiple entries of a sock to exist in a map the
logic becomes much simpler.
Note: Now that multiple maps are supported the "maps" pointer called
when a socket is closed becomes a list of maps to remove the sock from.
To keep the map up to date when a sock is added to the sockmap we must
add the map/elem in the list. Likewise when it is removed we must
remove it from the list. This results in searching the per psock list
on delete operation. On TCP_CLOSE events we walk the list and remove
the psock from all map/entry locations. I don't see any perf
implications in this because at most I have a psock in two maps. If
a psock were to be in many maps its possibly this might be noticeable
on delete but I can't think of a reason to dup a psock in many maps.
The sk_callback_lock is used to protect read/writes to the list. This
was convenient because in all locations we were taking the lock
anyways just after working on the list. Also the lock is per sock so
in normal cases we shouldn't see any contention.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Fixes: 174a79ff95 ("bpf: sockmap with sk redirect support")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the initial sockmap API we provided strparser and verdict programs
using a single attach command by extending the attach API with a the
attach_bpf_fd2 field.
However, if we add other programs in the future we will be adding a
field for every new possible type, attach_bpf_fd(3,4,..). This
seems a bit clumsy for an API. So lets push the programs using two
new type fields.
BPF_SK_SKB_STREAM_PARSER
BPF_SK_SKB_STREAM_VERDICT
This has the advantage of having a readable name and can easily be
extended in the future.
Updates to samples and sockmap included here also generalize tests
slightly to support upcoming patch for multiple map support.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Fixes: 174a79ff95 ("bpf: sockmap with sk redirect support")
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tim Chen and Kan Liang have been battling a customer load that shows
extremely long page wakeup lists. The cause seems to be constant NUMA
migration of a hot page that is shared across a lot of threads, but the
actual root cause for the exact behavior has not been found.
Tim has a patch that batches the wait list traversal at wakeup time, so
that we at least don't get long uninterruptible cases where we traverse
and wake up thousands of processes and get nasty latency spikes. That
is likely 4.14 material, but we're still discussing the page waitqueue
specific parts of it.
In the meantime, I've tried to look at making the page wait queues less
expensive, and failing miserably. If you have thousands of threads
waiting for the same page, it will be painful. We'll need to try to
figure out the NUMA balancing issue some day, in addition to avoiding
the excessive spinlock hold times.
That said, having tried to rewrite the page wait queues, I can at least
fix up some of the braindamage in the current situation. In particular:
(a) we don't want to continue walking the page wait list if the bit
we're waiting for already got set again (which seems to be one of
the patterns of the bad load). That makes no progress and just
causes pointless cache pollution chasing the pointers.
(b) we don't want to put the non-locking waiters always on the front of
the queue, and the locking waiters always on the back. Not only is
that unfair, it means that we wake up thousands of reading threads
that will just end up being blocked by the writer later anyway.
Also add a comment about the layout of 'struct wait_page_key' - there is
an external user of it in the cachefiles code that means that it has to
match the layout of 'struct wait_bit_key' in the two first members. It
so happens to match, because 'struct page *' and 'unsigned long *' end
up having the same values simply because the page flags are the first
member in struct page.
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Christopher Lameter <cl@linux.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull timer fix from Ingo Molnar:
"Fix a timer granularity handling race+bug, which would manifest itself
by spuriously increasing timeouts of some timers (from 1 jiffy to ~500
jiffies in the worst case measured) in certain nohz states"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers: Fix excessive granularity of new timers after a nohz idle