This patch randomizes high 32-bit of a definition when BPF_F_TEST_RND_HI32
is set.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
x86_64 and AArch64 perhaps are two arches that running bpf testsuite
frequently, however the zero extension insertion pass is not enabled for
them because of their hardware support.
It is critical to guarantee the pass correction as it is supposed to be
enabled at default for a couple of other arches, for example PowerPC,
SPARC, arm, NFP etc. Therefore, it would be very useful if there is a way
to test this pass on for example x86_64.
The test methodology employed by this set is "poisoning" useless bits. High
32-bit of a definition is randomized if it is identified as not used by any
later insn. Such randomization is only enabled under testing mode which is
gated by the new bpf prog load flags "BPF_F_TEST_RND_HI32".
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
After previous patches, verifier will mark a insn if it really needs zero
extension on dst_reg.
It is then for back-ends to decide how to use such information to eliminate
unnecessary zero extension code-gen during JIT compilation.
One approach is verifier insert explicit zero extension for those insns
that need zero extension in a generic way, JIT back-ends then do not
generate zero extension for sub-register write at default.
However, only those back-ends which do not have hardware zero extension
want this optimization. Back-ends like x86_64 and AArch64 have hardware
zero extension support that the insertion should be disabled.
This patch introduces new target hook "bpf_jit_needs_zext" which returns
false at default, meaning verifier zero extension insertion is disabled at
default. A back-end could override this hook to return true if it doesn't
have hardware support and want verifier insert zero extension explicitly.
Offload targets do not use this native target hook, instead, they could
get the optimization results using bpf_prog_offload_ops.finalize.
NOTE: arches could have diversified features, it is possible for one arch
to have hardware zero extension support for some sub-register write insns
but not for all. For example, PowerPC, SPARC have zero extended loads, but
not for alu32. So when verifier zero extension insertion enabled, these JIT
back-ends need to peephole insns to remove those zero extension inserted
for insn that actually has hardware zero extension support. The peephole
could be as simple as looking the next insn, if it is a special zero
extension insn then it is safe to eliminate it if the current insn has
hardware zero extension support.
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Patched insns do not go through generic verification, therefore doesn't has
zero extension information collected during insn walking.
We don't bother analyze them at the moment, for any sub-register def comes
from them, just conservatively mark it as needing zero extension.
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
eBPF ISA specification requires high 32-bit cleared when low 32-bit
sub-register is written. This applies to destination register of ALU32 etc.
JIT back-ends must guarantee this semantic when doing code-gen. x86_64 and
AArch64 ISA has the same semantics, so the corresponding JIT back-end
doesn't need to do extra work.
However, 32-bit arches (arm, x86, nfp etc.) and some other 64-bit arches
(PowerPC, SPARC etc) need to do explicit zero extension to meet this
requirement, otherwise code like the following will fail.
u64_value = (u64) u32_value
... other uses of u64_value
This is because compiler could exploit the semantic described above and
save those zero extensions for extending u32_value to u64_value, these JIT
back-ends are expected to guarantee this through inserting extra zero
extensions which however could be a significant increase on the code size.
Some benchmarks show there could be ~40% sub-register writes out of total
insns, meaning at least ~40% extra code-gen.
One observation is these extra zero extensions are not always necessary.
Take above code snippet for example, it is possible u32_value will never be
casted into a u64, the value of high 32-bit of u32_value then could be
ignored and extra zero extension could be eliminated.
This patch implements this idea, insns defining sub-registers will be
marked when the high 32-bit of the defined sub-register matters. For
those unmarked insns, it is safe to eliminate high 32-bit clearnace for
them.
Algo:
- Split read flags into READ32 and READ64.
- Record index of insn that does sub-register write. Keep the index inside
reg state and update it during verifier insn walking.
- A full register read on a sub-register marks its definition insn as
needing zero extension on dst register.
A new sub-register write overrides the old one.
- When propagating read64 during path pruning, also mark any insn defining
a sub-register that is read in the pruned path as full-register.
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch tries to solve the following specific use case.
Currently, bpf program can already collect stack traces
through kernel function get_perf_callchain()
when certain events happens (e.g., cache miss counter or
cpu clock counter overflows). But such stack traces are
not enough for jitted programs, e.g., hhvm (jited php).
To get real stack trace, jit engine internal data structures
need to be traversed in order to get the real user functions.
bpf program itself may not be the best place to traverse
the jit engine as the traversing logic could be complex and
it is not a stable interface either.
Instead, hhvm implements a signal handler,
e.g. for SIGALARM, and a set of program locations which
it can dump stack traces. When it receives a signal, it will
dump the stack in next such program location.
Such a mechanism can be implemented in the following way:
. a perf ring buffer is created between bpf program
and tracing app.
. once a particular event happens, bpf program writes
to the ring buffer and the tracing app gets notified.
. the tracing app sends a signal SIGALARM to the hhvm.
But this method could have large delays and causing profiling
results skewed.
This patch implements bpf_send_signal() helper to send
a signal to hhvm in real time, resulting in intended stack traces.
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
All prune points inside a callee bpf function most likely will have
different callsites. For example, if function foo() is called from
two callsites the half of explored states in all prune points in foo()
will be useless for subsequent walking of one of those callsites.
Fortunately explored_states pruning heuristics keeps the number of states
per prune point small, but walking these states is still a waste of cpu
time when the callsite of the current state is different from the callsite
of the explored state.
To improve pruning logic convert explored_states into hash table and
use simple insn_idx ^ callsite hash to select hash bucket.
This optimization has no effect on programs without bpf2bpf calls
and drastically improves programs with calls.
In the later case it reduces total memory consumption in 1M scale tests
by almost 3 times (peak_states drops from 5752 to 2016).
Care should be taken when comparing the states for equivalency.
Since the same hash bucket can now contain states with different indices
the insn_idx has to be part of verifier_state and compared.
Different hash table sizes and different hash functions were explored,
but the results were not significantly better vs this patch.
They can be improved in the future.
Hit/miss heuristic is not counting index miscompare as a miss.
Otherwise verifier stats become unstable when experimenting
with different hash functions.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
split explored_states into prune_point boolean mark
and link list of explored states.
This removes STATE_LIST_MARK hack and allows marks to be separate from states.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
clean up explored_states to prep for introduction of hashtable
No functional changes.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The limit of 1024 subsequent jumps was causing otherwise valid
programs to be rejected. Bump it to 8192 and make the error more verbose.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Here are series of patches that add SPDX tags to different kernel files,
based on two different things:
- SPDX entries are added to a bunch of files that we missed a year ago
that do not have any license information at all.
These were either missed because the tool saw the MODULE_LICENSE()
tag, or some EXPORT_SYMBOL tags, and got confused and thought the
file had a real license, or the files have been added since the last
big sweep, or they were Makefile/Kconfig files, which we didn't
touch last time.
- Add GPL-2.0-only or GPL-2.0-or-later tags to files where our scan
tools can determine the license text in the file itself. Where this
happens, the license text is removed, in order to cut down on the
700+ different ways we have in the kernel today, in a quest to get
rid of all of these.
These patches have been out for review on the linux-spdx@vger mailing
list, and while they were created by automatic tools, they were
hand-verified by a bunch of different people, all whom names are on the
patches are reviewers.
The reason for these "large" patches is if we were to continue to
progress at the current rate of change in the kernel, adding license
tags to individual files in different subsystems, we would be finished
in about 10 years at the earliest.
There will be more series of these types of patches coming over the next
few weeks as the tools and reviewers crunch through the more "odd"
variants of how to say "GPLv2" that developers have come up with over
the years, combined with other fun oddities (GPL + a BSD disclaimer?)
that are being unearthed, with the goal for the whole kernel to be
cleaned up.
These diffstats are not small, 3840 files are touched, over 10k lines
removed in just 24 patches.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXOP8uw8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ynmGQCgy3evqzleuOITDpuWaxewFdHqiJYAnA7KRw4H
1KwtfRnMtG6dk/XaS7H7
=O9lH
-----END PGP SIGNATURE-----
Merge tag 'spdx-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull SPDX update from Greg KH:
"Here is a series of patches that add SPDX tags to different kernel
files, based on two different things:
- SPDX entries are added to a bunch of files that we missed a year
ago that do not have any license information at all.
These were either missed because the tool saw the MODULE_LICENSE()
tag, or some EXPORT_SYMBOL tags, and got confused and thought the
file had a real license, or the files have been added since the
last big sweep, or they were Makefile/Kconfig files, which we
didn't touch last time.
- Add GPL-2.0-only or GPL-2.0-or-later tags to files where our scan
tools can determine the license text in the file itself. Where this
happens, the license text is removed, in order to cut down on the
700+ different ways we have in the kernel today, in a quest to get
rid of all of these.
These patches have been out for review on the linux-spdx@vger mailing
list, and while they were created by automatic tools, they were
hand-verified by a bunch of different people, all whom names are on
the patches are reviewers.
The reason for these "large" patches is if we were to continue to
progress at the current rate of change in the kernel, adding license
tags to individual files in different subsystems, we would be finished
in about 10 years at the earliest.
There will be more series of these types of patches coming over the
next few weeks as the tools and reviewers crunch through the more
"odd" variants of how to say "GPLv2" that developers have come up with
over the years, combined with other fun oddities (GPL + a BSD
disclaimer?) that are being unearthed, with the goal for the whole
kernel to be cleaned up.
These diffstats are not small, 3840 files are touched, over 10k lines
removed in just 24 patches"
* tag 'spdx-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (24 commits)
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 25
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 24
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 23
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 22
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 21
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 20
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 19
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 18
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 17
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 15
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 14
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 13
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 12
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 11
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 10
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 9
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 7
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 5
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 4
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 3
...
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version this program is distributed in the
hope that it would be useful but without any warranty without even
the implied warranty of merchantability or fitness for a particular
purpose see the gnu general public license for more details
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 6 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Jilayne Lovejoy <opensource@jilayne.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190519154043.007767574@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 2 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version this program is distributed in the
hope that it will be useful but without any warranty without even
the implied warranty of merchantability or fitness for a particular
purpose see the gnu general public license for more details you
should have received a copy of the gnu general public license along
with this program if not see http www gnu org licenses
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version this program is distributed in the
hope that it will be useful but without any warranty without even
the implied warranty of merchantability or fitness for a particular
purpose see the gnu general public license for more details [based]
[from] [clk] [highbank] [c] you should have received a copy of the
gnu general public license along with this program if not see http
www gnu org licenses
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 355 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Jilayne Lovejoy <opensource@jilayne.com>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190519154041.837383322@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version this program is distributed in the
hope that it will be useful but without any warranty without even
the implied warranty of merchantability or fitness for a particular
purpose see the gnu general public license for more details you
should have received a copy of the gnu general public license along
with this program if not you can access it online at http www gnu
org licenses gpl 2 0 html
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 1 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Jilayne Lovejoy <opensource@jilayne.com>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190519154041.430943677@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add SPDX license identifiers to all Make/Kconfig files which:
- Have no license information of any form
These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:
GPL-2.0-only
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add SPDX license identifiers to all files which:
- Have no license information of any form
- Have EXPORT_.*_SYMBOL_GPL inside which was used in the
initial scan/conversion to ignore the file
These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:
GPL-2.0-only
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pull networking fixes from David Miller:1) Use after free in __dev_map_entry_free(), from Eric Dumazet.
1) Use after free in __dev_map_entry_free(), from Eric Dumazet.
2) Fix TCP retransmission timestamps on passive Fast Open, from Yuchung
Cheng.
3) Orphan NFC, we'll take the patches directly into my tree. From
Johannes Berg.
4) We can't recycle cloned TCP skbs, from Eric Dumazet.
5) Some flow dissector bpf test fixes, from Stanislav Fomichev.
6) Fix RCU marking and warnings in rhashtable, from Herbert Xu.
7) Fix some potential fib6 leaks, from Eric Dumazet.
8) Fix a _decode_session4 uninitialized memory read bug fix that got
lost in a merge. From Florian Westphal.
9) Fix ipv6 source address routing wrt. exception route entries, from
Wei Wang.
10) The netdev_xmit_more() conversion was not done %100 properly in mlx5
driver, fix from Tariq Toukan.
11) Clean up botched merge on netfilter kselftest, from Florian
Westphal.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (74 commits)
of_net: fix of_get_mac_address retval if compiled without CONFIG_OF
net: fix kernel-doc warnings for socket.c
net: Treat sock->sk_drops as an unsigned int when printing
kselftests: netfilter: fix leftover net/net-next merge conflict
mlxsw: core: Prevent reading unsupported slave address from SFP EEPROM
mlxsw: core: Prevent QSFP module initialization for old hardware
vsock/virtio: Initialize core virtio vsock before registering the driver
net/mlx5e: Fix possible modify header actions memory leak
net/mlx5e: Fix no rewrite fields with the same match
net/mlx5e: Additional check for flow destination comparison
net/mlx5e: Add missing ethtool driver info for representors
net/mlx5e: Fix number of vports for ingress ACL configuration
net/mlx5e: Fix ethtool rxfh commands when CONFIG_MLX5_EN_RXNFC is disabled
net/mlx5e: Fix wrong xmit_more application
net/mlx5: Fix peer pf disable hca command
net/mlx5: E-Switch, Correct type to u16 for vport_num and int for vport_index
net/mlx5: Add meaningful return codes to status_to_err function
net/mlx5: Imply MLXFW in mlx5_core
Revert "tipc: fix modprobe tipc failed after switch order of device registration"
vsock/virtio: free packets during the socket release
...
Merge yet more updates from Andrew Morton:
"A few final bits:
- large changes to vmalloc, yielding large performance benefits
- tweak the console-flush-on-panic code
- a few fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
panic: add an option to replay all the printk message in buffer
initramfs: don't free a non-existent initrd
fs/writeback.c: use rcu_barrier() to wait for inflight wb switches going into workqueue when umount
mm/compaction.c: correct zone boundary handling when isolating pages from a pageblock
mm/vmap: add DEBUG_AUGMENT_LOWEST_MATCH_CHECK macro
mm/vmap: add DEBUG_AUGMENT_PROPAGATE_CHECK macro
mm/vmalloc.c: keep track of free blocks for vmap allocation
Pull IRQ chip updates from Ingo Molnar:
"A late irqchips update:
- New TI INTR/INTA set of drivers
- Rewrite of the stm32mp1-exti driver as a platform driver
- Update the IOMMU MSI mapping API to be RT friendly
- A number of cleanups and other low impact fixes"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
iommu/dma-iommu: Remove iommu_dma_map_msi_msg()
irqchip/gic-v3-mbi: Don't map the MSI page in mbi_compose_m{b, s}i_msg()
irqchip/ls-scfg-msi: Don't map the MSI page in ls_scfg_msi_compose_msg()
irqchip/gic-v3-its: Don't map the MSI page in its_irq_compose_msi_msg()
irqchip/gicv2m: Don't map the MSI page in gicv2m_compose_msi_msg()
iommu/dma-iommu: Split iommu_dma_map_msi_msg() in two parts
genirq/msi: Add a new field in msi_desc to store an IOMMU cookie
arm64: arch_k3: Enable interrupt controller drivers
irqchip/ti-sci-inta: Add msi domain support
soc: ti: Add MSI domain bus support for Interrupt Aggregator
irqchip/ti-sci-inta: Add support for Interrupt Aggregator driver
dt-bindings: irqchip: Introduce TISCI Interrupt Aggregator bindings
irqchip/ti-sci-intr: Add support for Interrupt Router driver
dt-bindings: irqchip: Introduce TISCI Interrupt router bindings
gpio: thunderx: Use the default parent apis for {request,release}_resources
genirq: Introduce irq_chip_{request,release}_resource_parent() apis
firmware: ti_sci: Add helper apis to manage resources
firmware: ti_sci: Add RM mapping table for am654
firmware: ti_sci: Add support for IRQ management
firmware: ti_sci: Add support for RM core ops
...
Currently on panic, kernel will lower the loglevel and print out pending
printk msg only with console_flush_on_panic().
Add an option for users to configure the "panic_print" to replay all
dmesg in buffer, some of which they may have never seen due to the
loglevel setting, which will help panic debugging .
[feng.tang@intel.com: keep the original console_flush_on_panic() inside panic()]
Link: http://lkml.kernel.org/r/1556199137-14163-1-git-send-email-feng.tang@intel.com
[feng.tang@intel.com: use logbuf lock to protect the console log index]
Link: http://lkml.kernel.org/r/1556269868-22654-1-git-send-email-feng.tang@intel.com
Link: http://lkml.kernel.org/r/1556095872-36838-1-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Aaro Koskinen <aaro.koskinen@nokia.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull cgroup fix from Tejun Heo:
"The cgroup2 freezer pulled in this cycle broke strace. This pull
request includes a workaround for the problem.
It's not a complete fix in that it may cause spurious frozen state
flip-flops which is fairly minor. Will push a full fix once it's
ready"
* 'for-5.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
signal: unconditionally leave the frozen state in ptrace_stop()
For iptable module to load a bpf program from a pinned location, it
only retrieve a loaded program and cannot change the program content so
requiring a write permission for it might not be necessary.
Also when adding or removing an unrelated iptable rule, it might need to
flush and reload the xt_bpf related rules as well and triggers the inode
permission check. It might be better to remove the write premission
check for the inode so we won't need to grant write access to all the
processes that flush and restore iptables rules.
Signed-off-by: Chenbo Feng <fengc@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Pull time fixes from Ingo Molnar:
"A TIA adjtimex interface extension, and a POSIX compliance ABI fix for
timespec64 users"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
ntp: Allow TAI-UTC offset to be set to zero
y2038: Make CONFIG_64BIT_TIME unconditional
Pull locking fix from Ingo Molnar:
"A single rwsem fix"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rwsem: Prevent decrement of reader count before increment
Alex Xu reported a regression in strace, caused by the introduction of
the cgroup v2 freezer. The regression can be reproduced by stracing
the following simple program:
#include <unistd.h>
int main() {
write(1, "a", 1);
return 0;
}
An attempt to run strace ./a.out leads to the infinite loop:
[ pre-main omitted ]
write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
[ repeats forever ]
The problem occurs because the traced task leaves ptrace_stop()
(and the signal handling loop) with the frozen bit set. So let's
call cgroup_leave_frozen(true) unconditionally after sleeping
in ptrace_stop().
With this patch applied, strace works as expected:
[ pre-main omitted ]
write(1, "a", 1) = 1
exit_group(0) = ?
+++ exited with 0 +++
Reported-by: Alex Xu <alex_y_xu@yahoo.ca>
Fixes: 76f969e894 ("cgroup: cgroup v2 freezer")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf 2019-05-16
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix a use after free in __dev_map_entry_free(), from Eric.
2) Several sockmap related bug fixes: a splat in strparser if
it was never initialized, remove duplicate ingress msg list
purging which can race, fix msg->sg.size accounting upon
skb to msg conversion, and last but not least fix a timeout
bug in tcp_bpf_wait_data(), from John.
3) Fix LRU map to avoid messing with eviction heuristics upon
syscall lookup, e.g. map walks from user space side will
then lead to eviction of just recently created entries on
updates as it would mark all map entries, from Daniel.
4) Don't bail out when libbpf feature probing fails. Also
various smaller fixes to flow_dissector test, from Stanislav.
5) Fix missing brackets for BTF_INT_OFFSET() in UAPI, from Gary.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
- Removing of non-DYNAMIC_FTRACE from 32bit x86
- Removing of mcount support from x86
- Emulating a call from int3 on x86_64, fixes live kernel patching
- Consolidated Tracing Error logs file
Minor updates:
- Removal of klp_check_compiler_support()
- kdb ftrace dumping output changes
- Accessing and creating ftrace instances from inside the kernel
- Clean up of #define if macro
- Introduction of TRACE_EVENT_NOP() to disable trace events based on config
options
And other minor fixes and clean ups
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXNxMZxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qq4PAP44kP6VbwL8CHyI2A3xuJ6Hwxd+2Z2r
ip66RtzyJ+2iCgEA2QCuWUlEt2bLpF9a8IQ4N9tWenSeW2i7gunPb+tioQw=
=RVQo
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"The major changes in this tracing update includes:
- Removal of non-DYNAMIC_FTRACE from 32bit x86
- Removal of mcount support from x86
- Emulating a call from int3 on x86_64, fixes live kernel patching
- Consolidated Tracing Error logs file
Minor updates:
- Removal of klp_check_compiler_support()
- kdb ftrace dumping output changes
- Accessing and creating ftrace instances from inside the kernel
- Clean up of #define if macro
- Introduction of TRACE_EVENT_NOP() to disable trace events based on
config options
And other minor fixes and clean ups"
* tag 'trace-v5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (44 commits)
x86: Hide the int3_emulate_call/jmp functions from UML
livepatch: Remove klp_check_compiler_support()
ftrace/x86: Remove mcount support
ftrace/x86_32: Remove support for non DYNAMIC_FTRACE
tracing: Simplify "if" macro code
tracing: Fix documentation about disabling options using trace_options
tracing: Replace kzalloc with kcalloc
tracing: Fix partial reading of trace event's id file
tracing: Allow RCU to run between postponed startup tests
tracing: Fix white space issues in parse_pred() function
tracing: Eliminate const char[] auto variables
ring-buffer: Fix mispelling of Calculate
tracing: probeevent: Fix to make the type of $comm string
tracing: probeevent: Do not accumulate on ret variable
tracing: uprobes: Re-enable $comm support for uprobe events
ftrace/x86_64: Emulate call function while updating in breakpoint handler
x86_64: Allow breakpoints to emulate call instructions
x86_64: Add gap to int3 to allow for call emulation
tracing: kdb: Allow ftdump to skip all but the last few entries
tracing: Add trace_total_entries() / trace_total_entries_cpu()
...
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
This patch aims to suppress 3 missing-break-in-switch false positives
on some architectures.
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Gustavo A. R. Silva <gustavo@embeddedor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge more updates from Andrew Morton:
- a couple of hotfixes
- almost all of the rest of MM
- lib/ updates
- binfmt_elf updates
- autofs updates
- quite a lot of misc fixes and updates
- reiserfs, fatfs
- signals
- exec
- cpumask
- rapidio
- sysctl
- pids
- eventfd
- gcov
- panic
- pps
- gdb script updates
- ipc updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (126 commits)
mm: memcontrol: fix NUMA round-robin reclaim at intermediate level
mm: memcontrol: fix recursive statistics correctness & scalabilty
mm: memcontrol: move stat/event counting functions out-of-line
mm: memcontrol: make cgroup stats and events query API explicitly local
drivers/virt/fsl_hypervisor.c: prevent integer overflow in ioctl
drivers/virt/fsl_hypervisor.c: dereferencing error pointers in ioctl
mm, memcg: rename ambiguously named memory.stat counters and functions
arch: remove <asm/sizes.h> and <asm-generic/sizes.h>
treewide: replace #include <asm/sizes.h> with #include <linux/sizes.h>
fs/block_dev.c: Remove duplicate header
fs/cachefiles/namei.c: remove duplicate header
include/linux/sched/signal.h: replace `tsk' with `task'
fs/coda/psdev.c: remove duplicate header
ipc: do cyclic id allocation for the ipc object.
ipc: conserve sequence numbers in ipcmni_extend mode
ipc: allow boot time extension of IPCMNI from 32k to 16M
ipc/mqueue: optimize msg_get()
ipc/mqueue: remove redundant wq task assignment
ipc: prevent lockup on alloc_msg and free_msg
scripts/gdb: print cached rate in lx-clk-summary
...
Allow specifying reboot_mode for panic only. This is needed on systems
where ramoops is used to store panic logs, and user wants to use warm
reset to preserve those, while still having cold reset on normal
reboots.
Link: http://lkml.kernel.org/r/20190322004735.27702-1-aaro.koskinen@iki.fi
Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When kernel panic happens, it will first print the panic call stack,
then the ending msg like:
[ 35.743249] ---[ end Kernel panic - not syncing: Fatal exception
[ 35.749975] ------------[ cut here ]------------
The above message are very useful for debugging.
But if system is configured to not reboot on panic, say the
"panic_timeout" parameter equals 0, it will likely print out many noisy
message like WARN() call stack for each and every CPU except the panic
one, messages like below:
WARNING: CPU: 1 PID: 280 at kernel/sched/core.c:1198 set_task_cpu+0x183/0x190
Call Trace:
<IRQ>
try_to_wake_up
default_wake_function
autoremove_wake_function
__wake_up_common
__wake_up_common_lock
__wake_up
wake_up_klogd_work_func
irq_work_run_list
irq_work_tick
update_process_times
tick_sched_timer
__hrtimer_run_queues
hrtimer_interrupt
smp_apic_timer_interrupt
apic_timer_interrupt
For people working in console mode, the screen will first show the panic
call stack, but immediately overridden by these noisy extra messages,
which makes debugging much more difficult, as the original context gets
lost on screen.
Also these noisy messages will confuse some users, as I have seen many bug
reporters posted the noisy message into bugzilla, instead of the real
panic call stack and context.
Adding a flag "suppress_printk" which gets set in panic() to avoid those
noisy messages, without changing current kernel behavior that both panic
blinking and sysrq magic key can work as is, suggested by Petr Mladek.
To verify this, make sure kernel is not configured to reboot on panic and
in console
# echo c > /proc/sysrq-trigger
to see if console only prints out the panic call stack.
Link: http://lkml.kernel.org/r/1551430186-24169-1-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Suggested-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Sasha Levin <sashal@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
LLVM uses profiling data that's deliberately similar to GCC, but has a
very different way of exporting that data. LLVM calls llvm_gcov_init()
once per module, and provides a couple of callbacks that we can use to
ask for more data.
We care about the "writeout" callback, which in turn calls back into
compiler-rt/this module to dump all the gathered coverage data to disk:
llvm_gcda_start_file()
llvm_gcda_emit_function()
llvm_gcda_emit_arcs()
llvm_gcda_emit_function()
llvm_gcda_emit_arcs()
[... repeats for each function ...]
llvm_gcda_summary_info()
llvm_gcda_end_file()
This design is much more stateless and unstructured than gcc's, and is
intended to run at process exit. This forces us to keep some local
state about which module we're dealing with at the moment. On the other
hand, it also means we don't depend as much on how LLVM represents
profiling data internally.
See LLVM's lib/Transforms/Instrumentation/GCOVProfiling.cpp for more
details on how this works, particularly GCOVProfiler::emitProfileArcs(),
GCOVProfiler::insertCounterWriteout(), and GCOVProfiler::insertFlush().
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20190417225328.208129-1-trong@android.com
Signed-off-by: Greg Hackmann <ghackmann@android.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Tri Vo <trong@android.com>
Co-developed-by: Nick Desaulniers <ndesaulniers@google.com>
Co-developed-by: Tri Vo <trong@android.com>
Tested-by: Trilok Soni <tsoni@quicinc.com>
Tested-by: Prasad Sodagudi <psodagud@quicinc.com>
Tested-by: Tri Vo <trong@android.com>
Tested-by: Daniel Mentz <danielmentz@google.com>
Tested-by: Petri Gynther <pgynther@google.com>
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "gcov: add Clang support", v4.
This patch (of 3):
base.c contains a few callbacks specific to GCC's gcov implementation.
Move these into their own module in preparation for Clang support.
Link: http://lkml.kernel.org/r/20190318025411.98014-2-trong@android.com
Signed-off-by: Greg Hackmann <ghackmann@android.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Tri Vo <trong@android.com>
Tested-by: Trilok Soni <tsoni@quicinc.com>
Tested-by: Prasad Sodagudi <psodagud@quicinc.com>
Tested-by: Tri Vo <trong@android.com>
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Cc: Daniel Mentz <danielmentz@google.com>
Cc: Petri Gynther <pgynther@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hash functions are not needed since idr is used now. Let's remove hash
header file for cleanup.
Link: http://lkml.kernel.org/r/20190430053319.95913-1-scuttimmy@gmail.com
Signed-off-by: Timmy Li <scuttimmy@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: KJ Tsanaktsidis <ktsanaktsidis@zendesk.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Today, proc_do_large_bitmap() truncates a large write input buffer to
PAGE_SIZE - 1, which may result in misparsed numbers at the (truncated)
end of the buffer. Further, it fails to notify the caller that the
buffer was truncated, so it doesn't get called iteratively to finish the
entire input buffer.
Tell the caller if there's more work to do by adding the skipped amount
back to left/*lenp before returning.
To fix the misparsing, reset the position if we have completely consumed
a truncated buffer (or if just one char is left, which may be a "-" in a
range), and ask the caller to come back for more.
Link: http://lkml.kernel.org/r/20190320222831.8243-7-mcgrof@kernel.org
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently when userspace gives us a values that overflow e.g. file-max
and other callers of __do_proc_doulongvec_minmax() we simply ignore the
new value and leave the current value untouched.
This can be problematic as it gives the illusion that the limit has
indeed be bumped when in fact it failed. This commit makes sure to
return EINVAL when an overflow is detected. Please note that this is a
userspace facing change.
Link: http://lkml.kernel.org/r/20190210203943.8227-4-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Switch to bitmap_zalloc() to show clearly what we are allocating.
Besides that it returns pointer of bitmap type instead of opaque void *.
Link: http://lkml.kernel.org/r/20190304094037.57756-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a plan to build the kernel with -Wimplicit-fallthrough and this
place in the code produced a warning (W=1).
This commit remove the following warning:
kernel/signal.c:795:13: warning: this statement may fall through [-Wimplicit-fallthrough=]
Link: http://lkml.kernel.org/r/20190114203505.17875-1-malat@debian.org
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Acked-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The out_unlock label is misleading; no unlocking happens after it, so
just return NULL directly.
Also, nothing between the kmem_cache_zalloc() that creates new and the
two key_put() can initialize new->uid_keyring or new->session_keyring,
so those calls are no-ops.
Link: http://lkml.kernel.org/r/20190424200404.9114-1-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The name clear_all_latency_tracing is misleading, in fact which only
clear per task's latency_record[], and we do have another function named
clear_global_latency_tracing which clear the global latency_record[]
buffer.
Link: http://lkml.kernel.org/r/20190226114602.16902-1-linf@wangsu.com
Signed-off-by: Lin Feng <linf@wangsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1. In latencytop source codes, we only have such calling chain:
account_scheduler_latency(struct task_struct *task, int usecs, int inter)
{
if (unlikely(latencytop_enabled)) /* the outtermost check */
__account_scheduler_latency(task, usecs, inter);
}
__account_scheduler_latency
account_global_scheduler_latency
if (!latencytop_enabled)
So, the inner check for latencytop_enabled is not necessary at all.
2. In clear_all_latency_tracing and now is called
clear_tsk_latency_tracing the check for latencytop_enabled is redundant
and buggy to some extent.
We have no reason to refuse clearing the /proc/$pid/latency if
latencytop_enabled is set to 0, considering that if we use latencytop
manually by echo 0 > /proc/sys/kernel/latencytop, then we want to clear
/proc/$pid/latency and failed.
Also we don't have such check in brother function
clear_global_latency_tracing.
Notes: These changes are only visible to users who set
CONFIG_LATENCYTOP and won't change user tool latencytop's behavior.
Link: http://lkml.kernel.org/r/20190226114602.16902-2-linf@wangsu.com
Signed-off-by: Lin Feng <linf@wangsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
By design notifiers can be registerd once only, 2nd register attempt
called by mistake silently corrupts notifiers list.
A few years ago I investigated described problem, the host was power
cycled because of notifier list corruption. I've prepared this patch
and applied it to the OpenVZ kernel and sent this patch but nobody
commented on it. Later it helped us to detect a similar problem in the
OpenVz kernel.
Mistakes with notifier registration can happen for example during
subsystem initialization from different namespaces, or because of a lost
unregister in the roll-back path on initialization failures.
The proposed check cannot prevent the described problem, however it
allows us to detect its reason quickly without coredump analysis.
Link: http://lkml.kernel.org/r/04127e71-4782-9bbb-fe5a-7c01e93a99b0@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pressure metrics are already recorded and exposed in procfs for the
entire system, but any tool which monitors cgroup pressure has to
special case the root cgroup to read from procfs. This patch exposes
the already recorded pressure metrics on the root cgroup.
Link: http://lkml.kernel.org/r/20190510174938.3361741-1-dschatzberg@fb.com
Signed-off-by: Dan Schatzberg <dschatzberg@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Psi monitor aims to provide a low-latency short-term pressure detection
mechanism configurable by users. It allows users to monitor psi metrics
growth and trigger events whenever a metric raises above user-defined
threshold within user-defined time window.
Time window and threshold are both expressed in usecs. Multiple psi
resources with different thresholds and window sizes can be monitored
concurrently.
Psi monitors activate when system enters stall state for the monitored
psi metric and deactivate upon exit from the stall state. While system
is in the stall state psi signal growth is monitored at a rate of 10
times per tracking window. Min window size is 500ms, therefore the min
monitoring interval is 50ms. Max window size is 10s with monitoring
interval of 1s.
When activated psi monitor stays active for at least the duration of one
tracking window to avoid repeated activations/deactivations when psi
signal is bouncing.
Notifications to the users are rate-limited to one per tracking window.
Link: http://lkml.kernel.org/r/20190319235619.260832-8-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kthread.h can't be included in psi_types.h because it creates a circular
inclusion with kthread.h eventually including psi_types.h and
complaining on kthread structures not being defined because they are
defined further in the kthread.h. Resolve this by removing psi_types.h
inclusion from the headers included from kthread.h.
Link: http://lkml.kernel.org/r/20190319235619.260832-7-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce changed_states parameter into collect_percpu_times to track
the states changed since the last update.
This will be needed to detect whether polled states activated in the
monitor patch.
Link: http://lkml.kernel.org/r/20190319235619.260832-6-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename psi_group structure member fields used for calculating psi totals
and averages for clear distinction between them and for trigger-related
fields that will be added by "psi: introduce psi monitor".
[surenb@google.com: v6]
Link: http://lkml.kernel.org/r/20190319235619.260832-4-surenb@google.com
Link: http://lkml.kernel.org/r/20190124211518.244221-5-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
psi_enable is not used outside of psi.c, make it static.
Link: http://lkml.kernel.org/r/20190319235619.260832-3-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "psi: pressure stall monitors", v6.
This is a respin of:
https://lwn.net/ml/linux-kernel/20190308184311.144521-1-surenb%40google.com/
Android is adopting psi to detect and remedy memory pressure that
results in stuttering and decreased responsiveness on mobile devices.
Psi gives us the stall information, but because we're dealing with
latencies in the millisecond range, periodically reading the pressure
files to detect stalls in a timely fashion is not feasible. Psi also
doesn't aggregate its averages at a high-enough frequency right now.
This patch series extends the psi interface such that users can
configure sensitive latency thresholds and use poll() and friends to be
notified when these are breached.
As high-frequency aggregation is costly, it implements an aggregation
method that is optimized for fast, short-interval averaging, and makes
the aggregation frequency adaptive, such that high-frequency updates
only happen while monitored stall events are actively occurring.
With these patches applied, Android can monitor for, and ward off,
mounting memory shortages before they cause problems for the user. For
example, using memory stall monitors in userspace low memory killer
daemon (lmkd) we can detect mounting pressure and kill less important
processes before device becomes visibly sluggish. In our memory stress
testing psi memory monitors produce roughly 10x less false positives
compared to vmpressure signals. Having ability to specify multiple
triggers for the same psi metric allows other parts of Android framework
to monitor memory state of the device and act accordingly.
The new interface is straight-forward. The user opens one of the
pressure files for writing and writes a trigger description into the
file descriptor that defines the stall state - some or full, and the
maximum stall time over a given window of time. E.g.:
/* Signal when stall time exceeds 100ms of a 1s window */
char trigger[] = "full 100000 1000000"
fd = open("/proc/pressure/memory")
write(fd, trigger, sizeof(trigger))
while (poll() >= 0) {
...
};
close(fd);
When the monitored stall state is entered, psi adapts its aggregation
frequency according to what the configured time window requires in order
to emit event signals in a timely fashion. Once the stalling subsides,
aggregation reverts back to normal.
The trigger is associated with the open file descriptor. To stop
monitoring, the user only needs to close the file descriptor and the
trigger is discarded.
Patches 1-6 prepare the psi code for polling support. Patch 7
implements the adaptive polling logic, the pressure growth detection
optimized for short intervals, and hooks up write() and poll() on the
pressure files.
The patches were developed in collaboration with Johannes Weiner.
This patch (of 7):
The psi monitoring patches will need to determine the same states as
record_times(). To avoid calculating them twice, maintain a state mask
that can be consulted cheaply. Do this in a separate patch to keep the
churn in the main feature patch at a minimum.
This adds 4-byte state_mask member into psi_group_cpu struct which
results in its first cacheline-aligned part becoming 52 bytes long. Add
explicit values to enumeration element counters that affect
psi_group_cpu struct size.
Link: http://lkml.kernel.org/r/20190124211518.244221-4-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>