Pull namespace fix from Eric Biederman:
"This has a single brown bag fix.
The possible deadlock with dec_pid_namespaces that I had thought was
fixed earlier turned out only to have been moved. So instead of being
cleaver this change takes ucounts_lock with irqs disabled. So
dec_ucount can be used from any context without fear of deadlock.
The items accounted for dec_ucount and inc_ucount are all
comparatively heavy weight objects so I don't exepct this will have
any measurable performance impact"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
userns: Make ucounts lock irq-safe
William reported couple of issues in relation to direct packet
access. Typical scheme is to check for data + [off] <= data_end,
where [off] can be either immediate or coming from a tracked
register that contains an immediate, depending on the branch, we
can then access the data. However, in case of calculating [off]
for either the mentioned test itself or for access after the test
in a more "complex" way, then the verifier will stop tracking the
CONST_IMM marked register and will mark it as UNKNOWN_VALUE one.
Adding that UNKNOWN_VALUE typed register to a pkt() marked
register, the verifier then bails out in check_packet_ptr_add()
as it finds the registers imm value below 48. In the first below
example, that is due to evaluate_reg_imm_alu() not handling right
shifts and thus marking the register as UNKNOWN_VALUE via helper
__mark_reg_unknown_value() that resets imm to 0.
In the second case the same happens at the time when r4 is set
to r4 &= r5, where it transitions to UNKNOWN_VALUE from
evaluate_reg_imm_alu(). Later on r4 we shift right by 3 inside
evaluate_reg_alu(), where the register's imm turns into 3. That
is, for registers with type UNKNOWN_VALUE, imm of 0 means that
we don't know what value the register has, and for imm > 0 it
means that the value has [imm] upper zero bits. F.e. when shifting
an UNKNOWN_VALUE register by 3 to the right, no matter what value
it had, we know that the 3 upper most bits must be zero now.
This is to make sure that ALU operations with unknown registers
don't overflow. Meaning, once we know that we have more than 48
upper zero bits, or, in other words cannot go beyond 0xffff offset
with ALU ops, such an addition will track the target register
as a new pkt() register with a new id, but 0 offset and 0 range,
so for that a new data/data_end test will be required. Is the source
register a CONST_IMM one that is to be added to the pkt() register,
or the source instruction is an add instruction with immediate
value, then it will get added if it stays within max 0xffff bounds.
>From there, pkt() type, can be accessed should reg->off + imm be
within the access range of pkt().
[...]
from 28 to 30: R0=imm1,min_value=1,max_value=1
R1=pkt(id=0,off=0,r=22) R2=pkt_end
R3=imm144,min_value=144,max_value=144
R4=imm0,min_value=0,max_value=0
R5=inv48,min_value=2054,max_value=2054 R10=fp
30: (bf) r5 = r3
31: (07) r5 += 23
32: (77) r5 >>= 3
33: (bf) r6 = r1
34: (0f) r6 += r5
cannot add integer value with 0 upper zero bits to ptr_to_packet
[...]
from 52 to 80: R0=imm1,min_value=1,max_value=1
R1=pkt(id=0,off=0,r=34) R2=pkt_end R3=inv
R4=imm272 R5=inv56,min_value=17,max_value=17
R6=pkt(id=0,off=26,r=34) R10=fp
80: (07) r4 += 71
81: (18) r5 = 0xfffffff8
83: (5f) r4 &= r5
84: (77) r4 >>= 3
85: (0f) r1 += r4
cannot add integer value with 3 upper zero bits to ptr_to_packet
Thus to get above use-cases working, evaluate_reg_imm_alu() has
been extended for further ALU ops. This is fine, because we only
operate strictly within realm of CONST_IMM types, so here we don't
care about overflows as they will happen in the simulated but also
real execution and interaction with pkt() in check_packet_ptr_add()
will check actual imm value once added to pkt(), but it's irrelevant
before.
With regards to 06c1c04972 ("bpf: allow helpers access to variable
memory") that works on UNKNOWN_VALUE registers, the verifier becomes
now a bit smarter as it can better resolve ALU ops, so we need to
adapt two test cases there, as min/max bound tracking only becomes
necessary when registers were spilled to stack. So while mask was
set before to track upper bound for UNKNOWN_VALUE case, it's now
resolved directly as CONST_IMM, and such contructs are only necessary
when f.e. registers are spilled.
For commit 6b17387307 ("bpf: recognize 64bit immediate loads as
consts") that initially enabled dw load tracking only for nfp jit/
analyzer, I did couple of tests on large, complex programs and we
don't increase complexity badly (my tests were in ~3% range on avg).
I've added a couple of tests similar to affected code above, and
it works fine with verifier now.
Reported-by: William Tu <u9012063@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Gianluca Borello <g.borello@gmail.com>
Cc: William Tu <u9012063@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to initialize im_node to NULL, otherwise in case of error path
it gets passed to kfree() as uninitialized pointer.
Fixes: b95a5c4db0 ("bpf: add a longest prefix match trie map implementation")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This trie implements a longest prefix match algorithm that can be used
to match IP addresses to a stored set of ranges.
Internally, data is stored in an unbalanced trie of nodes that has a
maximum height of n, where n is the prefixlen the trie was created
with.
Tries may be created with prefix lengths that are multiples of 8, in
the range from 8 to 2048. The key used for lookup and update operations
is a struct bpf_lpm_trie_key, and the value is a uint64_t.
The code carries more information about the internal implementation.
Signed-off-by: Daniel Mack <daniel@zonque.org>
Reviewed-by: David Herrmann <dh.herrmann@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull smp/hotplug fix from Thomas Gleixner:
"Remove an unused variable which is a leftover from the notifier
removal"
* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Remove unused but set variable in _cpu_down()
Provide a simple helper with the same semantics of strncpy_from_unsafe():
int bpf_probe_read_str(void *dst, int size, const void *unsafe_addr)
This gives more flexibility to a bpf program. A typical use case is
intercepting a file name during sys_open(). The current approach is:
SEC("kprobe/sys_open")
void bpf_sys_open(struct pt_regs *ctx)
{
char buf[PATHLEN]; // PATHLEN is defined to 256
bpf_probe_read(buf, sizeof(buf), ctx->di);
/* consume buf */
}
This is suboptimal because the size of the string needs to be estimated
at compile time, causing more memory to be copied than often necessary,
and can become more problematic if further processing on buf is done,
for example by pushing it to userspace via bpf_perf_event_output(),
since the real length of the string is unknown and the entire buffer
must be copied (and defining an unrolled strnlen() inside the bpf
program is a very inefficient and unfeasible approach).
With the new helper, the code can easily operate on the actual string
length rather than the buffer size:
SEC("kprobe/sys_open")
void bpf_sys_open(struct pt_regs *ctx)
{
char buf[PATHLEN]; // PATHLEN is defined to 256
int res = bpf_probe_read_str(buf, sizeof(buf), ctx->di);
/* consume buf, for example push it to userspace via
* bpf_perf_event_output(), but this time we can use
* res (the string length) as event size, after checking
* its boundaries.
*/
}
Another useful use case is when parsing individual process arguments or
individual environment variables navigating current->mm->arg_start and
current->mm->env_start: using this helper and the return value, one can
quickly iterate at the right offset of the memory area.
The code changes simply leverage the already existent
strncpy_from_unsafe() kernel function, which is safe to be called from a
bpf program as it is used in bpf_trace_printk().
Signed-off-by: Gianluca Borello <g.borello@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Revert commit 08b98d3291 (PM / sleep / ACPI: Use the ACPI_FADT_LOW_POWER_S0
flag) as it caused system suspend (in the default configuration) to fail
on Dell XPS13 (9360) with the Kaby Lake processor.
Fixes: 08b98d3291 (PM / sleep / ACPI: Use the ACPI_FADT_LOW_POWER_S0 flag)
Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This patch adds two helpers, bpf_map_area_alloc() and bpf_map_area_free(),
that are to be used for map allocations. Using kmalloc() for very large
allocations can cause excessive work within the page allocator, so i) fall
back earlier to vmalloc() when the attempt is considered costly anyway,
and even more importantly ii) don't trigger OOM killer with any of the
allocators.
Since this is based on a user space request, for example, when creating
maps with element pre-allocation, we really want such requests to fail
instead of killing other user space processes.
Also, don't spam the kernel log with warnings should any of the allocations
fail under pressure. Given that, we can make backend selection in
bpf_map_area_alloc() generic, and convert all maps over to use this API
for spots with potentially large allocation requests.
Note, replacing the one kmalloc_array() is fine as overflow checks happen
earlier in htab_map_alloc(), since it must also protect the multiplication
for vmalloc() should kmalloc_array() fail.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull SMP hotplug update from Thomas Gleixner:
"This contains a trivial typo fix and an extension to the core code for
dynamically allocating states in the prepare stage.
The extension is necessary right now because we need a proper way to
unbreak LTTNG, which iscurrently non functional due to the removal of
the notifiers. Surely it's out of tree, but it's widely used by
distros.
The simple solution would have been to reserve a state for LTTNG, but
I'm not fond about unused crap in the kernel and the dynamic range,
which we admittedly should have done right away, allows us to remove
quite some of the hardcoded states, i.e. those which have no ordering
requirements. So doing the right thing now is better than having an
smaller intermediate solution which needs to be reworked anyway"
* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Provide dynamic range for prepare stage
perf/x86/amd/ibs: Fix typo after cleanup state names in cpu/hotplug
Pull RCU fixes from Ingo Molnar:
"This fixes sporadic ACPI related hangs in synchronize_rcu() that were
caused by the ACPI code mistakenly relying on an aspect of RCU that
was neither promised to work nor reliable but which happened to work -
until in v4.9 we changed the RCU implementation, which made the hangs
more prominent.
Since the mis-use of the RCU facility wasn't properly detected and
prevented either, these fixes make the RCU side work reliably instead
of working around the problem in the ACPI code.
Hence the slightly larger diffstat that goes beyond the normal scope
of RCU fixes in -rc kernels"
* 'rcu-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rcu: Narrow early boot window of illegal synchronous grace periods
rcu: Remove cond_resched() from Tiny synchronize_sched()
After the recent removal of the hotplug notifiers the variable 'hasdied' in
_cpu_down() is set but no longer read, leading to the following GCC warning
when building with 'make W=1':
kernel/cpu.c:767:7: warning: variable ‘hasdied’ set but not used [-Wunused-but-set-variable]
Fix it by removing the variable.
Fixes: 530e9b76ae ("cpu/hotplug: Remove obsolete cpu hotplug register/unregister functions")
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20170117143501.20893-1-tklauser@distanz.ch
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
- Fix out-of-tree module breakage when it supplies its own
definitions of true and false
Signed-off-by: Jessica Yu <jeyu@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJYfm7sAAoJEMBFfjjOO8FyrmMP/j1Sa179+uBWWPE0Td7ip6yj
EgvOtZGcnZfMuHbs5Evn8Fnz5K3of3IriiJNPePuQPu/YnoidltxkWOMXYkCj+Fn
acW8VRtrh2urec70gRapuTmSpxs1I/XLUdNG+Ozm0FFX+L+k0ydCqEPGuVkwyHNK
Wn31lVTiqx+zWm5PAJBzD6dEchQ0h2uppHRmZ+mIn3GyvYavIGnMMkdjqEEq9v8w
UYdw52AJFAGMDO8LoSihX5cFbe0E28A58jkJuJ5AKXglaY6Nvl2xWOxfLhFnxO1m
7KFuf+q2YO10hoJtdItEmPw2iC8pIgoAUGpZ+4h0iSWxyUC5V4QEmrhe4q9CtOLD
+dfcd+44UekvWiWL4AQUO6IsUzIo8UqsJYf4Tic4/EjAKZtGTseKjQqCgBv3kJA+
nN3hJ9gMN4NZWOMLihSn7Ml/whrxchdqlEP520nzGTnWUaLOUPp4XhfNlDaAH58K
WfxiT0L6w+Cbg3xMCZRxQyqlJWWw8x1CM7B6eScHvN67TulC2enIQYTkv6eDOzQX
DPz4lvcFisjASFP+i+3ouYL2pfLnm/IUG9K1wieqBvPHEdeZBuGr7+VEHjFmhhmG
f3kKvYsRgUQF8tKeGtI0uPxnZNw4z4QaYOVKf8bISzpsIHOezdaouOm+KGUctHqO
DNIWMf34W7fE5AVrGKp6
=ybIc
-----END PGP SIGNATURE-----
Merge tag 'modules-for-v4.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux
Pull modules fix from Jessica Yu:
- fix out-of-tree module breakage when it supplies its own definitions
of true and false
* tag 'modules-for-v4.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
taint/module: Fix problems when out-of-kernel driver defines true or false
Commit 7fd8329ba5 ("taint/module: Clean up global and module taint
flags handling") used the key words true and false as character members
of a new struct. These names cause problems when out-of-kernel modules
such as VirtualBox include their own definitions of true and false.
Fixes: 7fd8329ba5 ("taint/module: Clean up global and module taint flags handling")
Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Jessica Yu <jeyu@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jessica Yu <jeyu@redhat.com>
Pull networking fixes from David Miller:
1) Handle multicast packets properly in fast-RX path of mac80211, from
Johannes Berg.
2) Because of a logic bug, the user can't actually force SW
checksumming on r8152 devices. This makes diagnosis of hw
checksumming bugs really annoying. Fix from Hayes Wang.
3) VXLAN route lookup does not take the source and destination ports
into account, which means IPSEC policies cannot be matched properly.
Fix from Martynas Pumputis.
4) Do proper RCU locking in netvsc callbacks, from Stephen Hemminger.
5) Fix SKB leaks in mlxsw driver, from Arkadi Sharshevsky.
6) If lwtunnel_fill_encap() fails, we do not abort the netlink message
construction properly in fib_dump_info(), from David Ahern.
7) Do not use kernel stack for DMA buffers in atusb driver, from Stefan
Schmidt.
8) Openvswitch conntack actions need to maintain a correct checksum,
fix from Lance Richardson.
9) ax25_disconnect() is missing a check for ax25->sk being NULL, in
fact it already checks this, but not in all of the necessary spots.
Fix from Basil Gunn.
10) Action GET operations in the packet scheduler can erroneously bump
the reference count of the entry, making it unreleasable. Fix from
Jamal Hadi Salim. Jamal gives a great set of example command lines
that trigger this in the commit message.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (46 commits)
net sched actions: fix refcnt when GETing of action after bind
net/mlx4_core: Eliminate warning messages for SRQ_LIMIT under SRIOV
net/mlx4_core: Fix when to save some qp context flags for dynamic VST to VGT transitions
net/mlx4_core: Fix racy CQ (Completion Queue) free
net: stmmac: don't use netdev_[dbg, info, ..] before net_device is registered
net/mlx5e: Fix a -Wmaybe-uninitialized warning
ax25: Fix segfault after sock connection timeout
bpf: rework prog_digest into prog_tag
tipc: allocate user memory with GFP_KERNEL flag
net: phy: dp83867: allow RGMII_TXID/RGMII_RXID interface types
ip6_tunnel: Account for tunnel header in tunnel MTU
mld: do not remove mld souce list info when set link down
be2net: fix MAC addr setting on privileged BE3 VFs
be2net: don't delete MAC on close on unprivileged BE3 VFs
be2net: fix status check in be_cmd_pmac_add()
cpmac: remove hopeless #warning
ravb: do not use zero-length alignment DMA descriptor
mlx4: do not call napi_schedule() without care
openvswitch: maintain correct checksum state in conntrack actions
tcp: fix tcp_fastopen unaligned access complaints on sparc
...
Make sure that ctx cannot potentially be accessed oob by asserting
explicitly that ctx access size into pt_regs for BPF_PROG_TYPE_KPROBE
programs must be within limits. In case some 32bit archs have pt_regs
not being a multiple of 8, then BPF_DW access could cause such access.
BPF_PROG_TYPE_KPROBE progs don't have a ctx conversion function since
there's no extra mapping needed. kprobe_prog_is_valid_access() didn't
enforce sizeof(long) as the only allowed access size, since LLVM can
generate non BPF_W/BPF_DW access to regs from time to time.
For BPF_PROG_TYPE_TRACEPOINT we don't have a ctx conversion either, so
add a BUILD_BUG_ON() check to make sure that BPF_DW access will not be
a similar issue in future (ctx works on event buffer as opposed to
pt_regs there).
Fixes: 2541517c32 ("tracing, perf: Implement BPF programs attached to kprobes")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 7bd509e311 ("bpf: add prog_digest and expose it via
fdinfo/netlink") was recently discussed, partially due to
admittedly suboptimal name of "prog_digest" in combination
with sha1 hash usage, thus inevitably and rightfully concerns
about its security in terms of collision resistance were
raised with regards to use-cases.
The intended use cases are for debugging resp. introspection
only for providing a stable "tag" over the instruction sequence
that both kernel and user space can calculate independently.
It's not usable at all for making a security relevant decision.
So collisions where two different instruction sequences generate
the same tag can happen, but ideally at a rather low rate. The
"tag" will be dumped in hex and is short enough to introspect
in tracepoints or kallsyms output along with other data such
as stack trace, etc. Thus, this patch performs a rename into
prog_tag and truncates the tag to a short output (64 bits) to
make it obvious it's not collision-free.
Should in future a hash or facility be needed with a security
relevant focus, then we can think about requirements, constraints,
etc that would fit to that situation. For now, rework the exposed
parts for the current use cases as long as nothing has been
released yet. Tested on x86_64 and s390x.
Fixes: 7bd509e311 ("bpf: add prog_digest and expose it via fdinfo/netlink")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mathieu reported that the LTTNG modules are broken as of 4.10-rc1 due to
the removal of the cpu hotplug notifiers.
Usually I don't care much about out of tree modules, but LTTNG is widely
used in distros. There are two ways to solve that:
1) Reserve a hotplug state for LTTNG
2) Add a dynamic range for the prepare states.
While #1 is the simplest solution, #2 is the proper one as we can convert
in tree users, which do not care about ordering, to the dynamic range as
well.
Add a dynamic range which allows LTTNG to request states in the prepare
stage.
Reported-and-tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Sewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1701101353010.3401@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull an urgent RCU fix from Paul E. McKenney:
"This series contains a pair of commits that permit RCU synchronous grace
periods (synchronize_rcu() and friends) to work correctly throughout boot.
This eliminates the current "dead time" starting when the scheduler spawns
its first taks and ending when the last of RCU's kthreads is spawned
(this last happens during early_initcall() time). Although RCU's
synchronous grace periods have long been documented as not working
during this time, prior to 4.9, the expedited grace periods worked by
accident, and some ACPI code came to rely on this unintentional behavior.
(Note that this unintentional behavior was -not- reliable. For example,
failures from ACPI could occur on !SMP systems and on systems booting
with the rcu_normal kernel boot parameter.)
Either way, there is a bug that needs fixing, and the 4.9 switch of RCU's
expedited grace periods to workqueues could be considered to have caused
a regression. This series therefore makes RCU's expedited grace periods
operate correctly throughout the boot process. This has been demonstrated
to fix the problems ACPI was encountering, and has the added longer-term
benefit of simplifying RCU's behavior."
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull namespace fixes from Eric Biederman:
"This tree contains 4 fixes.
The first is a fix for a race that can causes oopses under the right
circumstances, and that someone just recently encountered.
Past that are several small trivial correct fixes. A real issue that
was blocking development of an out of tree driver, but does not appear
to have caused any actual problems for in-tree code. A potential
deadlock that was reported by lockdep. And a deadlock people have
experienced and took the time to track down caused by a cleanup that
removed the code to drop a reference count"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
sysctl: Drop reference added by grab_header in proc_sys_readdir
pid: fix lockdep deadlock warning due to ucount_lock
libfs: Modify mount_pseudo_xattr to be clear it is not a userspace mount
mnt: Protect the mountpoint hashtable with mount_lock
Pull NOHZ fix from Ingo Molnar:
"This fixes an old NOHZ race where we incorrectly calculate the next
timer interrupt in certain circumstances where hrtimers are pending,
that can cause hard to reproduce stalled-values artifacts in
/proc/stat"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
nohz: Fix collision between tick and other hrtimers
Pull perf fixes from Ingo Molnar:
"Misc race fixes uncovered by fuzzing efforts, a Sparse fix, two PMU
driver fixes, plus miscellanous tooling fixes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86: Reject non sampling events with precise_ip
perf/x86/intel: Account interrupts for PEBS errors
perf/core: Fix concurrent sys_perf_event_open() vs. 'move_group' race
perf/core: Fix sys_perf_event_open() vs. hotplug
perf/x86/intel: Use ULL constant to prevent undefined shift behaviour
perf/x86/intel/uncore: Fix hardcoded socket 0 assumption in the Haswell init code
perf/x86: Set pmu->module in Intel PMU modules
perf probe: Fix to probe on gcc generated symbols for offline kernel
perf probe: Fix --funcs to show correct symbols for offline module
perf symbols: Robustify reading of build-id from sysfs
perf tools: Install tools/lib/traceevent plugins with install-bin
tools lib traceevent: Fix prev/next_prio for deadline tasks
perf record: Fix --switch-output documentation and comment
perf record: Make __record_options static
tools lib subcmd: Add OPT_STRING_OPTARG_SET option
perf probe: Fix to get correct modname from elf header
samples/bpf trace_output_user: Remove duplicate sys/ioctl.h include
samples/bpf sock_example: Avoid getting ethhdr from two includes
perf sched timehist: Show total scheduling time
The current preemptible RCU implementation goes through three phases
during bootup. In the first phase, there is only one CPU that is running
with preemption disabled, so that a no-op is a synchronous grace period.
In the second mid-boot phase, the scheduler is running, but RCU has
not yet gotten its kthreads spawned (and, for expedited grace periods,
workqueues are not yet running. During this time, any attempt to do
a synchronous grace period will hang the system (or complain bitterly,
depending). In the third and final phase, RCU is fully operational and
everything works normally.
This has been OK for some time, but there has recently been some
synchronous grace periods showing up during the second mid-boot phase.
This code worked "by accident" for awhile, but started failing as soon
as expedited RCU grace periods switched over to workqueues in commit
8b355e3bc1 ("rcu: Drive expedited grace periods from workqueue").
Note that the code was buggy even before this commit, as it was subject
to failure on real-time systems that forced all expedited grace periods
to run as normal grace periods (for example, using the rcu_normal ksysfs
parameter). The callchain from the failure case is as follows:
early_amd_iommu_init()
|-> acpi_put_table(ivrs_base);
|-> acpi_tb_put_table(table_desc);
|-> acpi_tb_invalidate_table(table_desc);
|-> acpi_tb_release_table(...)
|-> acpi_os_unmap_memory
|-> acpi_os_unmap_iomem
|-> acpi_os_map_cleanup
|-> synchronize_rcu_expedited
The kernel showing this callchain was built with CONFIG_PREEMPT_RCU=y,
which caused the code to try using workqueues before they were
initialized, which did not go well.
This commit therefore reworks RCU to permit synchronous grace periods
to proceed during this mid-boot phase. This commit is therefore a
fix to a regression introduced in v4.9, and is therefore being put
forward post-merge-window in v4.10.
This commit sets a flag from the existing rcu_scheduler_starting()
function which causes all synchronous grace periods to take the expedited
path. The expedited path now checks this flag, using the requesting task
to drive the expedited grace period forward during the mid-boot phase.
Finally, this flag is updated by a core_initcall() function named
rcu_exp_runtime_mode(), which causes the runtime codepaths to be used.
Note that this arrangement assumes that tasks are not sent POSIX signals
(or anything similar) from the time that the first task is spawned
through core_initcall() time.
Fixes: 8b355e3bc1 ("rcu: Drive expedited grace periods from workqueue")
Reported-by: "Zheng, Lv" <lv.zheng@intel.com>
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Stan Kain <stan.kain@gmail.com>
Tested-by: Ivan <waffolz@hotmail.com>
Tested-by: Emanuel Castelo <emanuel.castelo@gmail.com>
Tested-by: Bruno Pesavento <bpesavento@infinito.it>
Tested-by: Borislav Petkov <bp@suse.de>
Tested-by: Frederic Bezies <fredbezies@gmail.com>
Cc: <stable@vger.kernel.org> # 4.9.0-
It is now legal to invoke synchronize_sched() at early boot, which causes
Tiny RCU's synchronize_sched() to emit spurious splats. This commit
therefore removes the cond_resched() from Tiny RCU's synchronize_sched().
Fixes: 8b355e3bc1 ("rcu: Drive expedited grace periods from workqueue")
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org> # 4.9.0-
It's possible to set up PEBS events to get only errors and not
any data, like on SNB-X (model 45) and IVB-EP (model 62)
via 2 perf commands running simultaneously:
taskset -c 1 ./perf record -c 4 -e branches:pp -j any -C 10
This leads to a soft lock up, because the error path of the
intel_pmu_drain_pebs_nhm() does not account event->hw.interrupt
for error PEBS interrupts, so in case you're getting ONLY
errors you don't have a way to stop the event when it's over
the max_samples_per_tick limit:
NMI watchdog: BUG: soft lockup - CPU#22 stuck for 22s! [perf_fuzzer:5816]
...
RIP: 0010:[<ffffffff81159232>] [<ffffffff81159232>] smp_call_function_single+0xe2/0x140
...
Call Trace:
? trace_hardirqs_on_caller+0xf5/0x1b0
? perf_cgroup_attach+0x70/0x70
perf_install_in_context+0x199/0x1b0
? ctx_resched+0x90/0x90
SYSC_perf_event_open+0x641/0xf90
SyS_perf_event_open+0x9/0x10
do_syscall_64+0x6c/0x1f0
entry_SYSCALL64_slow_path+0x25/0x25
Add perf_event_account_interrupt() which does the interrupt
and frequency checks and call it from intel_pmu_drain_pebs_nhm()'s
error path.
We keep the pending_kill and pending_wakeup logic only in the
__perf_event_overflow() path, because they make sense only if
there's any data to deliver.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vince@deater.net>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1482931866-6018-2-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Di Shen reported a race between two concurrent sys_perf_event_open()
calls where both try and move the same pre-existing software group
into a hardware context.
The problem is exactly that described in commit:
f63a8daa58 ("perf: Fix event->ctx locking")
... where, while we wait for a ctx->mutex acquisition, the event->ctx
relation can have changed under us.
That very same commit failed to recognise sys_perf_event_context() as an
external access vector to the events and thereby didn't apply the
established locking rules correctly.
So while one sys_perf_event_open() call is stuck waiting on
mutex_lock_double(), the other (which owns said locks) moves the group
about. So by the time the former sys_perf_event_open() acquires the
locks, the context we've acquired is stale (and possibly dead).
Apply the established locking rules as per perf_event_ctx_lock_nested()
to the mutex_lock_double() for the 'move_group' case. This obviously means
we need to validate state after we acquire the locks.
Reported-by: Di Shen (Keen Lab)
Tested-by: John Dias <joaodias@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Min Chong <mchong@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: f63a8daa58 ("perf: Fix event->ctx locking")
Link: http://lkml.kernel.org/r/20170106131444.GZ3174@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There is problem with installing an event in a task that is 'stuck' on
an offline CPU.
Blocked tasks are not dis-assosciated from offlined CPUs, after all, a
blocked task doesn't run and doesn't require a CPU etc.. Only on
wakeup do we ammend the situation and place the task on a available
CPU.
If we hit such a task with perf_install_in_context() we'll loop until
either that task wakes up or the CPU comes back online, if the task
waking depends on the event being installed, we're stuck.
While looking into this issue, I also spotted another problem, if we
hit a task with perf_install_in_context() that is in the middle of
being migrated, that is we observe the old CPU before sending the IPI,
but run the IPI (on the old CPU) while the task is already running on
the new CPU, things also go sideways.
Rework things to rely on task_curr() -- outside of rq->lock -- which
is rather tricky. Imagine the following scenario where we're trying to
install the first event into our task 't':
CPU0 CPU1 CPU2
(current == t)
t->perf_event_ctxp[] = ctx;
smp_mb();
cpu = task_cpu(t);
switch(t, n);
migrate(t, 2);
switch(p, t);
ctx = t->perf_event_ctxp[]; // must not be NULL
smp_function_call(cpu, ..);
generic_exec_single()
func();
spin_lock(ctx->lock);
if (task_curr(t)) // false
add_event_to_ctx();
spin_unlock(ctx->lock);
perf_event_context_sched_in();
spin_lock(ctx->lock);
// sees event
So its CPU0's store of t->perf_event_ctxp[] that must not go 'missing'.
Because if CPU2's load of that variable were to observe NULL, it would
not try to schedule the ctx and we'd have a task running without its
counter, which would be 'bad'.
As long as we observe !NULL, we'll acquire ctx->lock. If we acquire it
first and not see the event yet, then CPU0 must observe task_curr()
and retry. If the install happens first, then we must see the event on
sched-in and all is well.
I think we can translate the first part (until the 'must not be NULL')
of the scenario to a litmus test like:
C C-peterz
{
}
P0(int *x, int *y)
{
int r1;
WRITE_ONCE(*x, 1);
smp_mb();
r1 = READ_ONCE(*y);
}
P1(int *y, int *z)
{
WRITE_ONCE(*y, 1);
smp_store_release(z, 1);
}
P2(int *x, int *z)
{
int r1;
int r2;
r1 = smp_load_acquire(z);
smp_mb();
r2 = READ_ONCE(*x);
}
exists
(0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
Where:
x is perf_event_ctxp[],
y is our tasks's CPU, and
z is our task being placed on the rq of CPU2.
The P0 smp_mb() is the one added by this patch, ordering the store to
perf_event_ctxp[] from find_get_context() and the load of task_cpu()
in task_function_call().
The smp_store_release/smp_load_acquire model the RCpc locking of the
rq->lock and the smp_mb() of P2 is the context switch switching from
whatever CPU2 was running to our task 't'.
This litmus test evaluates into:
Test C-peterz Allowed
States 7
0:r1=0; 2:r1=0; 2:r2=0;
0:r1=0; 2:r1=0; 2:r2=1;
0:r1=0; 2:r1=1; 2:r2=1;
0:r1=1; 2:r1=0; 2:r2=0;
0:r1=1; 2:r1=0; 2:r2=1;
0:r1=1; 2:r1=1; 2:r2=0;
0:r1=1; 2:r1=1; 2:r2=1;
No
Witnesses
Positive: 0 Negative: 7
Condition exists (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
Observation C-peterz Never 0 7
Hash=e427f41d9146b2a5445101d3e2fcaa34
And the strong and weak model agree.
Reported-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Will Deacon <will.deacon@arm.com>
Cc: jeremy.linton@arm.com
Link: http://lkml.kernel.org/r/20161209135900.GU3174@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
- Cleanups and bug fixes for the mtty sample driver (Dan Carpenter)
- Export and make use of has_capability() to fix incorrect use of
ns_capable() for testing task capabilities (Jike Song)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJYeTWfAAoJECObm247sIsi1PEP/0lIkIQWBUlVWC1QA6bJN0Xx
9c4pA34kLJwCpEtEoPxf6owjgK7kSBgIUUqBaNNDdKZQYttGgA+qiX3HhuvEigKL
vEq5/TqwL6vv2aIUp/5uPP4NNTJD8RynwkfDI1B8DVQN6E1GM2zozpFUiZbDUxz/
sgIuby9nuG3WTVLgOVayyMHlPTXG1+l+quRlAhMAseD7LMx7q/71NIjKggSUFRQG
fkOVVTqfCnLJmIyq/cWbJt2cDgeWQq2/Ik6gje3SiOFtxi8fRdlzONUL+tHM1KgT
r0htrq+r3B7BxI0CMZuoHIBt1SK443yu39xDzb0iXDSb5W9gwR14uFMuXv1ftfM0
qkZnvpsXaT6wpKvK2ztmHgUiKJmOTgYrG77Dhz4oz6Mm0Y1mn6bV4yueoF/rQIn0
GrM1Af/SVLf3Vhxw6i5a1s7kDgpySw8FfucKO5Xv3cOaIgNtlrrjxbKKa9DZ3wd7
mnjD30XHwxEim8OCgv7CFswPsc5TiqYJTKGbnSJGo67ZCXWxXFHLIab0cn5yMd8G
Qgw4mLnIv2rkRZOWpgMy4PedCNjZXNuQbW3I90kDb/VlPvRdCqUIsO0Ty10yaNhe
s8Gwmxphoi3U/J7Y4T/BsfkCZ4Umut9gAt/WsG4kgWj3v0FOmxLgl39lC0cRigR6
l7HSf0fOg/D9k6EN1xnc
=yeS9
-----END PGP SIGNATURE-----
Merge tag 'vfio-v4.10-rc4' of git://github.com/awilliam/linux-vfio
Pull VFIO fixes from Alex Williamson:
- Cleanups and bug fixes for the mtty sample driver (Dan Carpenter)
- Export and make use of has_capability() to fix incorrect use of
ns_capable() for testing task capabilities (Jike Song)
* tag 'vfio-v4.10-rc4' of git://github.com/awilliam/linux-vfio:
vfio/type1: Remove pid_namespace.h include
vfio iommu type1: fix the testing of capability for remote task
capability: export has_capability
vfio-mdev: remove some dead code
vfio-mdev: buffer overflow in ioctl()
vfio-mdev: return -EFAULT if copy_to_user() fails
other buggy modules!)
* two NULL pointer dereferences from syzkaller
* CVE from syzkaller, very serious on 4.10-rc, "just" kernel memory
leak on releases
* CVE from security@kernel.org, somewhat serious on AMD, less so on
Intel
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJYd7l5AAoJEL/70l94x66DLWYH/0GUg+lK9J/gj0kwqi6BwsOP
Rrs5Y7XvyNLsy/piBrrHDHvRa+DfAkrU8nepwgygX/yuGmSDV/zmdIb8XA/dvKht
MN285NFlVjTyznYlU/LH3etx11CHLMNclishiFHQbcnohtvhOe+fvN6RVNdfeRxm
d9iBPOum15ikc1xDl2z8Op+ZXVjMxkgLkzIXFcDBpJf4BvUx0X+ZHZXIKdizVhgU
ZMD2ds/MutMB8X1A52qp6kQvT7xE4rp87M0So4qDMTbAto5G4ZmMaWC5MlK2Oxe/
o+3qnx4vVz4H6uYzg1N4diHiC+buhgtXCLwwkcUOKKUVqJRP9e0Bh7kw8JA52XU=
=C+tM
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
- fix for module unload vs deferred jump labels (note: there might be
other buggy modules!)
- two NULL pointer dereferences from syzkaller
- also syzkaller: fix emulation of fxsave/fxrstor/sgdt/sidt, problem
made worse during this merge window, "just" kernel memory leak on
releases
- fix emulation of "mov ss" - somewhat serious on AMD, less so on Intel
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: fix emulation of "MOV SS, null selector"
KVM: x86: fix NULL deref in vcpu_scan_ioapic
KVM: eventfd: fix NULL deref irqbypass consumer
KVM: x86: Introduce segmented_write_std
KVM: x86: flush pending lapic jump label updates on module unload
jump_labels: API for flushing deferred jump label updates
When structs are used to store temporary state in cb[] buffer that is
used with programs and among tail calls, then the generated code will
not always access the buffer in bpf_w chunks. We can ease programming
of it and let this act more natural by allowing for aligned b/h/w/dw
sized access for cb[] ctx member. Various test cases are attached as
well for the selftest suite. Potentially, this can also be reused for
other program types to pass data around.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, when calling convert_ctx_access() callback for the various
program types, we pass in insn->dst_reg, insn->src_reg, insn->off from
the original instruction. This information is needed to rewrite the
instruction that is based on the user ctx structure into a kernel
representation for the ctx. As we'd like to allow access size beyond
just BPF_W, we'd need also insn->code for that in order to decode the
original access size. Given that, lets just pass insn directly to the
convert_ctx_access() callback and work on that to not clutter the
callback with even more arguments we need to pass when everything is
already contained in insn. So lets go through that once, no functional
change.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
has_capability() is sometimes needed by modules to test capability
for specified task other than current, so export it.
Cc: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Jike Song <jike.song@intel.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Modules that use static_key_deferred need a way to synchronize with
any delayed work that is still pending when the module is unloaded.
Introduce static_key_deferred_flush() which flushes any pending
jump label updates.
Signed-off-by: David Matlack <dmatlack@google.com>
Cc: stable@vger.kernel.org
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When the tick is stopped and an interrupt occurs afterward, we check on
that interrupt exit if the next tick needs to be rescheduled. If it
doesn't need any update, we don't want to do anything.
In order to check if the tick needs an update, we compare it against the
clockevent device deadline. Now that's a problem because the clockevent
device is at a lower level than the tick itself if it is implemented
on top of hrtimer.
Every hrtimer share this clockevent device. So comparing the next tick
deadline against the clockevent device deadline is wrong because the
device may be programmed for another hrtimer whose deadline collides
with the tick. As a result we may end up not reprogramming the tick
accidentally.
In a worst case scenario under full dynticks mode, the tick stops firing
as it is supposed to every 1hz, leaving /proc/stat stalled:
Task in a full dynticks CPU
----------------------------
* hrtimer A is queued 2 seconds ahead
* the tick is stopped, scheduled 1 second ahead
* tick fires 1 second later
* on tick exit, nohz schedules the tick 1 second ahead but sees
the clockevent device is already programmed to that deadline,
fooled by hrtimer A, the tick isn't rescheduled.
* hrtimer A is cancelled before its deadline
* tick never fires again until an interrupt happens...
In order to fix this, store the next tick deadline to the tick_sched
local structure and reuse that value later to check whether we need to
reprogram the clock after an interrupt.
On the other hand, ts->sleep_length still wants to know about the next
clock event and not just the tick, so we want to improve the related
comment to avoid confusion.
Reported-by: James Hartsock <hartsjc@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Rik van Riel <riel@redhat.com>
Link: http://lkml.kernel.org/r/1483539124-5693-1-git-send-email-fweisbec@gmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Since commit 00cd5c37af ("ptrace: permit ptracing of /sbin/init") we
can now trace init processes. init is initially protected with
SIGNAL_UNKILLABLE which will prevent fatal signals such as SIGSTOP, but
there are a number of paths during tracing where SIGNAL_UNKILLABLE can
be implicitly cleared.
This can result in init becoming stoppable/killable after tracing. For
example, running:
while true; do kill -STOP 1; done &
strace -p 1
and then stopping strace and the kill loop will result in init being
left in state TASK_STOPPED. Sending SIGCONT to init will resume it, but
init will now respond to future SIGSTOP signals rather than ignoring
them.
Make sure that when setting SIGNAL_STOP_CONTINUED/SIGNAL_STOP_STOPPED
that we don't clear SIGNAL_UNKILLABLE.
Link: http://lkml.kernel.org/r/20170104122017.25047-1-jamie.iles@oracle.com
Signed-off-by: Jamie Iles <jamie.iles@oracle.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both arch_add_memory() and arch_remove_memory() expect a single threaded
context.
For example, arch/x86/mm/init_64.c::kernel_physical_mapping_init() does
not hold any locks over this check and branch:
if (pgd_val(*pgd)) {
pud = (pud_t *)pgd_page_vaddr(*pgd);
paddr_last = phys_pud_init(pud, __pa(vaddr),
__pa(vaddr_end),
page_size_mask);
continue;
}
pud = alloc_low_page();
paddr_last = phys_pud_init(pud, __pa(vaddr), __pa(vaddr_end),
page_size_mask);
The result is that two threads calling devm_memremap_pages()
simultaneously can end up colliding on pgd initialization. This leads
to crash signatures like the following where the loser of the race
initializes the wrong pgd entry:
BUG: unable to handle kernel paging request at ffff888ebfff0000
IP: memcpy_erms+0x6/0x10
PGD 2f8e8fc067 PUD 0 /* <---- Invalid PUD */
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
CPU: 54 PID: 3818 Comm: systemd-udevd Not tainted 4.6.7+ #13
task: ffff882fac290040 ti: ffff882f887a4000 task.ti: ffff882f887a4000
RIP: memcpy_erms+0x6/0x10
[..]
Call Trace:
? pmem_do_bvec+0x205/0x370 [nd_pmem]
? blk_queue_enter+0x3a/0x280
pmem_rw_page+0x38/0x80 [nd_pmem]
bdev_read_page+0x84/0xb0
Hold the standard memory hotplug mutex over calls to
arch_{add,remove}_memory().
Fixes: 41e94a8513 ("add devm_memremap_pages")
Link: http://lkml.kernel.org/r/148357647831.9498.12606007370121652979.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 01b3f52157 ("bpf: fix allocation warnings in bpf maps and
integer overflow") has added checks for the maximum allocateable size.
It (ab)used KMALLOC_SHIFT_MAX for that purpose.
While this is not incorrect it is not very clean because we already have
KMALLOC_MAX_SIZE for this very reason so let's change both checks to use
KMALLOC_MAX_SIZE instead.
The original motivation for using KMALLOC_SHIFT_MAX was to work around
an incorrect KMALLOC_MAX_SIZE which could lead to allocation warnings
but it is no longer needed since "slab: make sure that KMALLOC_MAX_SIZE
will fit into MAX_ORDER".
Link: http://lkml.kernel.org/r/20161220130659.16461-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make the functions __local_list_pop_free(), __local_list_pop_pending(),
bpf_common_lru_populate() and bpf_percpu_lru_populate() static as they
are not used outide of bpf_lru_list.c
This fixes the following GCC warnings when building with 'W=1':
kernel/bpf/bpf_lru_list.c:363:22: warning: no previous prototype for ‘__local_list_pop_free’ [-Wmissing-prototypes]
kernel/bpf/bpf_lru_list.c:376:22: warning: no previous prototype for ‘__local_list_pop_pending’ [-Wmissing-prototypes]
kernel/bpf/bpf_lru_list.c:560:6: warning: no previous prototype for ‘bpf_common_lru_populate’ [-Wmissing-prototypes]
kernel/bpf/bpf_lru_list.c:577:6: warning: no previous prototype for ‘bpf_percpu_lru_populate’ [-Wmissing-prototypes]
Cc: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the unused but set variable 'first_node' in
__bpf_lru_list_shrink_inactive() to fix the following GCC warning when
building with 'W=1':
kernel/bpf/bpf_lru_list.c:216:41: warning: variable ‘first_node’ set but not used [-Wunused-but-set-variable]
Cc: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
=========================================================
[ INFO: possible irq lock inversion dependency detected ]
4.10.0-rc2-00024-g4aecec9-dirty #118 Tainted: G W
---------------------------------------------------------
swapper/1/0 just changed the state of lock:
(&(&sighand->siglock)->rlock){-.....}, at: [<ffffffffbd0a1bc6>] __lock_task_sighand+0xb6/0x2c0
but this lock took another, HARDIRQ-unsafe lock in the past:
(ucounts_lock){+.+...}
and interrupts could create inverse lock ordering between them.
other info that might help us debug this:
Chain exists of: &(&sighand->siglock)->rlock --> &(&tty->ctrl_lock)->rlock --> ucounts_lock
Possible interrupt unsafe locking scenario:
CPU0 CPU1
---- ----
lock(ucounts_lock);
local_irq_disable();
lock(&(&sighand->siglock)->rlock);
lock(&(&tty->ctrl_lock)->rlock);
<Interrupt>
lock(&(&sighand->siglock)->rlock);
*** DEADLOCK ***
This patch removes a dependency between rlock and ucount_lock.
Fixes: f333c700c6 ("pidns: Add a limit on the number of pid namespaces")
Cc: stable@vger.kernel.org
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
since ARG_PTR_TO_STACK is no longer just pointer to stack
rename it to ARG_PTR_TO_MEM and adjust comment.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, helpers that read and write from/to the stack can do so using
a pair of arguments of type ARG_PTR_TO_STACK and ARG_CONST_STACK_SIZE.
ARG_CONST_STACK_SIZE accepts a constant register of type CONST_IMM, so
that the verifier can safely check the memory access. However, requiring
the argument to be a constant can be limiting in some circumstances.
Since the current logic keeps track of the minimum and maximum value of
a register throughout the simulated execution, ARG_CONST_STACK_SIZE can
be changed to also accept an UNKNOWN_VALUE register in case its
boundaries have been set and the range doesn't cause invalid memory
accesses.
One common situation when this is useful:
int len;
char buf[BUFSIZE]; /* BUFSIZE is 128 */
if (some_condition)
len = 42;
else
len = 84;
some_helper(..., buf, len & (BUFSIZE - 1));
The compiler can often decide to assign the constant values 42 or 48
into a variable on the stack, instead of keeping it in a register. When
the variable is then read back from stack into the register in order to
be passed to the helper, the verifier will not be able to recognize the
register as constant (the verifier is not currently tracking all
constant writes into memory), and the program won't be valid.
However, by allowing the helper to accept an UNKNOWN_VALUE register,
this program will work because the bitwise AND operation will set the
range of possible values for the UNKNOWN_VALUE register to [0, BUFSIZE),
so the verifier can guarantee the helper call will be safe (assuming the
argument is of type ARG_CONST_STACK_SIZE_OR_ZERO, otherwise one more
check against 0 would be needed). Custom ranges can be set not only with
ALU operations, but also by explicitly comparing the UNKNOWN_VALUE
register with constants.
Another very common example happens when intercepting system call
arguments and accessing user-provided data of variable size using
bpf_probe_read(). One can load at runtime the user-provided length in an
UNKNOWN_VALUE register, and then read that exact amount of data up to a
compile-time determined limit in order to fit into the proper local
storage allocated on the stack, without having to guess a suboptimal
access size at compile time.
Also, in case the helpers accepting the UNKNOWN_VALUE register operate
in raw mode, disable the raw mode so that the program is required to
initialize all memory, since there is no guarantee the helper will fill
it completely, leaving possibilities for data leak (just relevant when
the memory used by the helper is the stack, not when using a pointer to
map element value or packet). In other words, ARG_PTR_TO_RAW_STACK will
be treated as ARG_PTR_TO_STACK.
Signed-off-by: Gianluca Borello <g.borello@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 484611357c ("bpf: allow access into map value arrays")
introduces the ability to do pointer math inside a map element value via
the PTR_TO_MAP_VALUE_ADJ register type.
The current support doesn't handle the case where a PTR_TO_MAP_VALUE_ADJ
is spilled into the stack, limiting several use cases, especially when
generating bpf code from a compiler.
Handle this case by explicitly enabling the register type
PTR_TO_MAP_VALUE_ADJ to be spilled. Also, make sure that min_value and
max_value are reset just for BPF_LDX operations that don't result in a
restore of a spilled register from stack.
Signed-off-by: Gianluca Borello <g.borello@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Enable helpers to directly access a map element value by passing a
register type PTR_TO_MAP_VALUE (or PTR_TO_MAP_VALUE_ADJ) to helper
arguments ARG_PTR_TO_STACK or ARG_PTR_TO_RAW_STACK.
This enables several use cases. For example, a typical tracing program
might want to capture pathnames passed to sys_open() with:
struct trace_data {
char pathname[PATHLEN];
};
SEC("kprobe/sys_open")
void bpf_sys_open(struct pt_regs *ctx)
{
struct trace_data data;
bpf_probe_read(data.pathname, sizeof(data.pathname), ctx->di);
/* consume data.pathname, for example via
* bpf_trace_printk() or bpf_perf_event_output()
*/
}
Such a program could easily hit the stack limit in case PATHLEN needs to
be large or more local variables need to exist, both of which are quite
common scenarios. Allowing direct helper access to map element values,
one could do:
struct bpf_map_def SEC("maps") scratch_map = {
.type = BPF_MAP_TYPE_PERCPU_ARRAY,
.key_size = sizeof(u32),
.value_size = sizeof(struct trace_data),
.max_entries = 1,
};
SEC("kprobe/sys_open")
int bpf_sys_open(struct pt_regs *ctx)
{
int id = 0;
struct trace_data *p = bpf_map_lookup_elem(&scratch_map, &id);
if (!p)
return;
bpf_probe_read(p->pathname, sizeof(p->pathname), ctx->di);
/* consume p->pathname, for example via
* bpf_trace_printk() or bpf_perf_event_output()
*/
}
And wouldn't risk exhausting the stack.
Code changes are loosely modeled after commit 6841de8b0d ("bpf: allow
helpers access the packet directly"). Unlike with PTR_TO_PACKET, these
changes just work with ARG_PTR_TO_STACK and ARG_PTR_TO_RAW_STACK (not
ARG_PTR_TO_MAP_KEY, ARG_PTR_TO_MAP_VALUE, ...): adding those would be
trivial, but since there is not currently a use case for that, it's
reasonable to limit the set of changes.
Also, add new tests to make sure accesses to map element values from
helpers never go out of boundary, even when adjusted.
Signed-off-by: Gianluca Borello <g.borello@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the logic to check memory accesses to a PTR_TO_MAP_VALUE_ADJ from
check_mem_access() to a separate helper check_map_access_adj(). This
enables to use those checks in other parts of the verifier as well,
where boundaries on PTR_TO_MAP_VALUE_ADJ might need to be checked, for
example when checking helper function arguments. The same thing is
already happening for other types such as PTR_TO_PACKET and its
check_packet_access() helper.
The code has been copied verbatim, with the only difference of removing
the "off += reg->max_value" statement and moving the sum into the call
statement to check_map_access(), as that was only needed due to the
earlier common check_map_access() call.
Signed-off-by: Gianluca Borello <g.borello@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull audit fixes from Paul Moore:
"Two small fixes relating to audit's use of fsnotify.
The first patch plugs a leak and the second fixes some lock
shenanigans. The patches are small and I banged on this for an
afternoon with our testsuite and didn't see anything odd"
* 'stable-4.10' of git://git.infradead.org/users/pcmoore/audit:
audit: Fix sleep in atomic
fsnotify: Remove fsnotify_duplicate_mark()
Audit tree code was happily adding new notification marks while holding
spinlocks. Since fsnotify_add_mark() acquires group->mark_mutex this can
lead to sleeping while holding a spinlock, deadlocks due to lock
inversion, and probably other fun. Fix the problem by acquiring
group->mark_mutex earlier.
CC: Paul Moore <paul@paul-moore.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paul Moore <paul@paul-moore.com>
The attempt to prevent overwriting an active state resulted in a
disaster which effectively disables all dynamically allocated hotplug
states.
Cleanup the mess.
Fixes: dc280d9362 ("cpu/hotplug: Prevent overwriting of callbacks")
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull timer type cleanups from Thomas Gleixner:
"This series does a tree wide cleanup of types related to
timers/timekeeping.
- Get rid of cycles_t and use a plain u64. The type is not really
helpful and caused more confusion than clarity
- Get rid of the ktime union. The union has become useless as we use
the scalar nanoseconds storage unconditionally now. The 32bit
timespec alike storage got removed due to the Y2038 limitations
some time ago.
That leaves the odd union access around for no reason. Clean it up.
Both changes have been done with coccinelle and a small amount of
manual mopping up"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
ktime: Get rid of ktime_equal()
ktime: Cleanup ktime_set() usage
ktime: Get rid of the union
clocksource: Use a plain u64 instead of cycle_t
Pull SMP hotplug notifier removal from Thomas Gleixner:
"This is the final cleanup of the hotplug notifier infrastructure. The
series has been reintgrated in the last two days because there came a
new driver using the old infrastructure via the SCSI tree.
Summary:
- convert the last leftover drivers utilizing notifiers
- fixup for a completely broken hotplug user
- prevent setup of already used states
- removal of the notifiers
- treewide cleanup of hotplug state names
- consolidation of state space
There is a sphinx based documentation pending, but that needs review
from the documentation folks"
* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/armada-xp: Consolidate hotplug state space
irqchip/gic: Consolidate hotplug state space
coresight/etm3/4x: Consolidate hotplug state space
cpu/hotplug: Cleanup state names
cpu/hotplug: Remove obsolete cpu hotplug register/unregister functions
staging/lustre/libcfs: Convert to hotplug state machine
scsi/bnx2i: Convert to hotplug state machine
scsi/bnx2fc: Convert to hotplug state machine
cpu/hotplug: Prevent overwriting of callbacks
x86/msr: Remove bogus cleanup from the error path
bus: arm-ccn: Prevent hotplug callback leak
perf/x86/intel/cstate: Prevent hotplug callback leak
ARM/imx/mmcd: Fix broken cpu hotplug handling
scsi: qedi: Convert to hotplug state machine
ktime_set(S,N) was required for the timespec storage type and is still
useful for situations where a Seconds and Nanoseconds part of a time value
needs to be converted. For anything where the Seconds argument is 0, this
is pointless and can be replaced with a simple assignment.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
ktime is a union because the initial implementation stored the time in
scalar nanoseconds on 64 bit machine and in a endianess optimized timespec
variant for 32bit machines. The Y2038 cleanup removed the timespec variant
and switched everything to scalar nanoseconds. The union remained, but
become completely pointless.
Get rid of the union and just keep ktime_t as simple typedef of type s64.
The conversion was done with coccinelle and some manual mopping up.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
There is no point in having an extra type for extra confusion. u64 is
unambiguous.
Conversion was done with the following coccinelle script:
@rem@
@@
-typedef u64 cycle_t;
@fix@
typedef cycle_t;
@@
-cycle_t
+u64
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
hotcpu_notifier(), cpu_notifier(), __hotcpu_notifier(), __cpu_notifier(),
register_hotcpu_notifier(), register_cpu_notifier(),
__register_hotcpu_notifier(), __register_cpu_notifier(),
unregister_hotcpu_notifier(), unregister_cpu_notifier(),
__unregister_hotcpu_notifier(), __unregister_cpu_notifier()
are unused now. Remove them and all related code.
Remove also the now pointless cpu notifier error injection mechanism. The
states can be executed step by step and error rollback is the same as cpu
down, so any state transition can be tested w/o requiring the notifier
error injection.
Some CPU hotplug states are kept as they are (ab)used for hotplug state
tracking.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20161221192112.005642358@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Developers manage to overwrite states blindly without thought. That's fatal
and hard to debug. Add sanity checks to make it fail.
This requries to restructure the code so that the dynamic state allocation
happens in the same lock protected section as the actual store. Otherwise
the previous assignment of 'Reserved' to the name field would trigger the
overwrite check.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Link: http://lkml.kernel.org/r/20161221192111.675234535@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This was entirely automated, using the script by Al:
PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)
to do the replacement at the end of the merge window.
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull perf fixes from Ingo Molnar:
"On the kernel side there's two x86 PMU driver fixes and a uprobes fix,
plus on the tooling side there's a number of fixes and some late
updates"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
perf sched timehist: Fix invalid period calculation
perf sched timehist: Remove hardcoded 'comm_width' check at print_summary
perf sched timehist: Enlarge default 'comm_width'
perf sched timehist: Honour 'comm_width' when aligning the headers
perf/x86: Fix overlap counter scheduling bug
perf/x86/pebs: Fix handling of PEBS buffer overflows
samples/bpf: Move open_raw_sock to separate header
samples/bpf: Remove perf_event_open() declaration
samples/bpf: Be consistent with bpf_load_program bpf_insn parameter
tools lib bpf: Add bpf_prog_{attach,detach}
samples/bpf: Switch over to libbpf
perf diff: Do not overwrite valid build id
perf annotate: Don't throw error for zero length symbols
perf bench futex: Fix lock-pi help string
perf trace: Check if MAP_32BIT is defined (again)
samples/bpf: Make perf_event_read() static
uprobes: Fix uprobes on MIPS, allow for a cache flush after ixol breakpoint creation
samples/bpf: Make samples more libbpf-centric
tools lib bpf: Add flags to bpf_create_map()
tools lib bpf: use __u32 from linux/types.h
...
There are only two calls sites of fsnotify_duplicate_mark(). Those are
in kernel/audit_tree.c and both are bogus. Vfsmount pointer is unused
for audit tree, inode pointer and group gets set in
fsnotify_add_mark_locked() later anyway, mask and free_mark are already
set in alloc_chunk(). In fact, calling fsnotify_duplicate_mark() is
actively harmful because following fsnotify_add_mark_locked() will leak
group reference by overwriting the group pointer. So just remove the two
calls to fsnotify_duplicate_mark() and the function.
Signed-off-by: Jan Kara <jack@suse.cz>
[PM: line wrapping to fit in 80 chars]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Pull final vfs updates from Al Viro:
"Assorted cleanups and fixes all over the place"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
sg_write()/bsg_write() is not fit to be called under KERNEL_DS
ufs: fix function declaration for ufs_truncate_blocks
fs: exec: apply CLOEXEC before changing dumpable task flags
seq_file: reset iterator to first record for zero offset
vfs: fix isize/pos/len checks for reflink & dedupe
[iov_iter] fix iterate_all_kinds() on empty iterators
move aio compat to fs/aio.c
reorganize do_make_slave()
clone_private_mount() doesn't need to touch namespace_sem
remove a bogus claim about namespace_sem being held by callers of mnt_alloc_id()
... and fix the minor buglet in compat io_submit() - native one
kills ioctx as cleanup when put_user() fails. Get rid of
bogus compat_... in !CONFIG_AIO case, while we are at it - they
should simply fail with ENOSYS, same as for native counterparts.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When ivoked with CPUHP_AP_ONLINE_DYN state __cpuhp_setup_state()
is expected to return positive value which is the hotplug state that
the routine assigns.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The TPM PCRs are only reset on a hard reboot. In order to validate a
TPM's quote after a soft reboot (eg. kexec -e), the IMA measurement
list of the running kernel must be saved and restored on boot.
This patch uses the kexec buffer passing mechanism to pass the
serialized IMA binary_runtime_measurements to the next kernel.
Link: http://lkml.kernel.org/r/1480554346-29071-7-git-send-email-zohar@linux.vnet.ibm.com
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Dmitry Kasatkin <dmitry.kasatkin@gmail.com>
Cc: Andreas Steffen <andreas.steffen@strongswan.org>
Cc: Josh Sklar <sklar@linux.vnet.ibm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull timer fix from Thomas Gleixner:
"Prevent NULL pointer dereferencing in the tick broadcast code. Old
bug, which got unearthed by the hotplug ordering problem"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick/broadcast: Prevent NULL pointer dereference
Pull SMP hotplug fixes from Thomas Gleixner:
"Two fixlets for cpu hotplug:
- Fix a subtle ordering problem with the dummy timer. This happened
to work before the conversion by chance due to initcall ordering.
- Fix the function comment for __cpuhp_setup_state()"
* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Clarify description of __cpuhp_setup_state() return value
clocksource/dummy_timer: Move hotplug callback after the real timers
Pull irq fix from Thomas Gleixner:
"A fix for the irq affinity spread algorithm so it handles non linear
node numbering nicely"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/affinity: Fix node generation from cpumask
Commit:
72e6ae285a ('ARM: 8043/1: uprobes need icache flush after xol write'
... has introduced an arch-specific method to ensure all caches are
flushed appropriately after an instruction is written to an XOL page.
However, when the XOL area is created and the out-of-line breakpoint
instruction is copied, caches are not flushed at all and stale data may
be found in icache.
Replace a simple copy_to_page() with arch_uprobe_copy_ixol() to allow
the arch to ensure all caches are updated accordingly.
This change fixes uprobes on MIPS InterAptiv (tested on Creator Ci40).
Signed-off-by: Marcin Nowakowski <marcin.nowakowski@imgtec.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Victor Kamensky <victor.kamensky@linaro.org>
Cc: linux-mips@linux-mips.org
Link: http://lkml.kernel.org/r/1481625657-22850-1-git-send-email-marcin.nowakowski@imgtec.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull networking fixes and cleanups from David Miller:
1) Revert bogus nla_ok() change, from Alexey Dobriyan.
2) Various bpf validator fixes from Daniel Borkmann.
3) Add some necessary SET_NETDEV_DEV() calls to hsis_femac and hip04
drivers, from Dongpo Li.
4) Several ethtool ksettings conversions from Philippe Reynes.
5) Fix bugs in inet port management wrt. soreuseport, from Tom Herbert.
6) XDP support for virtio_net, from John Fastabend.
7) Fix NAT handling within a vrf, from David Ahern.
8) Endianness fixes in dpaa_eth driver, from Claudiu Manoil
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (63 commits)
net: mv643xx_eth: fix build failure
isdn: Constify some function parameters
mlxsw: spectrum: Mark split ports as such
cgroup: Fix CGROUP_BPF config
qed: fix old-style function definition
net: ipv6: check route protocol when deleting routes
r6040: move spinlock in r6040_close as SOFTIRQ-unsafe lock order detected
irda: w83977af_ir: cleanup an indent issue
net: sfc: use new api ethtool_{get|set}_link_ksettings
net: davicom: dm9000: use new api ethtool_{get|set}_link_ksettings
net: cirrus: ep93xx: use new api ethtool_{get|set}_link_ksettings
net: chelsio: cxgb3: use new api ethtool_{get|set}_link_ksettings
net: chelsio: cxgb2: use new api ethtool_{get|set}_link_ksettings
bpf: fix mark_reg_unknown_value for spilled regs on map value marking
bpf: fix overflow in prog accounting
bpf: dynamically allocate digest scratch buffer
gtp: Fix initialization of Flags octet in GTPv1 header
gtp: gtp_check_src_ms_ipv4() always return success
net/x25: use designated initializers
isdn: use designated initializers
...
Pull more vfs updates from Al Viro:
"In this pile:
- autofs-namespace series
- dedupe stuff
- more struct path constification"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features
ocfs2: charge quota for reflinked blocks
ocfs2: fix bad pointer cast
ocfs2: always unlock when completing dio writes
ocfs2: don't eat io errors during _dio_end_io_write
ocfs2: budget for extent tree splits when adding refcount flag
ocfs2: prohibit refcounted swapfiles
ocfs2: add newlines to some error messages
ocfs2: convert inode refcount test to a helper
simple_write_end(): don't zero in short copy into uptodate
exofs: don't mess with simple_write_{begin,end}
9p: saner ->write_end() on failing copy into non-uptodate page
fix gfs2_stuffed_write_end() on short copies
fix ceph_write_end()
nfs_write_end(): fix handling of short copies
vfs: refactor clone/dedupe_file_range common functions
fs: try to clone files first in vfs_copy_file_range
vfs: misc struct path constification
namespace.c: constify struct path passed to a bunch of primitives
quota: constify struct path in quota_on
...
Martin reported a verifier issue that hit the BUG_ON() for his
test case in the mark_reg_unknown_value() function:
[ 202.861380] kernel BUG at kernel/bpf/verifier.c:467!
[...]
[ 203.291109] Call Trace:
[ 203.296501] [<ffffffff811364d5>] mark_map_reg+0x45/0x50
[ 203.308225] [<ffffffff81136558>] mark_map_regs+0x78/0x90
[ 203.320140] [<ffffffff8113938d>] do_check+0x226d/0x2c90
[ 203.331865] [<ffffffff8113a6ab>] bpf_check+0x48b/0x780
[ 203.343403] [<ffffffff81134c8e>] bpf_prog_load+0x27e/0x440
[ 203.355705] [<ffffffff8118a38f>] ? handle_mm_fault+0x11af/0x1230
[ 203.369158] [<ffffffff812d8188>] ? security_capable+0x48/0x60
[ 203.382035] [<ffffffff811351a4>] SyS_bpf+0x124/0x960
[ 203.393185] [<ffffffff810515f6>] ? __do_page_fault+0x276/0x490
[ 203.406258] [<ffffffff816db320>] entry_SYSCALL_64_fastpath+0x13/0x94
This issue got uncovered after the fix in a08dd0da53 ("bpf: fix
regression on verifier pruning wrt map lookups"). The reason why it
wasn't noticed before was, because as mentioned in a08dd0da53,
mark_map_regs() was doing the id matching incorrectly based on the
uncached regs[regno].id. So, in the first loop, we walked all regs
and as soon as we found regno == i, then this reg's id was cleared
when calling mark_reg_unknown_value() thus that every subsequent
register was probed against id of 0 (which, in combination with the
PTR_TO_MAP_VALUE_OR_NULL type is an invalid condition that no other
register state can hold), and therefore wasn't type transitioned such
as in the spilled register case for the second loop.
Now since that got fixed, it turned out that 57a09bf0a4 ("bpf:
Detect identical PTR_TO_MAP_VALUE_OR_NULL registers") used
mark_reg_unknown_value() incorrectly for the spilled regs, and thus
hitting the BUG_ON() in some cases due to regno >= MAX_BPF_REG.
Although spilled regs have the same type as the non-spilled regs
for the verifier state, that is, struct bpf_reg_state, they are
semantically different from the non-spilled regs. In other words,
there can be up to 64 (MAX_BPF_STACK / BPF_REG_SIZE) spilled regs
in the stack, for example, register R<x> could have been spilled by
the program to stack location X, Y, Z, and in mark_map_regs() we
need to scan these stack slots of type STACK_SPILL for potential
registers that we have to transition from PTR_TO_MAP_VALUE_OR_NULL.
Therefore, depending on the location, the spilled_regs regno can
be a lot higher than just MAX_BPF_REG's value since we operate on
stack instead. The reset in mark_reg_unknown_value() itself is
just fine, only that the BUG_ON() was inappropriate for this. Fix
it by making a __mark_reg_unknown_value() version that can be
called from mark_map_reg() generically; we know for the non-spilled
case that the regno is always < MAX_BPF_REG anyway.
Fixes: 57a09bf0a4 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
Reported-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit aaac3ba95e ("bpf: charge user for creation of BPF maps and
programs") made a wrong assumption of charging against prog->pages.
Unlike map->pages, prog->pages are still subject to change when we
need to expand the program through bpf_prog_realloc().
This can for example happen during verification stage when we need to
expand and rewrite parts of the program. Should the required space
cross a page boundary, then prog->pages is not the same anymore as
its original value that we used to bpf_prog_charge_memlock() on. Thus,
we'll hit a wrap-around during bpf_prog_uncharge_memlock() when prog
is freed eventually. I noticed this that despite having unlimited
memlock, programs suddenly refused to load with EPERM error due to
insufficient memlock.
There are two ways to fix this issue. One would be to add a cached
variable to struct bpf_prog that takes a snapshot of prog->pages at the
time of charging. The other approach is to also account for resizes. I
chose to go with the latter for a couple of reasons: i) We want accounting
rather to be more accurate instead of further fooling limits, ii) adding
yet another page counter on struct bpf_prog would also be a waste just
for this purpose. We also do want to charge as early as possible to
avoid going into the verifier just to find out later on that we crossed
limits. The only place that needs to be fixed is bpf_prog_realloc(),
since only here we expand the program, so we try to account for the
needed delta and should we fail, call-sites check for outcome anyway.
On cBPF to eBPF migrations, we don't grab a reference to the user as
they are charged differently. With that in place, my test case worked
fine.
Fixes: aaac3ba95e ("bpf: charge user for creation of BPF maps and programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Geert rightfully complained that 7bd509e311 ("bpf: add prog_digest
and expose it via fdinfo/netlink") added a too large allocation of
variable 'raw' from bss section, and should instead be done dynamically:
# ./scripts/bloat-o-meter kernel/bpf/core.o.1 kernel/bpf/core.o.2
add/remove: 3/0 grow/shrink: 0/0 up/down: 33291/0 (33291)
function old new delta
raw - 32832 +32832
[...]
Since this is only relevant during program creation path, which can be
considered slow-path anyway, lets allocate that dynamically and be not
implicitly dependent on verifier mutex. Move bpf_prog_calc_digest() at
the beginning of replace_map_fd_with_map_ptr() and also error handling
stays straight forward.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 57a09bf0a4 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL
registers") introduced a regression where existing programs stopped
loading due to reaching the verifier's maximum complexity limit,
whereas prior to this commit they were loading just fine; the affected
program has roughly 2k instructions.
What was found is that state pruning couldn't be performed effectively
anymore due to mismatches of the verifier's register state, in particular
in the id tracking. It doesn't mean that 57a09bf0a4 is incorrect per
se, but rather that verifier needs to perform a lot more work for the
same program with regards to involved map lookups.
Since commit 57a09bf0a4 is only about tracking registers with type
PTR_TO_MAP_VALUE_OR_NULL, the id is only needed to follow registers
until they are promoted through pattern matching with a NULL check to
either PTR_TO_MAP_VALUE or UNKNOWN_VALUE type. After that point, the
id becomes irrelevant for the transitioned types.
For UNKNOWN_VALUE, id is already reset to 0 via mark_reg_unknown_value(),
but not so for PTR_TO_MAP_VALUE where id is becoming stale. It's even
transferred further into other types that don't make use of it. Among
others, one example is where UNKNOWN_VALUE is set on function call
return with RET_INTEGER return type.
states_equal() will then fall through the memcmp() on register state;
note that the second memcmp() uses offsetofend(), so the id is part of
that since d2a4dd37f6 ("bpf: fix state equivalence"). But the bisect
pointed already to 57a09bf0a4, where we really reach beyond complexity
limit. What I found was that states_equal() often failed in this
case due to id mismatches in spilled regs with registers in type
PTR_TO_MAP_VALUE. Unlike non-spilled regs, spilled regs just perform
a memcmp() on their reg state and don't have any other optimizations
in place, therefore also id was relevant in this case for making a
pruning decision.
We can safely reset id to 0 as well when converting to PTR_TO_MAP_VALUE.
For the affected program, it resulted in a ~17 fold reduction of
complexity and let the program load fine again. Selftest suite also
runs fine. The only other place where env->id_gen is used currently is
through direct packet access, but for these cases id is long living, thus
a different scenario.
Also, the current logic in mark_map_regs() is not fully correct when
marking NULL branch with UNKNOWN_VALUE. We need to cache the destination
reg's id in any case. Otherwise, once we marked that reg as UNKNOWN_VALUE,
it's id is reset and any subsequent registers that hold the original id
and are of type PTR_TO_MAP_VALUE_OR_NULL won't be marked UNKNOWN_VALUE
anymore, since mark_map_reg() reuses the uncached regs[regno].id that
was just overridden. Note, we don't need to cache it outside of
mark_map_regs(), since it's called once on this_branch and the other
time on other_branch, which are both two independent verifier states.
A test case for this is added here, too.
Fixes: 57a09bf0a4 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull vfs updates from Al Viro:
- more ->d_init() stuff (work.dcache)
- pathname resolution cleanups (work.namei)
- a few missing iov_iter primitives - copy_from_iter_full() and
friends. Either copy the full requested amount, advance the iterator
and return true, or fail, return false and do _not_ advance the
iterator. Quite a few open-coded callers converted (and became more
readable and harder to fuck up that way) (work.iov_iter)
- several assorted patches, the big one being logfs removal
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
logfs: remove from tree
vfs: fix put_compat_statfs64() does not handle errors
namei: fold should_follow_link() with the step into not-followed link
namei: pass both WALK_GET and WALK_MORE to should_follow_link()
namei: invert WALK_PUT logics
namei: shift interpretation of LOOKUP_FOLLOW inside should_follow_link()
namei: saner calling conventions for mountpoint_last()
namei.c: get rid of user_path_parent()
switch getfrag callbacks to ..._full() primitives
make skb_add_data,{_nocache}() and skb_copy_to_page_nocache() advance only on success
[iov_iter] new primitives - copy_from_iter_full() and friends
don't open-code file_inode()
ceph: switch to use of ->d_init()
ceph: unify dentry_operations instances
lustre: switch to use of ->d_init()
Highlights include:
- Support for the kexec_file_load() syscall, which is a prereq for secure and
trusted boot.
- Prevent kernel execution of userspace on P9 Radix (similar to SMEP/PXN).
- Sort the exception tables at build time, to save time at boot, and store
them as relative offsets to save space in the kernel image & memory.
- Allow building the kernel with thin archives, which should allow us to build
an allyesconfig once some other fixes land.
- Build fixes to allow us to correctly rebuild when changing the kernel endian
from big to little or vice versa.
- Plumbing so that we can avoid doing a full mm TLB flush on P9 Radix.
- Initial stack protector support (-fstack-protector).
- Support for dumping the radix (aka. Linux) and hash page tables via debugfs.
- Fix an oops in cxl coredump generation when cxl_get_fd() is used.
- Freescale updates from Scott: "Highlights include 8xx hugepage support,
qbman fixes/cleanup, device tree updates, and some misc cleanup."
- Many and varied fixes and minor enhancements as always.
Thanks to:
Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Anshuman Khandual,
Anton Blanchard, Balbir Singh, Bartlomiej Zolnierkiewicz, Christophe Jaillet,
Christophe Leroy, Denis Kirjanov, Elimar Riesebieter, Frederic Barrat,
Gautham R. Shenoy, Geliang Tang, Geoff Levand, Jack Miller, Johan Hovold,
Lars-Peter Clausen, Libin, Madhavan Srinivasan, Michael Neuling, Nathan
Fontenot, Naveen N. Rao, Nicholas Piggin, Pan Xinhui, Peter Senna Tschudin,
Rashmica Gupta, Rui Teng, Russell Currey, Scott Wood, Simon Guo, Suraj
Jitindar Singh, Thiago Jung Bauermann, Tobias Klauser, Vaibhav Jain.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJYU4YSAAoJEFHr6jzI4aWAC4gQALtIAqqPon0Cd5b/FVVcMbW7
mMqB2b/0FGEl5GoRTzGUDaQqElilm6AEVfHO86C7DFji/a6olneFfw87iz+mtWuZ
JvrNq68ZiSnoeszdUy4MgtXFLb5sTzNMev4skaHfjI9E5CepWBoR0zH4G+kNVnd5
WSgudv8Cq4Px+MEuTOigt3QYjHzZ3cw/XNOOm9c+oGj+PDW4O9UItVI+S1WLoey4
rAB2nRcLMDPuwfRQC9XsF3zEbkv4h1dEXo/EBRuRpcF+0lLTzFw1lv1WE8OxlUmS
kAXbty3dIytBfSbtJT0c0Ps6sfQ4HFhu6ZV2fjnxNTz2KDkBIN7LBYHmBYiqY9oZ
9zvbUWtfiTu5ocfRtTq7rC/Hcj4Kbr9S9F/FvXR0WyDsKgu4xxAovqC3gcn6YjYK
Rr1tcCI4nUzyhVJVmd+OEhUvc5JbFy9aGage+YeOyejfvvSbXIunaxWlPjoDkvim
Vjl+UKU8gw51XFssqY5ZBi/HNlMFKYedLpMFp/fItnLglhj50V0eFWkpDgdSCYom
vo9ifPLZx8n8m8De3H7TV4E0F4gCHcTeqZdu7tW9AAUVM6iLJcDLm3asGmtNh21t
snOHNOJ5QSIno6ezUUg29T6VBjbPh46fdJJSlIZrEe8OzLZ1haGyttf0tD00PQvY
Z2W/m3gxafnOeGgBqvyv
=xOzf
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Highlights include:
- Support for the kexec_file_load() syscall, which is a prereq for
secure and trusted boot.
- Prevent kernel execution of userspace on P9 Radix (similar to
SMEP/PXN).
- Sort the exception tables at build time, to save time at boot, and
store them as relative offsets to save space in the kernel image &
memory.
- Allow building the kernel with thin archives, which should allow us
to build an allyesconfig once some other fixes land.
- Build fixes to allow us to correctly rebuild when changing the
kernel endian from big to little or vice versa.
- Plumbing so that we can avoid doing a full mm TLB flush on P9
Radix.
- Initial stack protector support (-fstack-protector).
- Support for dumping the radix (aka. Linux) and hash page tables via
debugfs.
- Fix an oops in cxl coredump generation when cxl_get_fd() is used.
- Freescale updates from Scott: "Highlights include 8xx hugepage
support, qbman fixes/cleanup, device tree updates, and some misc
cleanup."
- Many and varied fixes and minor enhancements as always.
Thanks to:
Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Anshuman
Khandual, Anton Blanchard, Balbir Singh, Bartlomiej Zolnierkiewicz,
Christophe Jaillet, Christophe Leroy, Denis Kirjanov, Elimar
Riesebieter, Frederic Barrat, Gautham R. Shenoy, Geliang Tang, Geoff
Levand, Jack Miller, Johan Hovold, Lars-Peter Clausen, Libin,
Madhavan Srinivasan, Michael Neuling, Nathan Fontenot, Naveen N.
Rao, Nicholas Piggin, Pan Xinhui, Peter Senna Tschudin, Rashmica
Gupta, Rui Teng, Russell Currey, Scott Wood, Simon Guo, Suraj
Jitindar Singh, Thiago Jung Bauermann, Tobias Klauser, Vaibhav Jain"
[ And thanks to Michael, who took time off from a new baby to get this
pull request done. - Linus ]
* tag 'powerpc-4.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (174 commits)
powerpc/fsl/dts: add FMan node for t1042d4rdb
powerpc/fsl/dts: add sg_2500_aqr105_phy4 alias on t1024rdb
powerpc/fsl/dts: add QMan and BMan nodes on t1024
powerpc/fsl/dts: add QMan and BMan nodes on t1023
soc/fsl/qman: test: use DEFINE_SPINLOCK()
powerpc/fsl-lbc: use DEFINE_SPINLOCK()
powerpc/8xx: Implement support of hugepages
powerpc: get hugetlbpage handling more generic
powerpc: port 64 bits pgtable_cache to 32 bits
powerpc/boot: Request no dynamic linker for boot wrapper
soc/fsl/bman: Use resource_size instead of computation
soc/fsl/qe: use builtin_platform_driver
powerpc/fsl_pmc: use builtin_platform_driver
powerpc/83xx/suspend: use builtin_platform_driver
powerpc/ftrace: Fix the comments for ftrace_modify_code
powerpc/perf: macros for power9 format encoding
powerpc/perf: power9 raw event format encoding
powerpc/perf: update attribute_group data structure
powerpc/perf: factor out the event format field
powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown
...
o STM can hook into the function tracer
o Function filtering now supports more advance glob matching
o Ftrace selftests updates and added tests
o Softirq tag in traces now show only softirqs
o ARM nop added to non traced locations at compile time
o New trace_marker_raw file that allows for binary input
o Optimizations to the ring buffer
o Removal of kmap in trace_marker
o Wakeup and irqsoff tracers now adhere to the set_graph_notrace file
o Other various fixes and clean ups
Note, there are two patches marked for stable. These were discovered
near the end of the 4.9 rc release cycle. By the time I had them tested
it was just a matter of days before 4.9 would be released, and I
figured I would just submit them in the merge window. They are old
bugs and not critical. Nothing non-root could abuse.
-----BEGIN PGP SIGNATURE-----
iQExBAABCAAbBQJYUrFHFBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
2+AIAIr20kSQV/nA5htGAeCTobVk3WUxY6bvjd9mIJDKPP19akNLyREW0G3KnfCr
yhx4aFRZG98fRu/6F8qieRosyN36lADDVYHelMFHMpcTOpE2aZGjaaOuNGxOEA9v
FmMPTX+K3+dzKyFP4l68R3+5JuQ1/AqLTioTWeLW8IDQ2OOVsjD8+0BuXrNKMJDY
o6U4Hk5U/vn+zHc6BmgBzloAXemBd7iJ1t5V3FRRGvm8yv3HU85Twc5ofGeYTWvB
J8PboEywRlIzxg0Kd8mxnMI5PgaKZSEc2ub8E7cY/CZ5PYpDE2xDA2hJmJgfYp00
1VW+DHRpRZfElsCcya6S6P4bs5Y=
=MGZ/
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"This release has a few updates:
- STM can hook into the function tracer
- Function filtering now supports more advance glob matching
- Ftrace selftests updates and added tests
- Softirq tag in traces now show only softirqs
- ARM nop added to non traced locations at compile time
- New trace_marker_raw file that allows for binary input
- Optimizations to the ring buffer
- Removal of kmap in trace_marker
- Wakeup and irqsoff tracers now adhere to the set_graph_notrace file
- Other various fixes and clean ups"
* tag 'trace-v4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (42 commits)
selftests: ftrace: Shift down default message verbosity
kprobes/trace: Fix kprobe selftest for newer gcc
tracing/kprobes: Add a helper method to return number of probe hits
tracing/rb: Init the CPU mask on allocation
tracing: Use SOFTIRQ_OFFSET for softirq dectection for more accurate results
tracing/fgraph: Have wakeup and irqsoff tracers ignore graph functions too
fgraph: Handle a case where a tracer ignores set_graph_notrace
tracing: Replace kmap with copy_from_user() in trace_marker writing
ftrace/x86_32: Set ftrace_stub to weak to prevent gcc from using short jumps to it
tracing: Allow benchmark to be enabled at early_initcall()
tracing: Have system enable return error if one of the events fail
tracing: Do not start benchmark on boot up
tracing: Have the reg function allow to fail
ring-buffer: Force rb_end_commit() and rb_set_commit_to_write() inline
ring-buffer: Froce rb_update_write_stamp() to be inlined
ring-buffer: Force inline of hotpath helper functions
tracing: Make __buffer_unlock_commit() always_inline
tracing: Make tracepoint_printk a static_key
ring-buffer: Always inline rb_event_data()
ring-buffer: Make rb_reserve_next_event() always inlined
...
If CONFIG_PRINTK=n:
kernel/printk/printk.c:1893: warning: ‘cont’ defined but not used
Note that there are actually two different struct cont definitions and
objects: the first one is used if CONFIG_PRINTK=y, the second one became
unused by removing console_cont_flush().
Fixes: 5c2992ee7f ("printk: remove console flushing special cases for partial buffered lines")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Petr Mladek <pmladek@suse.com>
[ I do the occasional "allnoconfig" builds, but apparently not often
enough - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 34c3d9819f ("genirq/affinity: Provide smarter irq spreading
infrastructure") introduced a better IRQ spreading mechanism, taking
account of the available NUMA nodes in the machine.
Problem is that the algorithm of retrieving the nodemask iterates
"linearly" based on the number of online nodes - some architectures
present non-linear node distribution among the nodemask, like PowerPC.
If this is the case, the algorithm lead to a wrong node count number
and therefore to a bad/incomplete IRQ affinity distribution.
For example, this problem were found in a machine with 128 CPUs and two
nodes, namely nodes 0 and 8 (instead of 0 and 1, if it was linearly
distributed). This led to a wrong affinity distribution which then led to
a bad mq allocation for nvme driver.
Finally, we take the opportunity to fix a comment regarding the affinity
distribution when we have _more_ nodes than vectors.
Fixes: 34c3d9819f ("genirq/affinity: Provide smarter irq spreading infrastructure")
Reported-by: Gabriel Krisman Bertazi <gabriel@krisman.be>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Gabriel Krisman Bertazi <gabriel@krisman.be>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Cc: linux-pci@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: hch@lst.de
Link: http://lkml.kernel.org/r/1481738472-2671-1-git-send-email-gpiccoli@linux.vnet.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When a disfunctional timer, e.g. dummy timer, is installed, the tick core
tries to setup the broadcast timer.
If no broadcast device is installed, the kernel crashes with a NULL pointer
dereference in tick_broadcast_setup_oneshot() because the function has no
sanity check.
Reported-by: Mason <slash.tmp@free.fr>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Richard Cochran <rcochran@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>,
Cc: Sebastian Frias <sf84@laposte.net>
Cc: Thibaud Cornic <thibaud_cornic@sigmadesigns.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Link: http://lkml.kernel.org/r/1147ef90-7877-e4d2-bb2b-5c4fa8d3144b@free.fr
Contained in this update:
- DAX PMD vaults via iomap infrastructure
- Direct-io support in iomap infrastructure
- removal of now-redundant XFS inode iolock, replaced with VFS i_rwsem
- synchronisation with fixes and changes in userspace libxfs code
- extent tree lookup helpers
- lots of little corruption detection improvements to verifiers
- optimised CRC calculations
- faster buffer cache lookups
- deprecation of barrier/nobarrier mount options - we always use
REQ_FUA/REQ_FLUSH where appropriate for data integrity now
- cleanups to speculative preallocation
- miscellaneous minor bug fixes and cleanups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJYUgqdAAoJEK3oKUf0dfodQgsP/1dJ4qUc6cRk8kL+f10FoIek
oFzdViRHZj8cROGe2n2YTBJtPa9KjU5DNHnxaxWZBN4ZpItp/uN1sAQhgtNQ4/cN
C3JF6B/+/dIbNSbd7DwvSl0dMWknzmrB+Myfs2ZPpMA1S4GInk1MOJSj7AQdYAvJ
dS0dQWAuIB20cahwuGA4y7zUniYL1IcF/BH8hlmzpcUNUoJ9AkR1hTg5/aVfmga3
w2p1vZyT2E4xs/Ff4FYW5MzPGxLVQMZVNIAXAcJl+c61z46ndXqidSmVHGvc+Tlt
ouxftHy/7KqowZlCFss1pSXg9HlXHhjS+iLbZerfcjO2qldriZS+QqQyASmQzPAz
+PpnMfVOj+yjsXKyIHWuS1G35aV16pPWwdA0ECeU6yv9iZ7tSz5rvSrsPZPLFz4x
RVhcKbmXR3y8DugkmtznU5ozxPt5hbbstEV3leCzxJpZu5reRJThUW7nYkSd0CEJ
ZyT/GP6Aq/MM8O/hOgVutAH409dsrYok8m/lq1J7VbNUt8inylcsMWsBeX/0/AHY
aC7I2Vx8bnbfL+C8wYKYhuShOGSch93O5hDUXdH2K/Sm5cK4y2asWge6MfFsS6Lu
waVYwd5aYBlNbzkvUMm2I5EV4cCCR3YwWYwfBEP7kPYUDxN14huOz6lVXnQPDLQ1
qsV1aNfK9PPiw6Fcaop0
=HwDG
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linus-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
Pull xfs updates from Dave Chinner:
"There is quite a varied bunch of stuff in this update, and some of it
you will have already merged through the ext4 tree which imported the
dax-4.10-iomap-pmd topic branch from the XFS tree.
There is also a new direct IO implementation that uses the iomap
infrastructure. It's much simpler, faster, and has lower IO latency
than the existing direct IO infrastructure.
Summary:
- DAX PMD faults via iomap infrastructure
- Direct-io support in iomap infrastructure
- removal of now-redundant XFS inode iolock, replaced with VFS
i_rwsem
- synchronisation with fixes and changes in userspace libxfs code
- extent tree lookup helpers
- lots of little corruption detection improvements to verifiers
- optimised CRC calculations
- faster buffer cache lookups
- deprecation of barrier/nobarrier mount options - we always use
REQ_FUA/REQ_FLUSH where appropriate for data integrity now
- cleanups to speculative preallocation
- miscellaneous minor bug fixes and cleanups"
* tag 'xfs-for-linus-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (63 commits)
xfs: nuke unused tracepoint definitions
xfs: use GPF_NOFS when allocating btree cursors
xfs: use xfs_vn_setattr_size to check on new size
xfs: deprecate barrier/nobarrier mount option
xfs: Always flush caches when integrity is required
xfs: ignore leaf attr ichdr.count in verifier during log replay
xfs: use rhashtable to track buffer cache
xfs: optimise CRC updates
xfs: make xfs btree stats less huge
xfs: don't cap maximum dedupe request length
xfs: don't allow di_size with high bit set
xfs: error out if trying to add attrs and anextents > 0
xfs: don't crash if reading a directory results in an unexpected hole
xfs: complain if we don't get nextents bmap records
xfs: check for bogus values in btree block headers
xfs: forbid AG btrees with level == 0
xfs: several xattr functions can be void
xfs: handle cow fork in xfs_bmap_trace_exlist
xfs: pass state not whichfork to trace_xfs_extlist
xfs: Move AGI buffer type setting to xfs_read_agi
...
It actively hurts proper merging, and makes for a lot of special cases.
There was a good(ish) reason for doing it originally, but it's getting
too painful to maintain. And most of the original reasons for it are
long gone.
So instead of having special code to flush partial lines to the console
(as opposed to the record buffers), do _all_ the console writing from
the record buffer, and be done with it.
If an oops happens (or some other synchronous event), we will flush the
partial lines due to the oops printing activity, so this does not affect
that. It does mean that if you have a completely hung machine, a
partial preceding line may not have been printed out.
That was some of the original reason for this complexity, in fact, back
when we used to test for the historical i386 "halt" instruction problem
by doing
pr_info("Checking 'hlt' instruction... ");
if (!boot_cpu_data.hlt_works_ok) {
pr_cont("disabled\n");
return;
}
halt();
halt();
halt();
halt();
pr_cont("OK\n");
and that model no longer works (it the 'hlt' instruction kills the
machine, the partial line won't have been flushed, so you won't even see
it).
Of course, that was also back in the days when people actually had
textual console output rather than a graphical splash-screen at bootup.
How times change..
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Joe Perches <joe@perches.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Tested-by: Petr Mladek <pmladek@suse.com>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The record logging code looks at the previous record flags in various
ways, and they are all wrong.
You can't use the previous record flags to determine anything about the
next record, because they may simply not be related. In particular, the
reason the previous record was a continuation record may well be exactly
_because_ the new record was printed by a different process, which is
why the previous record was flushed.
So all those games are simply wrong, and make the code hard to
understand (because the code fundamentally cdoes not make sense).
So remove it.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Summary of modules changes for the 4.10 merge window:
* The rodata= cmdline parameter has been extended to additionally
apply to module mappings
* Fix a hard to hit race between module loader error/clean up
handling and ftrace registration
* Some code cleanups, notably panic.c and modules code use a
unified taint_flags table now. This is much cleaner than
duplicating the taint flag code in modules.c
Signed-off-by: Jessica Yu <jeyu@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJYUf6/AAoJEMBFfjjOO8Fy5NoP+gOIus26yWWGymI495jVnX7n
wCga5JgwOL0SLBIPmiDVI7K+jz4eoQZb94eJcwkWDuw2/IvOdF1kB8ha1EOBRMSg
nb9HfIDlWiAPKkyUxe+k6XDb+BMPN3FUSYmBAKD3utsQkD1JWBLY8Id4e234y8Fo
sb3a6rLJbvIEXANrMeU7zO4/y1bVxQAeQPQbVPwlid5s76RKYH6JdGXoo6FKK0uE
Z3I8uQjqjmJ5U4vpjjWl0w+Qa7hIm/x05GpirtNxN6ztxjR+98c/4uRIry8oOX+I
KqRXDOnJ1l/rCwhp+pGLwPfCoDds+V3bknyOwYoxK3hqVVUAd8H0qd1JQ8XClwyJ
jnE0+EQpTt9brOO1Oq2XC+EDjpiuyYm3u91TFwE2VFmP98daBZsX6qY7bm03/GQq
ZLRthWPILNX9glGj4nbHQgdAKmRvYDO3SzWjFZNA75Mr2hbRKLJoWNvfgupDgjsF
giawxV/OcWXvEX92fzkwoUszpfWwoDhGsbimG2SCKYB87vNniG7wrgdjp5aWHhOL
qCUpUhCvE9/dO7kPRinqk5tnpAUGY2jMZ0QgVbpToF6FiHJJSyDjWHR9n0Bl1QTX
uAEZB/Hoav9frZ+MQC/1Yzhq5ejDbEm1ByjolJgbjl6YHBlQceL6NQpFmyEkrn7c
Tx+Q/PvG7/gfxFGMirf1
=bhCS
-----END PGP SIGNATURE-----
Merge tag 'modules-for-v4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux
Pull modules updates from Jessica Yu:
"Summary of modules changes for the 4.10 merge window:
- The rodata= cmdline parameter has been extended to additionally
apply to module mappings
- Fix a hard to hit race between module loader error/clean up
handling and ftrace registration
- Some code cleanups, notably panic.c and modules code use a unified
taint_flags table now. This is much cleaner than duplicating the
taint flag code in modules.c"
* tag 'modules-for-v4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
module: fix DEBUG_SET_MODULE_RONX typo
module: extend 'rodata=off' boot cmdline parameter to module mappings
module: Fix a comment above strong_try_module_get()
module: When modifying a module's text ignore modules which are going away too
module: Ensure a module's state is set accordingly during module coming cleanup code
module: remove trailing whitespace
taint/module: Clean up global and module taint flags handling
modpost: free allocated memory
Merge more updates from Andrew Morton:
- a few misc things
- kexec updates
- DMA-mapping updates to better support networking DMA operations
- IPC updates
- various MM changes to improve DAX fault handling
- lots of radix-tree changes, mainly to the test suite. All leading up
to reimplementing the IDA/IDR code to be a wrapper layer over the
radix-tree. However the final trigger-pulling patch is held off for
4.11.
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (114 commits)
radix tree test suite: delete unused rcupdate.c
radix tree test suite: add new tag check
radix-tree: ensure counts are initialised
radix tree test suite: cache recently freed objects
radix tree test suite: add some more functionality
idr: reduce the number of bits per level from 8 to 6
rxrpc: abstract away knowledge of IDR internals
tpm: use idr_find(), not idr_find_slowpath()
idr: add ida_is_empty
radix tree test suite: check multiorder iteration
radix-tree: fix replacement for multiorder entries
radix-tree: add radix_tree_split_preload()
radix-tree: add radix_tree_split
radix-tree: add radix_tree_join
radix-tree: delete radix_tree_range_tag_if_tagged()
radix-tree: delete radix_tree_locate_item()
radix-tree: improve multiorder iterators
btrfs: fix race in btrfs_free_dummy_fs_info()
radix-tree: improve dump output
radix-tree: make radix_tree_find_next_bit more useful
...
Patch series "mm: unexport __get_user_pages_unlocked()".
This patch series continues the cleanup of get_user_pages*() functions
taking advantage of the fact we can now pass gup_flags as we please.
It firstly adds an additional 'locked' parameter to
get_user_pages_remote() to allow for its callers to utilise
VM_FAULT_RETRY functionality. This is necessary as the invocation of
__get_user_pages_unlocked() in process_vm_rw_single_vec() makes use of
this and no other existing higher level function would allow it to do
so.
Secondly existing callers of __get_user_pages_unlocked() are replaced
with the appropriate higher-level replacement -
get_user_pages_unlocked() if the current task and memory descriptor are
referenced, or get_user_pages_remote() if other task/memory descriptors
are referenced (having acquiring mmap_sem.)
This patch (of 2):
Add a int *locked parameter to get_user_pages_remote() to allow
VM_FAULT_RETRY faulting behaviour similar to get_user_pages_[un]locked().
Taking into account the previous adjustments to get_user_pages*()
functions allowing for the passing of gup_flags, we are now in a
position where __get_user_pages_unlocked() need only be exported for his
ability to allow VM_FAULT_RETRY behaviour, this adjustment allows us to
subsequently unexport __get_user_pages_unlocked() as well as allowing
for future flexibility in the use of get_user_pages_remote().
[sfr@canb.auug.org.au: merge fix for get_user_pages_remote API change]
Link: http://lkml.kernel.org/r/20161122210511.024ec341@canb.auug.org.au
Link: http://lkml.kernel.org/r/20161027095141.2569-2-lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Clean up watchdog handlers", v2.
This is an attempt to cleanup watchdog handlers. Right now,
kernel/watchdog.c implements both softlockup and hardlockup detectors.
Softlockup code is generic. Hardlockup code is arch specific. Some
architectures don't use hardlockup detectors. They use their own
watchdog detectors. To make both these combination work, we have
numerous #ifdefs in kernel/watchdog.c.
We are trying here to make these handlers independent of each other.
Also provide an interface for architectures to implement their own
handlers. watchdog_nmi_enable and watchdog_nmi_disable will be defined
as weak such that architectures can override its definitions.
Thanks to Don Zickus for his suggestions.
Here are our previous discussions
http://www.spinics.net/lists/sparclinux/msg16543.htmlhttp://www.spinics.net/lists/sparclinux/msg16441.html
This patch (of 3):
Move shared macros and definitions to nmi.h so that watchdog.c, new file
watchdog_hld.c or any other architecture specific handler can use those
definitions.
Link: http://lkml.kernel.org/r/1478034826-43888-2-git-send-email-babu.moger@oracle.com
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Josh Hunt <johunt@akamai.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The OpenRISC compiler (so far) fails to optimize away a large portion of
code containing a reference to posix_timer_event in alarmtimer.c when
CONFIG_POSIX_TIMERS is unset. Let's give it a direct clue to let the
build succeed.
This fixes
[linux-next:master 6682/7183] alarmtimer.c:undefined reference to `posix_timer_event'
reported by kbuild test robot.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kdb_trap_printk allows to pass normal printk() messages to kdb via
vkdb_printk(). For example, it is used to get backtrace using the
classic show_stack(), see kdb_show_stack().
vkdb_printf() tries to avoid a potential infinite loop by disabling the
trap. But this approach is racy, for example:
CPU1 CPU2
vkdb_printf()
// assume that kdb_trap_printk == 0
saved_trap_printk = kdb_trap_printk;
kdb_trap_printk = 0;
kdb_show_stack()
kdb_trap_printk++;
Problem1: Now, a nested printk() on CPU0 calls vkdb_printf()
even when it should have been disabled. It will not
cause a deadlock but...
// using the outdated saved value: 0
kdb_trap_printk = saved_trap_printk;
kdb_trap_printk--;
Problem2: Now, kdb_trap_printk == -1 and will stay like this.
It means that all messages will get passed to kdb from
now on.
This patch removes the racy saved_trap_printk handling. Instead, the
recursion is prevented by a check for the locked CPU.
The solution is still kind of racy. A non-related printk(), from
another process, might get trapped by vkdb_printf(). And the wanted
printk() might not get trapped because kdb_printf_cpu is assigned. But
this problem existed even with the original code.
A proper solution would be to get_cpu() before setting kdb_trap_printk
and trap messages only from this CPU. I am not sure if it is worth the
effort, though.
In fact, the race is very theoretical. When kdb is running any of the
commands that use kdb_trap_printk there is a single active CPU and the
other CPUs should be in a holding pen inside kgdb_cpu_enter().
The only time this is violated is when there is a timeout waiting for
the other CPUs to report to the holding pen.
Finally, note that the situation is a bit schizophrenic. vkdb_printf()
explicitly allows recursion but only from KDB code that calls
kdb_printf() directly. On the other hand, the generic printk()
recursion is not allowed because it might cause an infinite loop. This
is why we could not hide the decision inside vkdb_printf() easily.
Link: http://lkml.kernel.org/r/1480412276-16690-4-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kdb_printf_lock does not prevent other CPUs from entering the critical
section because it is ignored when KDB_STATE_PRINTF_LOCK is set.
The problematic situation might look like:
CPU0 CPU1
vkdb_printf()
if (!KDB_STATE(PRINTF_LOCK))
KDB_STATE_SET(PRINTF_LOCK);
spin_lock_irqsave(&kdb_printf_lock, flags);
vkdb_printf()
if (!KDB_STATE(PRINTF_LOCK))
BANG: The PRINTF_LOCK state is set and CPU1 is entering the critical
section without spinning on the lock.
The problem is that the code tries to implement locking using two state
variables that are not handled atomically. Well, we need a custom
locking because we want to allow reentering the critical section on the
very same CPU.
Let's use solution from Petr Zijlstra that was proposed for a similar
scenario, see
https://lkml.kernel.org/r/20161018171513.734367391@infradead.org
This patch uses the same trick with cmpxchg(). The only difference is
that we want to handle only recursion from the same context and
therefore we disable interrupts.
In addition, KDB_STATE_PRINTF_LOCK is removed. In fact, we are not able
to set it a non-racy way.
Link: http://lkml.kernel.org/r/1480412276-16690-3-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kdb_event state variable is only set but never checked in the kernel
code.
http://www.spinics.net/lists/kdb/msg01733.html suggests that this
variable affected WARN_CONSOLE_UNLOCKED() in the original
implementation. But this check never went upstream.
The semantic is unclear and racy. The value is updated after the
kdb_printf_lock is acquired and after it is released. It should be
symmetric at minimum. The value should be manipulated either inside or
outside the locked area.
Fortunately, it seems that the original function is gone and we could
simply remove the state variable.
Link: http://lkml.kernel.org/r/1480412276-16690-2-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Suggested-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We've got a delay loop waiting for secondary CPUs. That loop uses
loops_per_jiffy. However, loops_per_jiffy doesn't actually mean how
many tight loops make up a jiffy on all architectures. It is quite
common to see things like this in the boot log:
Calibrating delay loop (skipped), value calculated using timer
frequency.. 48.00 BogoMIPS (lpj=24000)
In my case I was seeing lots of cases where other CPUs timed out
entering the debugger only to print their stack crawls shortly after the
kdb> prompt was written.
Elsewhere in kgdb we already use udelay(), so that should be safe enough
to use to implement our timeout. We'll delay 1 ms for 1000 times, which
should give us a full second of delay (just like the old code wanted)
but allow us to notice that we're done every 1 ms.
[akpm@linux-foundation.org: simplifications, per Daniel]
Link: http://lkml.kernel.org/r/1477091361-2039-1-git-send-email-dianders@chromium.org
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Brian Norris <briannorris@chromium.org>
Cc: <stable@vger.kernel.org> [4.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is fragile that some definitions acquired via transitive
dependencies, as shown in below:
atomic_* (<linux/atomic.h>)
ENOMEM/EN* (<linux/errno.h>)
EXPORT_SYMBOL (<linux/export.h>)
device_initcall (<linux/init.h>)
preempt_* (<linux/preempt.h>)
Include them to prevent possible issues.
Link: http://lkml.kernel.org/r/1481163221-40170-1-git-send-email-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Smatch complains that we started using the array offset before we
checked that it was valid.
Fixes: 017c59c042 ('relay: Use per CPU constructs for the relay channel buffer pointers')
Link: http://lkml.kernel.org/r/20161013084947.GC16198@mwanda
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Do not break lines while printk()ing values.
kernel: warning: process `tomoyo_file_tes' used the deprecated sysctl system call with
kernel: 3.
kernel: 5.
kernel: 56.
kernel:
Link: http://lkml.kernel.org/r/1480814833-4976-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>