Pull perf fixes from Ingo Molnar:
"Six fixes for bugs that were found via fuzzing, and a trivial
hw-enablement patch for AMD Family-17h CPU PMUs"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/uncore: Allow only a single PMU/box within an events group
perf/x86/intel: Cure bogus unwind from PEBS entries
perf/x86: Restore TASK_SIZE check on frame pointer
perf/core: Fix address filter parser
perf/x86: Add perf support for AMD family-17h processors
perf/x86/uncore: Fix crash by removing bogus event_list[] handling for SNB client uncore IMC
perf/core: Do not set cpuctx->cgrp for unscheduled cgroups
Exactly because for_each_thread() in autogroup_move_group() can't see it
and update its ->sched_task_group before _put() and possibly free().
So the exiting task needs another sched_move_task() before exit_notify()
and we need to re-introduce the PF_EXITING (or similar) check removed by
the previous change for another reason.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hartsjc@redhat.com
Cc: vbendel@redhat.com
Cc: vlovejoy@redhat.com
Link: http://lkml.kernel.org/r/20161114184612.GA15968@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The PF_EXITING check in task_wants_autogroup() is no longer needed. Remove
it, but see the next patch.
However the comment is correct in that autogroup_move_group() must always
change task_group() for every thread so the sysctl_ check is very wrong;
we can race with cgroups and even sys_setsid() is not safe because a task
running with task_group() == ag->tg must participate in refcounting:
int main(void)
{
int sctl = open("/proc/sys/kernel/sched_autogroup_enabled", O_WRONLY);
assert(sctl > 0);
if (fork()) {
wait(NULL); // destroy the child's ag/tg
pause();
}
assert(pwrite(sctl, "1\n", 2, 0) == 2);
assert(setsid() > 0);
if (fork())
pause();
kill(getppid(), SIGKILL);
sleep(1);
// The child has gone, the grandchild runs with kref == 1
assert(pwrite(sctl, "0\n", 2, 0) == 2);
assert(setsid() > 0);
// runs with the freed ag/tg
for (;;)
sleep(1);
return 0;
}
crashes the kernel. It doesn't really need sleep(1), it doesn't matter if
autogroup_move_group() actually frees the task_group or this happens later.
Reported-by: Vern Lovejoy <vlovejoy@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hartsjc@redhat.com
Cc: vbendel@redhat.com
Link: http://lkml.kernel.org/r/20161114184609.GA15965@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull sparc fixes from David Miller:
1) With modern networking cards we can run out of 32-bit DMA space, so
support 64-bit DMA addressing when possible on sparc64. From Dave
Tushar.
2) Some signal frame validation checks are inverted on sparc32, fix
from Andreas Larsson.
3) Lockdep tables can get too large in some circumstances on sparc64,
add a way to adjust the size a bit. From Babu Moger.
4) Fix NUMA node probing on some sun4v systems, from Thomas Tai.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
sparc: drop duplicate header scatterlist.h
lockdep: Limit static allocations if PROVE_LOCKING_SMALL is defined
config: Adding the new config parameter CONFIG_PROVE_LOCKING_SMALL for sparc
sunbmac: Fix compiler warning
sunqe: Fix compiler warnings
sparc64: Enable 64-bit DMA
sparc64: Enable sun4v dma ops to use IOMMU v2 APIs
sparc64: Bind PCIe devices to use IOMMU v2 service
sparc64: Initialize iommu_map_table and iommu_pool
sparc64: Add ATU (new IOMMU) support
sparc64: Add FORCE_MAX_ZONEORDER and default to 13
sparc64: fix compile warning section mismatch in find_node()
sparc32: Fix inverted invalid_frame_pointer checks on sigreturns
sparc64: Fix find_node warning if numa node cannot be found
Pull networking fixes from David Miller:
1) Clear congestion control state when changing algorithms on an
existing socket, from Florian Westphal.
2) Fix register bit values in altr_tse_pcs portion of stmmac driver,
from Jia Jie Ho.
3) Fix PTP handling in stammc driver for GMAC4, from Giuseppe
CAVALLARO.
4) Fix udplite multicast delivery handling, it ignores the udp_table
parameter passed into the lookups, from Pablo Neira Ayuso.
5) Synchronize the space estimated by rtnl_vfinfo_size and the space
actually used by rtnl_fill_vfinfo. From Sabrina Dubroca.
6) Fix memory leak in fib_info when splitting nodes, from Alexander
Duyck.
7) If a driver does a napi_hash_del() explicitily and not via
netif_napi_del(), it must perform RCU synchronization as needed. Fix
this in virtio-net and bnxt drivers, from Eric Dumazet.
8) Likewise, it is not necessary to invoke napi_hash_del() is we are
also doing neif_napi_del() in the same code path. Remove such calls
from be2net and cxgb4 drivers, also from Eric Dumazet.
9) Don't allocate an ID in peernet2id_alloc() if the netns is dead,
from WANG Cong.
10) Fix OF node and device struct leaks in of_mdio, from Johan Hovold.
11) We cannot cache routes in ip6_tunnel when using inherited traffic
classes, from Paolo Abeni.
12) Fix several crashes and leaks in cpsw driver, from Johan Hovold.
13) Splice operations cannot use freezable blocking calls in AF_UNIX,
from WANG Cong.
14) Link dump filtering by master device and kind support added an error
in loop index updates during the dump if we actually do filter, fix
from Zhang Shengju.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (59 commits)
tcp: zero ca_priv area when switching cc algorithms
net: l2tp: Treat NET_XMIT_CN as success in l2tp_eth_dev_xmit
ethernet: stmmac: make DWMAC_STM32 depend on it's associated SoC
tipc: eliminate obsolete socket locking policy description
rtnl: fix the loop index update error in rtnl_dump_ifinfo()
l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind()
net: macb: add check for dma mapping error in start_xmit()
rtnetlink: fix FDB size computation
netns: fix get_net_ns_by_fd(int pid) typo
af_unix: conditionally use freezable blocking calls in read
net: ethernet: ti: cpsw: fix fixed-link phy probe deferral
net: ethernet: ti: cpsw: add missing sanity check
net: ethernet: ti: cpsw: fix secondary-emac probe error path
net: ethernet: ti: cpsw: fix of_node and phydev leaks
net: ethernet: ti: cpsw: fix deferred probe
net: ethernet: ti: cpsw: fix mdio device reference leak
net: ethernet: ti: cpsw: fix bad register access in probe error path
net: sky2: Fix shutdown crash
cfg80211: limit scan results cache size
net sched filters: pass netlink message flags in event notification
...
The token table passed into match_token() must be null-terminated, which
it currently is not in the perf's address filter string parser, as caught
by Vince's perf_fuzzer and KASAN.
It doesn't blow up otherwise because of the alignment padding of the table
to the next element in the .rodata, which is luck.
Fixing by adding a null-terminator to the token table.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dvyukov@google.com
Cc: stable@vger.kernel.org # v4.7+
Fixes: 375637bc52 ("perf/core: Introduce address range filtering")
Link: http://lkml.kernel.org/r/877f81f264.fsf@ashishki-desk.ger.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reduce the size of data structure for lockdep entries by half if
PROVE_LOCKING_SMALL if defined. This is used only for sparc.
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Acked-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
I made some invalid assumptions with BPF_AND and BPF_MOD that could result in
invalid accesses to bpf map entries. Fix this up by doing a few things
1) Kill BPF_MOD support. This doesn't actually get used by the compiler in real
life and just adds extra complexity.
2) Fix the logic for BPF_AND, don't allow AND of negative numbers and set the
minimum value to 0 for positive AND's.
3) Don't do operations on the ranges if they are set to the limits, as they are
by definition undefined, and allowing arithmetic operations on those values
could make them appear valid when they really aren't.
This fixes the testcase provided by Jann as well as a few other theoretical
problems.
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
can cause a ftrace check to trigger and disable ftrace. This is because
of the way modules are registered to ftrace. Their functions are
loaded in the ftrace function tables but set to "disabled" since
they are still in the process of being loaded by the module. After
the module is finished, it calls back into the ftrace infrastructure
to enable it. Looking deeper into the locations that access all the
functions in the table, I found more locations that should ignore
the disabled ones.
-----BEGIN PGP SIGNATURE-----
iQExBAABCAAbBQJYKxq1FBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
k24IAIvZoDRiFvwdF8ZNrngD8ZT1IfEg771yzouKgTXT+mHhty8IpO6YRIYxJ5vS
bLmmg1wGhSjBUGdAsm1e16r8bbxuZo7SbYx0CUz/XEQimbKJZU56M6oBrp22qtEv
f4TBN0RPHhghG5IpHTm1Cvh4FZ9NgDkoWVuSKf7/vhxJ1GNpu/cpaTS60x+X6Fdj
mFVIdwDRcupqXJLtqoB4tDC+iekqD0Zwj9yxpBL12Em/PvcgtXsBO4oaP0vEMOES
ylGwEY0jCSpRJ2EsX1QnZDMP3DM0m9JLGmkzGuTwekGsHxe9+ODyQeYZyZ105rbD
hYPfo3xS0diXnOyosxetgjefUwQ=
=VoP8
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Alexei discovered a race condition in modules failing to load that can
cause a ftrace check to trigger and disable ftrace.
This is because of the way modules are registered to ftrace. Their
functions are loaded in the ftrace function tables but set to
"disabled" since they are still in the process of being loaded by the
module. After the module is finished, it calls back into the ftrace
infrastructure to enable it.
Looking deeper into the locations that access all the functions in the
table, I found more locations that should ignore the disabled ones"
* tag 'trace-v4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Add more checks for FTRACE_FL_DISABLED in processing ip records
ftrace: Ignore FTRACE_FL_DISABLED while walking dyn_ftrace records
Commit:
db4a835601 ("perf/core: Set cgroup in CPU contexts for new cgroup events")
failed to verify that event->cgrp is actually the scheduled cgroup
in a CPU before setting cpuctx->cgrp. This patch fixes that.
Now that there is a different path for scheduled and unscheduled
cgroup, add a warning to catch when cpuctx->cgrp is still set after
the last cgroup event has been unsheduled.
To verify the bug:
# Create 2 cgroups.
mkdir /dev/cgroups/devices/g1
mkdir /dev/cgroups/devices/g2
# launch a task, bind it to a cpu and move it to g1
CPU=2
while :; do : ; done &
P=$!
taskset -pc $CPU $P
echo $P > /dev/cgroups/devices/g1/tasks
# monitor g2 (it runs no tasks) and observe output
perf stat -e cycles -I 1000 -C $CPU -G g2
# time counts unit events
1.000091408 7,579,527 cycles g2
2.000350111 <not counted> cycles g2
3.000589181 <not counted> cycles g2
4.000771428 <not counted> cycles g2
# note first line that displays that a task run in g2, despite
# g2 having no tasks. This is because cpuctx->cgrp was wrongly
# set when context of new event was installed.
# After applying the fix we obtain the right output:
perf stat -e cycles -I 1000 -C $CPU -G g2
# time counts unit events
1.000119615 <not counted> cycles g2
2.000389430 <not counted> cycles g2
3.000590962 <not counted> cycles g2
Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Link: http://lkml.kernel.org/r/1478026378-86083-1-git-send-email-davidcc@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull networking fixes from David Miller:
1) Fix off by one wrt. indexing when dumping /proc/net/route entries,
from Alexander Duyck.
2) Fix lockdep splats in iwlwifi, from Johannes Berg.
3) Cure panic when inserting certain netfilter rules when NFT_SET_HASH
is disabled, from Liping Zhang.
4) Memory leak when nft_expr_clone() fails, also from Liping Zhang.
5) Disable UFO when path will apply IPSEC tranformations, from Jakub
Sitnicki.
6) Don't bogusly double cwnd in dctcp module, from Florian Westphal.
7) skb_checksum_help() should never actually use the value "0" for the
resulting checksum, that has a special meaning, use CSUM_MANGLED_0
instead. From Eric Dumazet.
8) Per-tx/rx queue statistic strings are wrong in qed driver, fix from
Yuval MIntz.
9) Fix SCTP reference counting of associations and transports in
sctp_diag. From Xin Long.
10) When we hit ip6tunnel_xmit() we could have come from an ipv4 path in
a previous layer or similar, so explicitly clear the ipv6 control
block in the skb. From Eli Cooper.
11) Fix bogus sleeping inside of inet_wait_for_connect(), from WANG
Cong.
12) Correct deivce ID of T6 adapter in cxgb4 driver, from Hariprasad
Shenai.
13) Fix potential access past the end of the skb page frag array in
tcp_sendmsg(). From Eric Dumazet.
14) 'skb' can legitimately be NULL in inet{,6}_exact_dif_match(). Fix
from David Ahern.
15) Don't return an error in tcp_sendmsg() if we wronte any bytes
successfully, from Eric Dumazet.
16) Extraneous unlocks in netlink_diag_dump(), we removed the locking
but forgot to purge these unlock calls. From Eric Dumazet.
17) Fix memory leak in error path of __genl_register_family(). We leak
the attrbuf, from WANG Cong.
18) cgroupstats netlink policy table is mis-sized, from WANG Cong.
19) Several XDP bug fixes in mlx5, from Saeed Mahameed.
20) Fix several device refcount leaks in network drivers, from Johan
Hovold.
21) icmp6_send() should use skb dst device not skb->dev to determine L3
routing domain. From David Ahern.
22) ip_vs_genl_family sets maxattr incorrectly, from WANG Cong.
23) We leak new macvlan port in some cases of maclan_common_netlink()
errors. Fix from Gao Feng.
24) Similar to the icmp6_send() fix, icmp_route_lookup() should
determine L3 routing domain using skb_dst(skb)->dev not skb->dev.
Also from David Ahern.
25) Several fixes for route offloading and FIB notification handling in
mlxsw driver, from Jiri Pirko.
26) Properly cap __skb_flow_dissect()'s return value, from Eric Dumazet.
27) Fix long standing regression in ipv4 redirect handling, wrt.
validating the new neighbour's reachability. From Stephen Suryaputra
Lin.
28) If sk_filter() trims the packet excessively, handle it reasonably in
tcp input instead of exploding. From Eric Dumazet.
29) Fix handling of napi hash state when copying channels in sfc driver,
from Bert Kenward.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (121 commits)
mlxsw: spectrum_router: Flush FIB tables during fini
net: stmmac: Fix lack of link transition for fixed PHYs
sctp: change sk state only when it has assocs in sctp_shutdown
bnx2: Wait for in-flight DMA to complete at probe stage
Revert "bnx2: Reset device during driver initialization"
ps3_gelic: fix spelling mistake in debug message
net: ethernet: ixp4xx_eth: fix spelling mistake in debug message
ibmvnic: Fix size of debugfs name buffer
ibmvnic: Unmap ibmvnic_statistics structure
sfc: clear napi_hash state when copying channels
mlxsw: spectrum_router: Correctly dump neighbour activity
mlxsw: spectrum: Fix refcount bug on span entries
bnxt_en: Fix VF virtual link state.
bnxt_en: Fix ring arithmetic in bnxt_setup_tc().
Revert "include/uapi/linux/atm_zatm.h: include linux/time.h"
tcp: take care of truncations done by sk_filter()
ipv4: use new_gw for redirect neigh lookup
r8152: Fix error path in open function
net: bpqether.h: remove if_ether.h guard
net: __skb_flow_dissect() must cap its return value
...
When a module is first loaded and its function ip records are added to the
ftrace list of functions to modify, they are set to DISABLED, as their text
is still in a read only state. When the module is fully loaded, and can be
updated, the flag is cleared, and if their's any functions that should be
tracing them, it is updated at that moment.
But there's several locations that do record accounting and should ignore
records that are marked as disabled, or they can cause issues.
Alexei already fixed one location, but others need to be addressed.
Cc: stable@vger.kernel.org
Fixes: b7ffffbb46 "ftrace: Add infrastructure for delayed enabling of module functions"
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ftrace_shutdown() checks for sanity of ftrace records
and if dyn_ftrace->flags is not zero, it will warn.
It can happen that 'flags' are set to FTRACE_FL_DISABLED at this point,
since some module was loaded, but before ftrace_module_enable()
cleared the flags for this module.
In other words the module.c is doing:
ftrace_module_init(mod); // calls ftrace_update_code() that sets flags=FTRACE_FL_DISABLED
... // here ftrace_shutdown() is called that warns, since
err = prepare_coming_module(mod); // didn't have a chance to clear FTRACE_FL_DISABLED
Fix it by ignoring disabled records.
It's similar to what __ftrace_hash_rec_update() is already doing.
Link: http://lkml.kernel.org/r/1478560460-3818619-1-git-send-email-ast@fb.com
Cc: stable@vger.kernel.org
Fixes: b7ffffbb46 "ftrace: Add infrastructure for delayed enabling of module functions"
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This reverts commit bfd8d3f23b.
It turns out that this flushes things much too aggressiverly, and causes
lines to break up when the system logger races with new continuation
lines being printed.
There's a pending patch to make printk() flushing much more
straightforward, but it's too invasive for 4.9, so in the meantime let's
just not make the system message logging flush continuation lines.
They'll be flushed by the final newline anyway.
Suggested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull irq fix from Ingo Molnar:
"This fixes a genirq regression that resulted in the Intel/Broxton
pinctrl/GPIO driver (and possibly others) spewing warnings"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Use irq type from irqdata instead of irqdesc
- Prevent the PM core from attempting to suspend parent devices
if any of their children, whose suspend callbacks were invoked
asynchronously, have failed to suspend during the "late" and
"noirq" phases of system-wide suspend of devices (Brian Norris).
- Prevent the boot-time system suspend test code from leaking a
reference to the RTC device used by it (Johan Hovold).
- Fix cpupower to use the return value of one of its library
functions correctly and restore the correct behavior of it
when used for setting cpufreq tunables broken during the 4.7
development cycle (Laura Abbott).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJYJloUAAoJEILEb/54YlRxkYgP/RvTRLO+6Hlk0Re/Lm5SS7O4
YjOrcemWkGkA+YJ5Qe4vki28nMzGITIAWuHJK7Nex0A1zFMX3GrNPxxMPHBejfog
7VjeEp77x0h9F8EzWX2YLZaTCSpvQjpHH5/vUQaTIE8a/WtZVbSzfCT4jqAYYHzV
rHn52aNwiAKd4kjMTSBJdjTOdhqebRFYRHpBHovoTARcGH4DgEJciQzaEh1O6qAq
barOGrDDkB32mULjsBNRX4Wyx1wuJiOpVjiD7sIM75ZomlIDC/RyYc9O37t6GDBC
koTTlilV2/i2VoqcU5nQgAmgUOoDz1skGJW6Rsp/d/S2GnZiLBJwma31e187otBL
NFl1egL1BJaz5J1sow7YPYBQILzlYGwNhXLiKHoPEshWp5WSn7mIhxxzAou3Llsp
oPCXDwuRrMhqYrOdclxiL+Pj6dUapvut3SerZB5C9rGg2JL9Y2gDxrhf/jFqXxNb
mBw3kRx94bDb6DNHJPuplRHWrL0SlM2ZXnUa7XwUSt+jIiwMK2la0cAL1P/7Qva9
zJ6SG01BnGEd6OI+y9ZSjITdpTVK10JIuCrTmR9cvWNA+A1vOGCkg6Nj+qunmzFh
j6W2O0KQ/moR0ICCPCg96Hjf888mohX7vhGQqDQwpXYlsvWcWvjNpURsfY47yBH9
p70+ziBkqyyIAprJKjLB
=ykEg
-----END PGP SIGNATURE-----
Merge tag 'pm-4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix two bugs in error code paths in the PM core (system-wide
suspend of devices), a device reference leak in the boot-time suspend
test code and a cpupower utility regression from the 4.7 cycle.
Specifics:
- Prevent the PM core from attempting to suspend parent devices if
any of their children, whose suspend callbacks were invoked
asynchronously, have failed to suspend during the "late" and
"noirq" phases of system-wide suspend of devices (Brian Norris).
- Prevent the boot-time system suspend test code from leaking a
reference to the RTC device used by it (Johan Hovold).
- Fix cpupower to use the return value of one of its library
functions correctly and restore the correct behavior of it when
used for setting cpufreq tunables broken during the 4.7 development
cycle (Laura Abbott)"
* tag 'pm-4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM / sleep: don't suspend parent when async child suspend_{noirq, late} fails
PM / sleep: fix device reference leak in test_suspend
cpupower: Correct return type of cpu_power_is_cpu_online() in cpufreq-set
This reverts commit 05fd007e46 ("console: don't prefer first
registered if DT specifies stdout-path").
The reverted commit changes existing behavior on which many ARM boards
rely. Many ARM small-board-computers, like e.g. the Raspberry Pi have
both a video output and a serial console. Depending on whether the user
is using the device as a more regular computer; or as a headless device
we need to have the console on either one or the other.
Many users rely on the kernel behavior of the console being present on
both outputs, before the reverted commit the console setup with no
console= kernel arguments on an ARM board which sets stdout-path in dt
would look like this:
[root@localhost ~]# cat /proc/consoles
ttyS0 -W- (EC p a) 4:64
tty0 -WU (E p ) 4:1
Where as after the reverted commit, it looks like this:
[root@localhost ~]# cat /proc/consoles
ttyS0 -W- (EC p a) 4:64
This commit reverts commit 05fd007e46 ("console: don't prefer first
registered if DT specifies stdout-path") restoring the original
behavior.
Fixes: 05fd007e46 ("console: don't prefer first registered if DT specifies stdout-path")
Link: http://lkml.kernel.org/r/20161104121135.4780-2-hdegoede@redhat.com
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The type flags in the irq descriptor are there for historical reasons and
only updated via irq_modify_status() or irq_set_type(). Both functions also
update the type flags in irqdata. __setup_irq() is the only left over user
of the type flags in the irq descriptor.
If __setup_irq() is called with empty irq type flags, then the type flags
are retrieved from irqdata. If an interrupt is shared, then the type flags
are compared with the type flags stored in the irq descriptor.
On x86 the ioapic does not have a irq_set_type() callback because the type
is defined in the BIOS tables and cannot be changed. The type is stored in
irqdata at setup time without updating the type data in the irq
descriptor. As a result the comparison described above fails.
There is no point in updating the irq descriptor flags because the only
relevant storage is irqdata. Use the type flags from irqdata for both
retrieval and comparison in __setup_irq() instead.
Aside of that the print out in case of non matching type flags has the old
and new type flags arguments flipped. Fix that as well.
For correctness sake the flags stored in the irq descriptor should be
removed, but this is beyond the scope of this bugfix and will be done in a
later patch.
Fixes: 4b357daed6 ("genirq: Look-up trigger type if not specified by caller")
Reported-and-tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Jon Hunter <jonathanh@nvidia.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1611072020360.3501@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In map_create(), we first find and create the map, then once that
suceeded, we charge it to the user's RLIMIT_MEMLOCK, and then fetch
a new anon fd through anon_inode_getfd(). The problem is, once the
latter fails f.e. due to RLIMIT_NOFILE limit, then we only destruct
the map via map->ops->map_free(), but without uncharging the previously
locked memory first. That means that the user_struct allocation is
leaked as well as the accounted RLIMIT_MEMLOCK memory not released.
Make the label names in the fix consistent with bpf_prog_load().
Fixes: aaac3ba95e ("bpf: charge user for creation of BPF maps and programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit a6ed3ea65d ("bpf: restore behavior of bpf_map_update_elem")
added an extra per-cpu reserve to the hash table map to restore old
behaviour from pre prealloc times. When non-prealloc is in use for a
map, then problem is that once a hash table extra element has been
linked into the hash-table, and the hash table is destroyed due to
refcount dropping to zero, then htab_map_free() -> delete_all_elements()
will walk the whole hash table and drop all elements via htab_elem_free().
The problem is that the element from the extra reserve is first fed
to the wrong backend allocator and eventually freed twice.
Fixes: a6ed3ea65d ("bpf: restore behavior of bpf_map_update_elem")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull stack vmap fixups from Thomas Gleixner:
"Two small patches related to sched_show_task():
- make sure to hold a reference on the task stack while accessing it
- remove the thread_saved_pc printout
.. and add a sanity check into release_task_stack() to catch problems
with task stack references"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Remove pointless printout in sched_show_task()
sched/core: Fix oops in sched_show_task()
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
fork: Add task stack refcounting sanity check and prevent premature task stack freeing
cgroupstats_cmd_get_policy is [CGROUPSTATS_CMD_ATTR_MAX+1],
taskstats_cmd_get_policy[TASKSTATS_CMD_ATTR_MAX+1],
but their family.maxattr is TASKSTATS_CMD_ATTR_MAX.
CGROUPSTATS_CMD_ATTR_MAX is less than TASKSTATS_CMD_ATTR_MAX,
so we could end up accessing out-of-bound.
Change cgroupstats_cmd_get_policy to TASKSTATS_CMD_ATTR_MAX+1,
this is safe because the rest are initialized to 0's.
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When CONFIG_THREAD_INFO_IN_TASK=y, it is possible that an exited thread
remains in the task list after its stack pointer was already set to NULL.
Therefore, thread_saved_pc() and stack_not_used() in sched_show_task()
will trigger NULL pointer dereference if an attempt to dump such thread's
traces (e.g. SysRq-t, khungtaskd) is made.
Since show_stack() in sched_show_task() calls try_get_task_stack() and
sched_show_task() is called from interrupt context, calling
try_get_task_stack() from sched_show_task() will be safe as well.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@alien8.de
Cc: brgerst@gmail.com
Cc: jann@thejh.net
Cc: keescook@chromium.org
Cc: linux-api@vger.kernel.org
Cc: tycho.andersen@canonical.com
Link: http://lkml.kernel.org/r/201611021950.FEJ34368.HFFJOOMLtQOVSF@I-love.SAKURA.ne.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make sure to drop the reference taken by class_find_device() after
opening the RTC device.
Fixes: 77437fd4e6 (pm: boot time suspend selftest)
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If something goes wrong with task stack refcounting and a stack
refcount hits zero too early, warn and leak it rather than
potentially freeing it early (and silently).
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/f29119c783a9680a4b4656e751b6123917ace94b.1477926663.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Specifics:
- Fix a missing KERN_CONT in a system suspend message by converting
the affected code to using pr_info() and pr_cont() instead of the
"raw" printk() (Jon Hunter).
- Make intel_pstate set the CPU P-state from its .set_policy()
callback when the scaling_governor sysfs attribute is set to
"performance" so that it interacts with NOHZ_FULL more
predictably which was the case before 4.7 (Rafael Wysocki).
- Make intel_pstate always request the maximum allowed P-state when
the scaling_governor sysfs attribute is set to "performance" to
prevent it from effectively ingoring that setting is some
situations (Rafael Wysocki).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJYE+vbAAoJEILEb/54YlRxsqQQAIZLfm5Isp4L0JBt2svtLYuO
5mLgGOfcpsHIpqq5+saKtK+JOcCpm/MJ0PMQdv3WhKkFGXG9++Dz6F2gNNFlE+Sy
era/B9NFry1BsQ14xsX8XonYRkQ2tS+CgOw+akSZl9BTVeX57VASqPIzlKI4Auz/
Ru5M8n4QWAD7dPWnqHTFHYbmTnMtr18wn0mNjiC/cLz7TPxmGM9Pe730/bpGKMDC
DQpflnpT/ktqV1imDuaiMeuP2re2DSZWMrirZo326UsDtcKN1A4sCO+PPUUt/TTb
k6pKsZ9OB9FgMmaRqx5G6wCD3NvuhKa1JOqg3RvAVZ0SmMcl2AFhCaoEJtX+TKvz
3365o2tSQeGc3bmXTb7YqP2SaYm107rOG/M0I2JuA7qNUKxk8MFfbNdaYxTY2ezT
D5NF0SCycOMeEXIDTTuZ2oN1NHR6osO7bNFOIacKlcIL8PmEl8fDqX2KdYQze5xh
AGsa1Oee7FPrDd2kQIL6HNGgVjd6gtHU9cVjyzQeUv9SU6zwOlu4/BbZI+wIterN
L2OfW7w6/KSSyScDZ3TbifeJwpP918bUv578dEP6U/2VKthPBUPY/MYMMQzSuEIn
RAQoDShqacIjdyLZtStii28KctMOk0L/edEDbdFmyu2ttNwpB+hk2ieCYWfI8wAa
2SAaYyRGpC2nafsSTv+E
=RtkB
-----END PGP SIGNATURE-----
Merge tag 'pm-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix two intel_pstate issues related to the way it works when the
scaling_governor sysfs attribute is set to "performance" and fix up
messages in the system suspend core code.
Specifics:
- Fix a missing KERN_CONT in a system suspend message by converting
the affected code to using pr_info() and pr_cont() instead of the
"raw" printk() (Jon Hunter).
- Make intel_pstate set the CPU P-state from its .set_policy()
callback when the scaling_governor sysfs attribute is set to
"performance" so that it interacts with NOHZ_FULL more predictably
which was the case before 4.7 (Rafael Wysocki).
- Make intel_pstate always request the maximum allowed P-state when
the scaling_governor sysfs attribute is set to "performance" to
prevent it from effectively ingoring that setting is some
situations (Rafael Wysocki)"
* tag 'pm-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: intel_pstate: Always set max P-state in performance mode
PM / suspend: Fix missing KERN_CONT for suspend message
cpufreq: intel_pstate: Set P-state upfront in performance mode
Pull perf fixes from Ingo Molnar:
"Misc kernel fixes: a virtualization environment related fix, an uncore
PMU driver removal handling fix, a PowerPC fix and new events for
Knights Landing"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Honour the CPUID for number of fixed counters in hypervisors
perf/powerpc: Don't call perf_event_disable() from atomic context
perf/core: Protect PMU device removal with a 'pmu_bus_running' check, to fix CONFIG_DEBUG_TEST_DRIVER_REMOVE=y kernel panic
perf/x86/intel/cstate: Add C-state residency events for Knights Landing
Pull timer fixes from Ingo Molnar:
"Fix four timer locking races: two were noticed by Linus while
reviewing the code while chasing for a corruption bug, and two
from fixing spurious USB timeouts"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers: Prevent base clock corruption when forwarding
timers: Prevent base clock rewind when forwarding clock
timers: Lock base for same bucket optimization
timers: Plug locking race vs. timer migration
Pull objtool, irq and scheduler fixes from Ingo Molnar:
"One more objtool fixlet for GCC6 code generation patterns, an irq
DocBook fix and an unused variable warning fix in the scheduler"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Fix rare switch jump table pattern detection
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
doc: Add missing parameter for msi_setup
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Remove unused but set variable 'rq'
The trinity syscall fuzzer triggered following WARN() on powerpc:
WARNING: CPU: 9 PID: 2998 at arch/powerpc/kernel/hw_breakpoint.c:278
...
NIP [c00000000093aedc] .hw_breakpoint_handler+0x28c/0x2b0
LR [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0
Call Trace:
[c0000002f7933580] [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0 (unreliable)
[c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0
[c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0
[c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0
[c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100
[c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48
Followed by a lockdep warning:
===============================
[ INFO: suspicious RCU usage. ]
4.8.0-rc5+ #7 Tainted: G W
-------------------------------
./include/linux/rcupdate.h:556 Illegal context switch in RCU read-side critical section!
other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 0
2 locks held by ls/2998:
#0: (rcu_read_lock){......}, at: [<c0000000000f6a00>] .__atomic_notifier_call_chain+0x0/0x1c0
#1: (rcu_read_lock){......}, at: [<c00000000093ac50>] .hw_breakpoint_handler+0x0/0x2b0
stack backtrace:
CPU: 9 PID: 2998 Comm: ls Tainted: G W 4.8.0-rc5+ #7
Call Trace:
[c0000002f7933150] [c00000000094b1f8] .dump_stack+0xe0/0x14c (unreliable)
[c0000002f79331e0] [c00000000013c468] .lockdep_rcu_suspicious+0x138/0x180
[c0000002f7933270] [c0000000001005d8] .___might_sleep+0x278/0x2e0
[c0000002f7933300] [c000000000935584] .mutex_lock_nested+0x64/0x5a0
[c0000002f7933410] [c00000000023084c] .perf_event_ctx_lock_nested+0x16c/0x380
[c0000002f7933500] [c000000000230a80] .perf_event_disable+0x20/0x60
[c0000002f7933580] [c00000000093aeec] .hw_breakpoint_handler+0x29c/0x2b0
[c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0
[c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0
[c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0
[c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100
[c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48
While it looks like the first WARN() is probably valid, the other one is
triggered by disabling event via perf_event_disable() from atomic context.
The event is disabled here in case we were not able to emulate
the instruction that hit the breakpoint. By disabling the event
we unschedule the event and make sure it's not scheduled back.
But we can't call perf_event_disable() from atomic context, instead
we need to use the event's pending_disable irq_work method to disable it.
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161026094824.GA21397@krava
Signed-off-by: Ingo Molnar <mingo@kernel.org>
CAI Qian reported a crash in the PMU uncore device removal code,
enabled by the CONFIG_DEBUG_TEST_DRIVER_REMOVE=y option:
https://marc.info/?l=linux-kernel&m=147688837328451
The reason for the crash is that perf_pmu_unregister() tries to remove
a PMU device which is not added at this point. We add PMU devices
only after pmu_bus is registered, which happens in the
perf_event_sysfs_init() call and sets the 'pmu_bus_running' flag.
The fix is to get the 'pmu_bus_running' flag state at the point
the PMU is taken out of the PMU list and remove the device
later only if it's set.
Reported-by: CAI Qian <caiqian@redhat.com>
Tested-by: CAI Qian <caiqian@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161020111011.GA13361@krava
Signed-off-by: Ingo Molnar <mingo@kernel.org>
in_interrupt() returns a nonzero value when we are either in an
interrupt or have bh disabled via local_bh_disable(). Since we are
interested in only ignoring coverage from actual interrupts, do a proper
check instead of just calling in_interrupt().
As a result of this change, kcov will start to collect coverage from
within local_bh_disable()/local_bh_enable() sections.
Link: http://lkml.kernel.org/r/1476115803-20712-1-git-send-email-andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Nicolai Stange <nicstange@gmail.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: James Morse <james.morse@arm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull block fixes from Jens Axboe:
"A set of fixes for this series, most notably the fix for the blk-mq
software queue regression in from this merge window.
Apart from that, a fix for an unlikely hang if a queue is flooded with
FUA requests from Ming, and a few small fixes for nbd and badblocks.
Lastly, a rename update for the proc softirq output, since the block
polling code was made generic"
* 'for-linus' of git://git.kernel.dk/linux-block:
blk-mq: update hardware and software queues for sleeping alloc
block: flush: fix IO hang in case of flood fua req
nbd: fix incorrect unlock of nbd->sock_lock in sock_shutdown
badblocks: badblocks_set/clear update unacked_exist
softirq: Display IRQ_POLL for irq-poll statistics
The per-zone waitqueues exist because of a scalability issue with the
page waitqueues on some NUMA machines, but it turns out that they hurt
normal loads, and now with the vmalloced stacks they also end up
breaking gfs2 that uses a bit_wait on a stack object:
wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)
where 'gh' can be a reference to the local variable 'mount_gh' on the
stack of fill_super().
The reason the per-zone hash table breaks for this case is that there is
no "zone" for virtual allocations, and trying to look up the physical
page to get at it will fail (with a BUG_ON()).
It turns out that I actually complained to the mm people about the
per-zone hash table for another reason just a month ago: the zone lookup
also hurts the regular use of "unlock_page()" a lot, because the zone
lookup ends up forcing several unnecessary cache misses and generates
horrible code.
As part of that earlier discussion, we had a much better solution for
the NUMA scalability issue - by just making the page lock have a
separate contention bit, the waitqueue doesn't even have to be looked at
for the normal case.
Peter Zijlstra already has a patch for that, but let's see if anybody
even notices. In the meantime, let's fix the actual gfs2 breakage by
simplifying the bitlock waitqueues and removing the per-zone issue.
Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
Tested-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit:
8663e24d56 ("sched/fair: Reorder cgroup creation code")
... the variable 'rq' in alloc_fair_sched_group() is set but no longer used.
Remove it to fix the following GCC warning when building with 'W=1':
kernel/sched/fair.c:8842:13: warning: variable ‘rq’ set but not used [-Wunused-but-set-variable]
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161026113704.8981-1-tklauser@distanz.ch
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a timer is enqueued we try to forward the timer base clock. This
mechanism has two issues:
1) Forwarding a remote base unlocked
The forwarding function is called from get_target_base() with the current
timer base lock held. But if the new target base is a different base than
the current base (can happen with NOHZ, sigh!) then the forwarding is done
on an unlocked base. This can lead to corruption of base->clk.
Solution is simple: Invoke the forwarding after the target base is locked.
2) Possible corruption due to jiffies advancing
This is similar to the issue in get_net_timer_interrupt() which was fixed
in the previous patch. jiffies can advance between check and assignement
and therefore advancing base->clk beyond the next expiry value.
So we need to read jiffies into a local variable once and do the checks and
assignment with the local copy.
Fixes: a683f390b93f("timers: Forward the wheel clock whenever possible")
Reported-by: Ashton Holmes <scoopta@gmail.com>
Reported-by: Michael Thayer <michael.thayer@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Michal Necasek <michal.necasek@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: knut.osmundsen@oracle.com
Cc: stable@vger.kernel.org
Cc: stern@rowland.harvard.edu
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20161022110552.253640125@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Ashton and Michael reported, that kernel versions 4.8 and later suffer from
USB timeouts which are caused by the timer wheel rework.
This is caused by a bug in the base clock forwarding mechanism, which leads
to timers expiring early. The scenario which leads to this is:
run_timers()
while (jiffies >= base->clk) {
collect_expired_timers();
base->clk++;
expire_timers();
}
So base->clk = jiffies + 1. Now the cpu goes idle:
idle()
get_next_timer_interrupt()
nextevt = __next_time_interrupt();
if (time_after(nextevt, base->clk))
base->clk = jiffies;
jiffies has not advanced since run_timers(), so this assignment effectively
decrements base->clk by one.
base->clk is the index into the timer wheel arrays. So let's assume the
following state after the base->clk increment in run_timers():
jiffies = 0
base->clk = 1
A timer gets enqueued with an expiry delta of 63 ticks (which is the case
with the USB timeout and HZ=250) so the resulting bucket index is:
base->clk + delta = 1 + 63 = 64
The timer goes into the first wheel level. The array size is 64 so it ends
up in bucket 0, which is correct as it takes 63 ticks to advance base->clk
to index into bucket 0 again.
If the cpu goes idle before jiffies advance, then the bug in the forwarding
mechanism sets base->clk back to 0, so the next invocation of run_timers()
at the next tick will index into bucket 0 and therefore expire the timer 62
ticks too early.
Instead of blindly setting base->clk to jiffies we must make the forwarding
conditional on jiffies > base->clk, but we cannot use jiffies for this as
we might run into the following issue:
if (time_after(jiffies, base->clk) {
if (time_after(nextevt, base->clk))
base->clk = jiffies;
jiffies can increment between the check and the assigment far enough to
advance beyond nextevt. So we need to use a stable value for checking.
get_next_timer_interrupt() has the basej argument which is the jiffies
value snapshot taken in the calling code. So we can just that.
Thanks to Ashton for bisecting and providing trace data!
Fixes: a683f390b9 ("timers: Forward the wheel clock whenever possible")
Reported-by: Ashton Holmes <scoopta@gmail.com>
Reported-by: Michael Thayer <michael.thayer@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Michal Necasek <michal.necasek@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: knut.osmundsen@oracle.com
Cc: stable@vger.kernel.org
Cc: stern@rowland.harvard.edu
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20161022110552.175308322@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Linus stumbled over the unlocked modification of the timer expiry value in
mod_timer() which is an optimization for timers which stay in the same
bucket - due to the bucket granularity - despite their expiry time getting
updated.
The optimization itself still makes sense even if we take the lock, because
in case that the bucket stays the same, we avoid the pointless
queue/enqueue dance.
Make the check and the modification of timer->expires protected by the base
lock and shuffle the remaining code around so we can keep the lock held
when we actually have to requeue the timer to a different bucket.
Fixes: f00c0afdfa ("timers: Implement optimization for same expiry time in mod_timer()")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1610241711220.4983@nanos
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Linus noticed that lock_timer_base() lacks a READ_ONCE() for accessing the
timer flags. As a consequence the compiler is allowed to reload the flags
between the initial check for TIMER_MIGRATION and the following timer base
computation and the spin lock of the base.
While this has not been observed (yet), we need to make sure that it never
happens.
Fixes: 0eeda71bc3 ("timer: Replace timer base by a cpu index")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1610241711220.4983@nanos
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Commit 4bcc595ccd (printk: reinstate KERN_CONT for printing
continuation lines) exposed a missing KERN_CONT from one of the
messages shown on entering suspend. With v4.9-rc1, the 'done.' shown
after syncing the filesystems no longer appears as a continuation but
a new message with its own timestamp.
[ 9.259566] PM: Syncing filesystems ... [ 9.264119] done.
Fix this by adding the KERN_CONT log level for the 'done.' part of the
message seen after syncing filesystems. While we are at it, convert
these suspend printks to pr_info and pr_cont, respectively.
Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull irq fixes from Ingo Molnar:
"Mostly irqchip driver fixes, plus a symbol export"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
kernel/irq: Export irq_set_parent()
irqchip/gic: Add missing \n to CPU IF adjustment message
irqchip/jcore: Don't show Kconfig menu item for driver
irqchip/eznps: Drop pointless static qualifier in nps400_of_init()
irqchip/gic-v3-its: Fix entry size mask for GITS_BASER
irqchip/gic-v3-its: Fix 64bit GIC{R,ITS}_TYPER accesses
This library was moved to the generic area and was
renamed to irq-poll. Hence, update proc/softirqs output accordingly.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
The TPS65217 driver grew interrupt support which uses
irq_set_parent(). While it's not yet clear why this is used in the first
place, building the driver as a module fails with:
ERROR: ".irq_set_parent" [drivers/mfd/tps65217.ko] undefined!
The correctness of the driver change is still investigated, but for now
it's less trouble to export irq_set_parent() than dealing with the build
wreckage.
[ tglx: Rewrote changelog and made the export GPL ]
Fixes: 6556bdacf6 ("mfd: tps65217: Add support for IRQs")
Signed-off-by: Sudip Mukherjee <sudip.mukherjee@codethink.co.uk>
Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Cc: Marcin Niestroj <m.niestroj@grinn-global.com>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Tony Lindgren <tony@atomide.com>
Cc: Lee Jones <lee.jones@linaro.org>
Link: http://lkml.kernel.org/r/1475775403-27207-1-git-send-email-sudipm.mukherjee@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull scheduler fix from Ingo Molnar:
"This fixes a group scheduling related performance/interactivity
regression introduced in v4.8, which affects certain hardware
environments where cpu_possible_mask != cpu_present_mask"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix incorrect task group ->load_avg
We have a fairly common pattern where you print several things as
continuations on one single line in a loop, and then at the end you do
printk(KERN_CONT "\n");
to flush the buffered output.
But if the output was flushed by something else (concurrent printk
activity, or just system logging), we don't want that final flushing to
just print an empty line.
So just suppress empty continuation lines when they couldn't be merged
into the line they are a continuation of.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge the gup_flags cleanups from Lorenzo Stoakes:
"This patch series adjusts functions in the get_user_pages* family such
that desired FOLL_* flags are passed as an argument rather than
implied by flags.
The purpose of this change is to make the use of FOLL_FORCE explicit
so it is easier to grep for and clearer to callers that this flag is
being used. The use of FOLL_FORCE is an issue as it overrides missing
VM_READ/VM_WRITE flags for the VMA whose pages we are reading
from/writing to, which can result in surprising behaviour.
The patch series came out of the discussion around commit 38e0885465
("mm: check VMA flags to avoid invalid PROT_NONE NUMA balancing"),
which addressed a BUG_ON() being triggered when a page was faulted in
with PROT_NONE set but having been overridden by FOLL_FORCE.
do_numa_page() was run on the assumption the page _must_ be one marked
for NUMA node migration as an actual PROT_NONE page would have been
dealt with prior to this code path, however FOLL_FORCE introduced a
situation where this assumption did not hold.
See
https://marc.info/?l=linux-mm&m=147585445805166
for the patch proposal"
Additionally, there's a fix for an ancient bug related to FOLL_FORCE and
FOLL_WRITE by me.
[ This branch was rebased recently to add a few more acked-by's and
reviewed-by's ]
* gup_flag-cleanups:
mm: replace access_process_vm() write parameter with gup_flags
mm: replace access_remote_vm() write parameter with gup_flags
mm: replace __access_remote_vm() write parameter with gup_flags
mm: replace get_user_pages_remote() write/force parameters with gup_flags
mm: replace get_user_pages() write/force parameters with gup_flags
mm: replace get_vaddr_frames() write/force parameters with gup_flags
mm: replace get_user_pages_locked() write/force parameters with gup_flags
mm: replace get_user_pages_unlocked() write/force parameters with gup_flags
mm: remove write/force parameters from __get_user_pages_unlocked()
mm: remove write/force parameters from __get_user_pages_locked()
mm: remove gup_flags FOLL_WRITE games from __get_user_pages()
This removes the 'write' argument from access_process_vm() and replaces
it with 'gup_flags' as use of this function previously silently implied
FOLL_FORCE, whereas after this patch callers explicitly pass this flag.
We make this explicit as use of FOLL_FORCE can result in surprising
behaviour (and hence bugs) within the mm subsystem.
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This removes the 'write' and 'force' from get_user_pages_remote() and
replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in
callers as use of this flag can result in surprising behaviour (and
hence bugs) within the mm subsystem.
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>