It seems like when ldsem_down_read() fails with timeout, it misses
update for sem->wait_readers. By that reason, when writer finally
releases write end of the semaphore __ldsem_wake_readers() does adjust
sem->count with wrong value:
sem->wait_readers * (LDSEM_ACTIVE_BIAS - LDSEM_WAIT_BIAS)
I.e, if update comes with 1 missed wait_readers decrement, sem->count
will be 0x100000001 which means that there is active reader and it'll
make any further writer to fail in acquiring the semaphore.
It looks like, this is a dead-code, because ldsem_down_read() is never
called with timeout different than MAX_SCHEDULE_TIMEOUT, so it might be
worth to delete timeout parameter and error path fall-back..
Cc: Jiri Slaby <jslaby@suse.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
For some reason ldsem has its own lockdep wrappers, make them go away.
Cc: Jiri Slaby <jslaby@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ldsem_down_read() will sleep if there is pending writer in the queue.
If the writer times out, readers in the queue should be woken up,
otherwise they may miss a chance to acquire the semaphore until the last
active reader will do ldsem_up_read().
There was a couple of reports where there was one active reader and
other readers soft locked up:
Showing all locks held in the system:
2 locks held by khungtaskd/17:
#0: (rcu_read_lock){......}, at: watchdog+0x124/0x6d1
#1: (tasklist_lock){.+.+..}, at: debug_show_all_locks+0x72/0x2d3
2 locks held by askfirst/123:
#0: (&tty->ldisc_sem){.+.+.+}, at: ldsem_down_read+0x46/0x58
#1: (&ldata->atomic_read_lock){+.+...}, at: n_tty_read+0x115/0xbe4
Prevent readers wait for active readers to release ldisc semaphore.
Link: lkml.kernel.org/r/20171121132855.ajdv4k6swzhvktl6@wfg-t540p.sh.intel.com
Link: lkml.kernel.org/r/20180907045041.GF1110@shao2-debian
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Mark found ldsem_cmpxchg() needed an (atomic_long_t *) cast to keep
working after making the atomic_long interface type safe.
Needing casts is bad form, which made me look at the code. There are no
ld_semaphore::count users outside of these functions so there is no
reason why it can not be an atomic_long_t in the first place, obviating
the need for this cast.
That also ensures the loads use atomic_long_read(), which implies (at
least) READ_ONCE() in order to guarantee single-copy-atomic loads.
When using atomic_long_try_cmpxchg() the ldsem_cmpxchg() wrapper gets
very thin (the only difference is not changing *old on success, which
most callers don't seem to care about).
So rework the whole thing to use atomic_long_t and its accessors
directly.
While there, fixup all the horrible comment styles.
Cc: Peter Hurley <peter@hurleysoftware.com>
Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Now that the SPDX tag is in all tty files, that identifies the license
in a specific and legally-defined manner. So the extra GPL text wording
can be removed as it is no longer needed at all.
This is done on a quest to remove the 700+ different ways that files in
the kernel describe the GPL license text. And there's unneeded stuff
like the address (sometimes incorrect) for the FSF which is never
needed.
No copyright headers or other non-license-description text was removed.
Cc: Jiri Slaby <jslaby@suse.com>
Cc: James Hogan <jhogan@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It's good to have SPDX identifiers in all files to make it easier to
audit the kernel tree for correct licenses.
Update the drivers/tty files files with the correct SPDX license
identifier based on the license text in the file itself. The SPDX
identifier is a legally binding shorthand, which can be used instead of
the full boiler plate text.
This work is based on a script and data from Thomas Gleixner, Philippe
Ombredanne, and Kate Stewart.
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: David Sterba <dsterba@suse.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Eric Anholt <eric@anholt.net>
Cc: Stefan Wahren <stefan.wahren@i2se.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Ray Jui <rjui@broadcom.com>
Cc: Scott Branden <sbranden@broadcom.com>
Cc: bcm-kernel-feedback-list@broadcom.com
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Joachim Eastwood <manabian@gmail.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Tobias Klauser <tklauser@distanz.ch>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Richard Genoud <richard.genoud@gmail.com>
Cc: Alexander Shiyan <shc_work@mail.ru>
Cc: Baruch Siach <baruch@tkos.co.il>
Cc: "Maciej W. Rozycki" <macro@linux-mips.org>
Cc: "Uwe Kleine-König" <kernel@pengutronix.de>
Cc: Pat Gefre <pfg@sgi.com>
Cc: "Guilherme G. Piccoli" <gpiccoli@linux.vnet.ibm.com>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Vladimir Zapolskiy <vz@mleia.com>
Cc: Sylvain Lemieux <slemieux.tyco@gmail.com>
Cc: Carlo Caione <carlo@caione.org>
Cc: Kevin Hilman <khilman@baylibre.com>
Cc: Liviu Dudau <liviu.dudau@arm.com>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: Andy Gross <andy.gross@linaro.org>
Cc: David Brown <david.brown@linaro.org>
Cc: "Andreas Färber" <afaerber@suse.de>
Cc: Kevin Cernekee <cernekee@gmail.com>
Cc: Laxman Dewangan <ldewangan@nvidia.com>
Cc: Thierry Reding <thierry.reding@gmail.com>
Cc: Jonathan Hunter <jonathanh@nvidia.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Patrice Chotard <patrice.chotard@st.com>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Peter Korsgaard <jacmet@sunsite.dk>
Cc: Timur Tabi <timur@tabi.org>
Cc: Tony Prisk <linux@prisktech.co.nz>
Cc: Michal Simek <michal.simek@xilinx.com>
Cc: "Sören Brinkmann" <soren.brinkmann@xilinx.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Jiri Slaby <jslaby@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
But first update usage sites with the new header dependency.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/debug.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/debug.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is a nasty interface and setting the state of a foreign task must
not be done. As of the following commit:
be628be095 ("bcache: Make gc wakeup sane, remove set_task_state()")
... everyone in the kernel calls set_task_state() with current, allowing
the helper to be removed.
However, as the comment indicates, it is still around for those archs
where computing current is more expensive than using a pointer, at least
in theory. An important arch that is affected is arm64, however this has
been addressed now [1] and performance is up to par making no difference
with either calls.
Of all the callers, if any, it's the locking bits that would care most
about this -- ie: we end up passing a tsk pointer to a lot of the lock
slowpath, and setting ->state on that. The following numbers are based
on two tests: a custom ad-hoc microbenchmark that just measures
latencies (for ~65 million calls) between get_task_state() vs
get_current_state().
Secondly for a higher overview, an unlink microbenchmark was used,
which pounds on a single file with open, close,unlink combos with
increasing thread counts (up to 4x ncpus). While the workload is quite
unrealistic, it does contend a lot on the inode mutex or now rwsem.
[1] https://lkml.kernel.org/r/1483468021-8237-1-git-send-email-mark.rutland@arm.com
== 1. x86-64 ==
Avg runtime set_task_state(): 601 msecs
Avg runtime set_current_state(): 552 msecs
vanilla dirty
Hmean unlink1-processes-2 36089.26 ( 0.00%) 38977.33 ( 8.00%)
Hmean unlink1-processes-5 28555.01 ( 0.00%) 29832.55 ( 4.28%)
Hmean unlink1-processes-8 37323.75 ( 0.00%) 44974.57 ( 20.50%)
Hmean unlink1-processes-12 43571.88 ( 0.00%) 44283.01 ( 1.63%)
Hmean unlink1-processes-21 34431.52 ( 0.00%) 38284.45 ( 11.19%)
Hmean unlink1-processes-30 34813.26 ( 0.00%) 37975.17 ( 9.08%)
Hmean unlink1-processes-48 37048.90 ( 0.00%) 39862.78 ( 7.59%)
Hmean unlink1-processes-79 35630.01 ( 0.00%) 36855.30 ( 3.44%)
Hmean unlink1-processes-110 36115.85 ( 0.00%) 39843.91 ( 10.32%)
Hmean unlink1-processes-141 32546.96 ( 0.00%) 35418.52 ( 8.82%)
Hmean unlink1-processes-172 34674.79 ( 0.00%) 36899.21 ( 6.42%)
Hmean unlink1-processes-203 37303.11 ( 0.00%) 36393.04 ( -2.44%)
Hmean unlink1-processes-224 35712.13 ( 0.00%) 36685.96 ( 2.73%)
== 2. ppc64le ==
Avg runtime set_task_state(): 938 msecs
Avg runtime set_current_state: 940 msecs
vanilla dirty
Hmean unlink1-processes-2 19269.19 ( 0.00%) 30704.50 ( 59.35%)
Hmean unlink1-processes-5 20106.15 ( 0.00%) 21804.15 ( 8.45%)
Hmean unlink1-processes-8 17496.97 ( 0.00%) 17243.28 ( -1.45%)
Hmean unlink1-processes-12 14224.15 ( 0.00%) 17240.21 ( 21.20%)
Hmean unlink1-processes-21 14155.66 ( 0.00%) 15681.23 ( 10.78%)
Hmean unlink1-processes-30 14450.70 ( 0.00%) 15995.83 ( 10.69%)
Hmean unlink1-processes-48 16945.57 ( 0.00%) 16370.42 ( -3.39%)
Hmean unlink1-processes-79 15788.39 ( 0.00%) 14639.27 ( -7.28%)
Hmean unlink1-processes-110 14268.48 ( 0.00%) 14377.40 ( 0.76%)
Hmean unlink1-processes-141 14023.65 ( 0.00%) 16271.69 ( 16.03%)
Hmean unlink1-processes-172 13417.62 ( 0.00%) 16067.55 ( 19.75%)
Hmean unlink1-processes-203 15293.08 ( 0.00%) 15440.40 ( 0.96%)
Hmean unlink1-processes-234 13719.32 ( 0.00%) 16190.74 ( 18.01%)
Hmean unlink1-processes-265 16400.97 ( 0.00%) 16115.22 ( -1.74%)
Hmean unlink1-processes-296 14388.60 ( 0.00%) 16216.13 ( 12.70%)
Hmean unlink1-processes-320 15771.85 ( 0.00%) 15905.96 ( 0.85%)
x86-64 (known to be fast for get_current()/this_cpu_read_stable() caching)
and ppc64 (with paca) show similar improvements in the unlink microbenches.
The small delta for ppc64 (2ms), does not represent the gains on the unlink
runs. In the case of x86, there was a decent amount of variation in the
latency runs, but always within a 20 to 50ms increase), ppc was more constant.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Cc: mark.rutland@arm.com
Link: http://lkml.kernel.org/r/1483479794-14013-5-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch effectively replaces the tsk pointer dereference
(which is obviously == current), to directly use get_current()
macro. This is to make the removal of setting foreign task
states smoother and painfully obvious. Performance win on some
archs such as x86-64 and ppc64 -- arm64 is no longer an issue.
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: mark.rutland@arm.com
Link: http://lkml.kernel.org/r/1483479794-14013-3-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We should not be doing assignments within an if () block
so fix up the code to not do this.
change was created using Coccinelle.
CC: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The "int check" argument of lock_acquire() and held_lock->check are
misleading. This is actually a boolean: 2 means "true", everything
else is "false".
And there is no need to pass 1 or 0 to lock_acquire() depending on
CONFIG_PROVE_LOCKING, __lock_acquire() checks prove_locking at the
start and clears "check" if !CONFIG_PROVE_LOCKING.
Note: probably we can simply kill this member/arg. The only explicit
user of check => 0 is rcu_lock_acquire(), perhaps we can change it to
use lock_acquire(trylock =>, read => 2). __lockdep_no_validate means
check => 0 implicitly, but we can change validate_chain() to check
hlock->instance->key instead. Not to mention it would be nice to get
rid of lockdep_set_novalidate_class().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140120182006.GA26495@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a controlling tty is being hung up and the hang up is
waiting for a just-signalled tty reader or writer to exit, and a new tty
reader/writer tries to acquire an ldisc reference concurrently with the
ldisc reference release from the signalled reader/writer, the hangup
can hang. The new reader/writer is sleeping in ldsem_down_read() and the
hangup is sleeping in ldsem_down_write() [1].
The new reader/writer fails to wakeup the waiting hangup because the
wrong lock count value is checked (the old lock count rather than the new
lock count) to see if the lock is unowned.
Change helper function to return the new lock count if the cmpxchg was
successful; document this behavior.
[1] edited dmesg log from reporter
SysRq : Show Blocked State
task PC stack pid father
systemd D ffff88040c4f0000 0 1 0 0x00000000
ffff88040c49fbe0 0000000000000046 ffff88040c4a0000 ffff88040c49ffd8
00000000001d3980 00000000001d3980 ffff88040c4a0000 ffff88040593d840
ffff88040c49fb40 ffffffff810a4cc0 0000000000000006 0000000000000023
Call Trace:
[<ffffffff810a4cc0>] ? sched_clock_cpu+0x9f/0xe4
[<ffffffff810a4cc0>] ? sched_clock_cpu+0x9f/0xe4
[<ffffffff810a4cc0>] ? sched_clock_cpu+0x9f/0xe4
[<ffffffff810a4cc0>] ? sched_clock_cpu+0x9f/0xe4
[<ffffffff817a6649>] schedule+0x24/0x5e
[<ffffffff817a588b>] schedule_timeout+0x15b/0x1ec
[<ffffffff810a4cc0>] ? sched_clock_cpu+0x9f/0xe4
[<ffffffff817aa691>] ? _raw_spin_unlock_irq+0x24/0x26
[<ffffffff817aa10c>] down_read_failed+0xe3/0x1b9
[<ffffffff817aa26d>] ldsem_down_read+0x8b/0xa5
[<ffffffff8142b5ca>] ? tty_ldisc_ref_wait+0x1b/0x44
[<ffffffff8142b5ca>] tty_ldisc_ref_wait+0x1b/0x44
[<ffffffff81423f5b>] tty_write+0x7d/0x28a
[<ffffffff814241f5>] redirected_tty_write+0x8d/0x98
[<ffffffff81424168>] ? tty_write+0x28a/0x28a
[<ffffffff8115d03f>] do_loop_readv_writev+0x56/0x79
[<ffffffff8115e604>] do_readv_writev+0x1b0/0x1ff
[<ffffffff8116ea0b>] ? do_vfs_ioctl+0x32a/0x489
[<ffffffff81167d9d>] ? final_putname+0x1d/0x3a
[<ffffffff8115e6c7>] vfs_writev+0x2e/0x49
[<ffffffff8115e7d3>] SyS_writev+0x47/0xaa
[<ffffffff817ab822>] system_call_fastpath+0x16/0x1b
bash D ffffffff81c104c0 0 5469 5302 0x00000082
ffff8800cf817ac0 0000000000000046 ffff8804086b22a0 ffff8800cf817fd8
00000000001d3980 00000000001d3980 ffff8804086b22a0 ffff8800cf817a48
000000000000b9a0 ffff8800cf817a78 ffffffff81004675 ffff8800cf817a44
Call Trace:
[<ffffffff81004675>] ? dump_trace+0x165/0x29c
[<ffffffff810a4cc0>] ? sched_clock_cpu+0x9f/0xe4
[<ffffffff8100edda>] ? save_stack_trace+0x26/0x41
[<ffffffff817a6649>] schedule+0x24/0x5e
[<ffffffff817a588b>] schedule_timeout+0x15b/0x1ec
[<ffffffff810a4cc0>] ? sched_clock_cpu+0x9f/0xe4
[<ffffffff817a9f03>] ? down_write_failed+0xa3/0x1c9
[<ffffffff817aa691>] ? _raw_spin_unlock_irq+0x24/0x26
[<ffffffff817a9f0b>] down_write_failed+0xab/0x1c9
[<ffffffff817aa300>] ldsem_down_write+0x79/0xb1
[<ffffffff817aada3>] ? tty_ldisc_lock_pair_timeout+0xa5/0xd9
[<ffffffff817aada3>] tty_ldisc_lock_pair_timeout+0xa5/0xd9
[<ffffffff8142bf33>] tty_ldisc_hangup+0xc4/0x218
[<ffffffff81423ab3>] __tty_hangup+0x2e2/0x3ed
[<ffffffff81424a76>] disassociate_ctty+0x63/0x226
[<ffffffff81078aa7>] do_exit+0x79f/0xa11
[<ffffffff81086bdb>] ? get_signal_to_deliver+0x206/0x62f
[<ffffffff810b4bfb>] ? lock_release_holdtime.part.8+0xf/0x16e
[<ffffffff81079b05>] do_group_exit+0x47/0xb5
[<ffffffff81086c16>] get_signal_to_deliver+0x241/0x62f
[<ffffffff810020a7>] do_signal+0x43/0x59d
[<ffffffff810f2af7>] ? __audit_syscall_exit+0x21a/0x2a8
[<ffffffff810b4bfb>] ? lock_release_holdtime.part.8+0xf/0x16e
[<ffffffff81002655>] do_notify_resume+0x54/0x6c
[<ffffffff817abaf8>] int_signal+0x12/0x17
Reported-by: Sami Farin <sami.farin@gmail.com>
Cc: <stable@vger.kernel.org> # 3.12.x
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The semantics of a rw semaphore are almost ideally suited
for tty line discipline lifetime management; multiple active
threads obtain "references" (read locks) while performing i/o
to prevent the loss or change of the current line discipline
(write lock).
Unfortunately, the existing rw_semaphore is ill-suited in other
ways;
1) TIOCSETD ioctl (change line discipline) expects to return an
error if the line discipline cannot be exclusively locked within
5 secs. Lock wait timeouts are not supported by rwsem.
2) A tty hangup is expected to halt and scrap pending i/o, so
exclusive locking must be prioritized.
Writer priority is not supported by rwsem.
Add ld_semaphore which implements these requirements in a
semantically similar way to rw_semaphore.
Writer priority is handled by separate wait lists for readers and
writers. Pending write waits are priortized before existing read
waits and prevent further read locks.
Wait timeouts are trivially added, but obviously change the lock
semantics as lock attempts can fail (but only due to timeout).
This implementation incorporates the write-lock stealing work of
Michel Lespinasse <walken@google.com>.
Cc: Michel Lespinasse <walken@google.com>
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>