Commit Graph

37229 Commits

Author SHA1 Message Date
Suren Baghdasaryan dce4910396 mm: wire up syscall process_mrelease
Split off from prev patch in the series that implements the syscall.

Link: https://lkml.kernel.org/r/20210809185259.405936-2-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Jan Engelhardt <jengelh@inai.de>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 09:58:17 -07:00
Charan Teja Reddy 65d759c8f9 mm: compaction: support triggering of proactive compaction by user
The proactive compaction[1] gets triggered for every 500msec and run
compaction on the node for COMPACTION_HPAGE_ORDER (usually order-9) pages
based on the value set to sysctl.compaction_proactiveness.  Triggering the
compaction for every 500msec in search of COMPACTION_HPAGE_ORDER pages is
not needed for all applications, especially on the embedded system
usecases which may have few MB's of RAM.  Enabling the proactive
compaction in its state will endup in running almost always on such
systems.

Other side, proactive compaction can still be very much useful for getting
a set of higher order pages in some controllable manner(controlled by
using the sysctl.compaction_proactiveness).  So, on systems where enabling
the proactive compaction always may proove not required, can trigger the
same from user space on write to its sysctl interface.  As an example, say
app launcher decide to launch the memory heavy application which can be
launched fast if it gets more higher order pages thus launcher can prepare
the system in advance by triggering the proactive compaction from
userspace.

This triggering of proactive compaction is done on a write to
sysctl.compaction_proactiveness by user.

[1]https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=facdaa917c4d5a376d09d25865f5a863f906234a

[akpm@linux-foundation.org: tweak vm.rst, per Mike]

Link: https://lkml.kernel.org/r/1627653207-12317-1-git-send-email-charante@codeaurora.org
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Nitin Gupta <nigupta@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 09:58:17 -07:00
Vasily Averin c509723ec2 memcg: enable accounting for posix_timers_cache slab
A program may create multiple interval timers using timer_create().  For
each timer the kernel preallocates a "queued real-time signal",
Consequently, the number of timers is limited by the RLIMIT_SIGPENDING
resource limit.  The allocated object is quite small, ~250 bytes, but even
the default signal limits allow to consume up to 100 megabytes per user.

It makes sense to account for them to limit the host's memory consumption
from inside the memcg-limited container.

Link: https://lkml.kernel.org/r/57795560-025c-267c-6b1a-dea852d95530@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yutian Yang <nglaive@gmail.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 09:58:13 -07:00
Vasily Averin 5f58c39819 memcg: enable accounting for signals
When a user send a signal to any another processes it forces the kernel to
allocate memory for 'struct sigqueue' objects.  The number of signals is
limited by RLIMIT_SIGPENDING resource limit, but even the default settings
allow each user to consume up to several megabytes of memory.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Link: https://lkml.kernel.org/r/e34e958c-e785-712e-a62a-2c7b66c646c7@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yutian Yang <nglaive@gmail.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 09:58:12 -07:00
Vasily Averin 30acd0bdfb memcg: enable accounting for new namesapces and struct nsproxy
Container admin can create new namespaces and force kernel to allocate up
to several pages of memory for the namespaces and its associated
structures.

Net and uts namespaces have enabled accounting for such allocations.  It
makes sense to account for rest ones to restrict the host's memory
consumption from inside the memcg-limited container.

Link: https://lkml.kernel.org/r/5525bcbf-533e-da27-79b7-158686c64e13@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yutian Yang <nglaive@gmail.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 09:58:12 -07:00
Vasily Averin fab827dbee memcg: enable accounting for pids in nested pid namespaces
Commit 5d097056c9 ("kmemcg: account certain kmem allocations to memcg")
enabled memcg accounting for pids allocated from init_pid_ns.pid_cachep,
but forgot to adjust the setting for nested pid namespaces.  As a result,
pid memory is not accounted exactly where it is really needed, inside
memcg-limited containers with their own pid namespaces.

Pid was one the first kernel objects enabled for memcg accounting.
init_pid_ns.pid_cachep marked by SLAB_ACCOUNT and we can expect that any
new pids in the system are memcg-accounted.

Though recently I've noticed that it is wrong.  nested pid namespaces
creates own slab caches for pid objects, nested pids have increased size
because contain id both for all parent and for own pid namespaces.  The
problem is that these slab caches are _NOT_ marked by SLAB_ACCOUNT, as a
result any pids allocated in nested pid namespaces are not
memcg-accounted.

Pid struct in nested pid namespace consumes up to 500 bytes memory, 100000
such objects gives us up to ~50Mb unaccounted memory, this allow container
to exceed assigned memcg limits.

Link: https://lkml.kernel.org/r/8b6de616-fd1a-02c6-cbdb-976ecdcfa604@virtuozzo.com
Fixes: 5d097056c9 ("kmemcg: account certain kmem allocations to memcg")
Cc: stable@vger.kernel.org
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 09:58:12 -07:00
David Hildenbrand 8d0920bde5 mm: remove VM_DENYWRITE
All in-tree users of MAP_DENYWRITE are gone. MAP_DENYWRITE cannot be
set from user space, so all users are gone; let's remove it.

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
2021-09-03 18:42:01 +02:00
David Hildenbrand fe69d560b5 kernel/fork: always deny write access to current MM exe_file
We want to remove VM_DENYWRITE only currently only used when mapping the
executable during exec. During exec, we already deny_write_access() the
executable, however, after exec completes the VMAs mapped
with VM_DENYWRITE effectively keeps write access denied via
deny_write_access().

Let's deny write access when setting or replacing the MM exe_file. With
this change, we can remove VM_DENYWRITE for mapping executables.

Make set_mm_exe_file() return an error in case deny_write_access()
fails; note that this should never happen, because exec code does a
deny_write_access() early and keeps write access denied when calling
set_mm_exe_file. However, it makes the code easier to read and makes
set_mm_exe_file() and replace_mm_exe_file() look more similar.

This represents a minor user space visible change:
sys_prctl(PR_SET_MM_MAP/EXE_FILE) can now fail if the file is already
opened writable. Also, after sys_prctl(PR_SET_MM_MAP/EXE_FILE) the file
cannot be opened writable. Note that we can already fail with -EACCES if
the file doesn't have execute permissions.

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
2021-09-03 18:42:01 +02:00
David Hildenbrand 35d7bdc860 kernel/fork: factor out replacing the current MM exe_file
Let's factor the main logic out into replace_mm_exe_file(), such that
all mm->exe_file logic is contained in kernel/fork.c.

While at it, perform some simple cleanups that are possible now that
we're simplifying the individual functions.

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
2021-09-03 18:42:01 +02:00
Thomas Gleixner 340576590d futex: Avoid redundant task lookup
No need to do the full VPID based task lookup and validation of the top
waiter when the user space futex was acquired on it's behalf during the
requeue_pi operation. The task is known already and it cannot go away
before requeue_pi_wake_futex() has been invoked.

Split out the actual attach code from attach_pi_state_owner() and use that
instead of the full blown variant.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210902094414.676104881@linutronix.de
2021-09-02 22:07:18 +02:00
Thomas Gleixner 249955e51c futex: Clarify comment for requeue_pi_wake_futex()
It's slightly confusing.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210902094414.618613025@linutronix.de
2021-09-02 22:07:18 +02:00
Thomas Gleixner 4f07ec0d76 futex: Prevent inconsistent state and exit race
The recent rework of the requeue PI code introduced a possibility for
going back to user space in inconsistent state:

CPU 0				CPU 1

requeue_futex()
  if (lock_pifutex_user()) {
      dequeue_waiter();
      wake_waiter(task);
				sched_in(task);
     				return_from_futex_syscall();

  ---> Inconsistent state because PI state is not established

It becomes worse if the woken up task immediately exits:

				sys_exit();
				
      attach_pistate(vpid);	<--- FAIL


Attach the pi state before dequeuing and waking the waiter. If the waiter
gets a spurious wakeup before the dequeue operation it will wait in
futex_requeue_pi_wakeup_sync() and therefore cannot return and exit.

Fixes: 07d91ef510 ("futex: Prevent requeue_pi() lock nesting issue on RT")
Reported-by: syzbot+4d1bd0725ef09168e1a0@syzkaller.appspotmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210902094414.558914045@linutronix.de
2021-09-02 22:07:18 +02:00
Colin Ian King a974b54036 futex: Return error code instead of assigning it without effect
The check on the rt_waiter and top_waiter->pi_state is assigning an error
return code to ret but this later gets re-assigned, hence the check is
ineffective.

Return -EINVAL rather than assigning it to ret which was the original
intent.

Fixes: dc7109aaa2 ("futex: Validate waiter correctly in futex_proxy_trylock_atomic()")
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: André Almeida <andrealmeid@collabora.com>
Link: https://lore.kernel.org/r/20210818131840.34262-1-colin.king@canonical.com
2021-09-02 22:07:18 +02:00
Mike Galbraith 15eb7c888e locking/rwsem: Add missing __init_rwsem() for PREEMPT_RT
730633f0b7 became the first direct caller of __init_rwsem() vs the
usual init_rwsem(), exposing PREEMPT_RT's lack thereof.  Add it.

[ tglx: Move it out of line ]

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/50a936b7d8f12277d6ec7ed2ef0421a381056909.camel@gmx.de
2021-09-02 22:07:17 +02:00
Linus Torvalds aa829778b1 LKMM updates:
- Update documentation and code example
 
 KCSAN updates:
 
  - Introduce CONFIG_KCSAN_STRICT (which RCU uses)
  - Optimize use of get_ctx() by kcsan_found_watchpoint()
  - Rework atomic.h into permissive.h
  - Add the ability to ignore writes that change only one bit of a given data-racy variable.
  - Improve comments
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmEvHeQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1g2pw/+NBZ4TGHc7pcxVbKQUdiOZ+BJEQG0OfUZ
 bDQbHDsEjjXXav9udGABJb9n/zbh+08Z+ME9H4RAIG7pLUMcITOVSpCdF7x8USFI
 T5Z5IkN0gKUXMyNGzxA28Q0nTMl7F2ndlwygrKUeVXEMjnX6QtkILaXWiFesJSkR
 bBNDUAYfk5+9kMbyx2RQ9VlNpHh0J6sYIwbHsbPOakw2dJaHLhusEUVEWIIGG7uy
 aCyCDY6kT5GWXIFDZIUSHs2wYH3calNJ77nD0JWpD/6XWCbiS87F0zKtW9y3kfA4
 0Lk6mJJDNSqUjbF0Y2sBYkuaisJI5lLxK1mfYET3AN7E4ZfDY9BJFXfMEPtMaKbD
 pW9p3CAiiYsJzg9976tHq4m7iO75nV2I29ZAv913oNSdOkxwH3ahE+6TFkAgNXnT
 57j0QFUS8MxpxxqBFcY45ukmDjzeQ0SKIVpWQnfCQ8mlfvTJhBrllXIdhEenVtZK
 b83k3Mer/wYmfpLpQVsUYck/Yk0QV1doywOv/BVbwhxbMeEI27BvoGpGZTVVPEZF
 VhhdC4sbySG1Eva0wiJ2abyC9u5INlxOxRuSjSGqNJhTXwDWTw4gfWmoowm1sCVz
 uoJoR2bzZzDGda6C5g8OvXATeQR/DtjjLPI7V9MEhLfVFukHKQDaNYXkAxZb7cyn
 IepaMtiJ0RA=
 =99fO
 -----END PGP SIGNATURE-----

Merge tag 'locking-debug-2021-09-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull memory model updates from Ingo Molnar:
 "LKMM updates:

   - Update documentation and code example

  KCSAN updates:

   - Introduce CONFIG_KCSAN_STRICT (which RCU uses)

   - Optimize use of get_ctx() by kcsan_found_watchpoint()

   - Rework atomic.h into permissive.h

   - Add the ability to ignore writes that change only one bit of a
     given data-racy variable.

   - Improve comments"

* tag 'locking-debug-2021-09-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tools/memory-model: Document data_race(READ_ONCE())
  tools/memory-model: Heuristics using data_race() must handle all values
  tools/memory-model: Add example for heuristic lockless reads
  tools/memory-model: Make read_foo_diagnostic() more clearly diagnostic
  kcsan: Make strict mode imply interruptible watchers
  kcsan: permissive: Ignore data-racy 1-bit value changes
  kcsan: Print if strict or non-strict during init
  kcsan: Rework atomic.h into permissive.h
  kcsan: Reduce get_ctx() uses in kcsan_found_watchpoint()
  kcsan: Introduce CONFIG_KCSAN_STRICT
  kcsan: Remove CONFIG_KCSAN_DEBUG
  kcsan: Improve some Kconfig comments
2021-09-02 13:00:15 -07:00
Linus Torvalds 4a3bb4200a dma-mapping updates for Linux 5.15
- fix debugfs initialization order (Anthony Iliopoulos)
  - use memory_intersects() directly (Kefeng Wang)
  - allow to return specific errors from ->map_sg
    (Logan Gunthorpe, Martin Oliveira)
  - turn the dma_map_sg return value into an unsigned int (me)
  - provide a common global coherent pool іmplementation (me)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmEvY+8LHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYPaehAAsgnBzzzoLHO83pgs0KL92c+0DiMNHYmaMCJOvZXk
 x2Irv+O74WikRJc4S7uQ26p2spjmUxjmiOjld+8+NN0liD4QO9BQ/SZpIp8emuKS
 /yPG6Xh86xSl/OrPL1y7kGeHkRi5sm3mRhcTdILFQFPLcSReupe++GRfnvrpbOPk
 tj3pBGXluD6iJH12BBt00ushUVzZ0F2xaF6xUDAs94RSZ3tlqsfx6c928Y1KxSZh
 f89q/KuaokyogFG7Ujj/nYgIUETaIs2W6UmxBfRzdEMJFSffwomUMbw+M+qGJ7/d
 2UjamFYRX16FReE8WNsndbX1E6k5JBW12E1qwV3dUwatlNLWEaRq3PNiWkF7zcFH
 LDkpDYN6s5bIDPTfDp21XfPygoH8KQhnD9lVf0aB7n04uu8VJrGB9+10PpkCJVXD
 0b2dcuSwCO7hAfTfNGVV8f3EI/1XPflr1hJvMgcVtY53CR96ldp+4QaElzWLXumN
 MyptirmrVITNVyVwGzhGAblXBLWdarXD0EXudyiaF4Xbrj3AkIOSUCghEwKLpjQf
 UwMFFwSE8yGxKTRK4HfU5gMzy6G751fU7TUe5lmxZLovDflQoSXMWgHE8e7r0Qel
 o5v6lmUzoWz2fAISf3xjauo2ncgmfWMwYM6C7OJy5nG73QXLQId9J+ReXbmrgrrN
 DgI=
 =spje
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.15' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping updates from Christoph Hellwig:

 - fix debugfs initialization order (Anthony Iliopoulos)

 - use memory_intersects() directly (Kefeng Wang)

 - allow to return specific errors from ->map_sg (Logan Gunthorpe,
   Martin Oliveira)

 - turn the dma_map_sg return value into an unsigned int (me)

 - provide a common global coherent pool іmplementation (me)

* tag 'dma-mapping-5.15' of git://git.infradead.org/users/hch/dma-mapping: (31 commits)
  hexagon: use the generic global coherent pool
  dma-mapping: make the global coherent pool conditional
  dma-mapping: add a dma_init_global_coherent helper
  dma-mapping: simplify dma_init_coherent_memory
  dma-mapping: allow using the global coherent pool for !ARM
  ARM/nommu: use the generic dma-direct code for non-coherent devices
  dma-direct: add support for dma_coherent_default_memory
  dma-mapping: return an unsigned int from dma_map_sg{,_attrs}
  dma-mapping: disallow .map_sg operations from returning zero on error
  dma-mapping: return error code from dma_dummy_map_sg()
  x86/amd_gart: don't set failed sg dma_address to DMA_MAPPING_ERROR
  x86/amd_gart: return error code from gart_map_sg()
  xen: swiotlb: return error code from xen_swiotlb_map_sg()
  parisc: return error code from .map_sg() ops
  sparc/iommu: don't set failed sg dma_address to DMA_MAPPING_ERROR
  sparc/iommu: return error codes from .map_sg() ops
  s390/pci: don't set failed sg dma_address to DMA_MAPPING_ERROR
  s390/pci: return error code from s390_dma_map_sg()
  powerpc/iommu: don't set failed sg dma_address to DMA_MAPPING_ERROR
  powerpc/iommu: return error code from .map_sg() ops
  ...
2021-09-02 10:32:06 -07:00
Daniel Borkmann 49ca615320 bpf: Relicense disassembler as GPL-2.0-only OR BSD-2-Clause
Some time ago we dual-licensed both libbpf and bpftool through commits
1bc38b8ff6 ("libbpf: relicense libbpf as LGPL-2.1 OR BSD-2-Clause")
and 907b223651 ("tools: bpftool: dual license all files"). The latter
missed the disasm.{c,h} which we pull in via kernel/bpf/ such that we
have a single source for verifier as well as bpftool asm dumping, see
also f4ac7e0b5c ("bpf: move instruction printing into a separate file").
It is currently GPL-2.0-only and missed the conversion in 907b223651,
therefore relicense the two as GPL-2.0-only OR BSD-2-Clause as well.

Spotted-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Brendan Jackman <jackmanb@google.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Xu Kuohai <xukuohai@huawei.com>
Acked-by: Edward Cree <ecree.xilinx@gmail.com>
2021-09-02 14:49:23 +02:00
Linus Torvalds df43d90382 printk changes for 5.15
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmEt+hwACgkQUqAMR0iA
 lPLppBAAiyrUNVmqqtdww+IJajEs1uD/4FqPsysHRwroHBFymJeQG1XCwUpDZ7jj
 6gXT0chxyjQE18gT/W9nf+PSmA9XvIVA1WSR+WCECTNW3YoZXqtgwiHfgnitXYku
 HlmoZLthYeuoXWw2wn+hVLfTRh6VcPHYEaC21jXrs6B1pOXHbvjJ5eTLHlX9oCfL
 UKSK+jFTHAJcn/GskRzviBe0Hpe8fqnkRol2XX13ltxqtQ73MjaGNu7imEH6/Pa7
 /MHXWtuWJtOvuYz17aztQP4Qwh1xy+kakMy3aHucdlxRBTP4PTzzTuQI3L/RYi6l
 +ttD7OHdRwqFAauBLY3bq3uJjYb5v/64ofd8DNnT2CJvtznY8wrPbTdFoSdPcL2Q
 69/opRWHcUwbU/Gt4WLtyQf3Mk0vepgMbbVg1B5SSy55atRZaXMrA2QJ/JeawZTB
 KK6D/mE7ccze/YFzsySunCUVKCm0veoNxEAcakCCZKXSbsvd1MYcIRC0e+2cv6e5
 2NEH7gL4dD+5tqu5nzvIuKDn3NrDQpbi28iUBoFbkxRgcVyvHJ9AGSa62wtb5h3D
 OgkqQMdVKBbjYNeUodPlQPzmXZDasytavyd0/BC/KENOcBvU/8gW++2UZTfsh/1A
 dLjgwFBdyJncQcCS9Abn20/EKntbIMEX8NLa97XWkA3fuzMKtak=
 =yEVq
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux

Pull printk updates from Petr Mladek:

 - Optionally, provide an index of possible printk messages via
   <debugfs>/printk/index/. It can be used when monitoring important
   kernel messages on a farm of various hosts. The monitor has to be
   updated when some messages has changed or are not longer available by
   a newly deployed kernel.

 - Add printk.console_no_auto_verbose boot parameter. It allows to
   generate crash dump even with slow consoles in a reasonable time
   frame.

 - Remove printk_safe buffers. The messages are always stored directly
   to the main logbuffer, even in NMI or recursive context. Also it
   allows to serialize syslog operations by a mutex instead of a spin
   lock.

 - Misc clean up and build fixes.

* tag 'printk-for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  printk/index: Fix -Wunused-function warning
  lib/nmi_backtrace: Serialize even messages about idle CPUs
  printk: Add printk.console_no_auto_verbose boot parameter
  printk: Remove console_silent()
  lib/test_scanf: Handle n_bits == 0 in random tests
  printk: syslog: close window between wait and read
  printk: convert @syslog_lock to mutex
  printk: remove NMI tracking
  printk: remove safe buffers
  printk: track/limit recursion
  lib/nmi_backtrace: explicitly serialize banner and regs
  printk: Move the printk() kerneldoc comment to its new home
  printk/index: Fix warning about missing prototypes
  MIPS/asm/printk: Fix build failure caused by printk
  printk: index: Add indexing support to dev_printk
  printk: Userspace format indexing support
  printk: Rework parse_prefix into printk_parse_prefix
  printk: Straighten out log_flags into printk_info_flags
  string_helpers: Escape double quotes in escape_special
  printk/console: Check consistent sequence number when handling race in console_unlock()
2021-09-01 18:41:13 -07:00
Linus Torvalds bcfeebbff3 Merge branch 'exit-cleanups-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull exit cleanups from Eric Biederman:
 "In preparation of doing something about PTRACE_EVENT_EXIT I have
  started cleaning up various pieces of code related to do_exit. Most of
  that code I did not manage to get tested and reviewed before the merge
  window opened but a handful of very useful cleanups are ready to be
  merged.

  The first change is simply the removal of the bdflush system call. The
  code has now been disabled long enough that even the oldest userspace
  working userspace setups anyone can find to test are fine with the
  bdflush system call being removed.

  Changing m68k fsp040_die to use force_sigsegv(SIGSEGV) instead of
  calling do_exit directly is interesting only in that it is nearly the
  most difficult of the incorrect uses of do_exit to remove.

  The change to the seccomp code to simply send a signal instead of
  calling do_coredump directly is a very nice little cleanup made
  possible by realizing the existing signal sending helpers were missing
  a little bit of functionality that is easy to provide"

* 'exit-cleanups-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  signal/seccomp: Dump core when there is only one live thread
  signal/seccomp: Refactor seccomp signal and coredump generation
  signal/m68k: Use force_sigsegv(SIGSEGV) in fpsp040_die
  exit/bdflush: Remove the deprecated bdflush system call
2021-09-01 14:52:05 -07:00
Linus Torvalds 48983701a1 Merge branch 'siginfo-si_trapno-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull siginfo si_trapno updates from Eric Biederman:
 "The full set of si_trapno changes was not appropriate as a fix for the
  newly added SIGTRAP TRAP_PERF, and so I postponed the rest of the
  related cleanups.

  This is the rest of the cleanups for si_trapno that reduces it from
  being a really weird arch special case that is expect to be always
  present (but isn't) on the architectures that support it to being yet
  another field in the _sigfault union of struct siginfo.

  The changes have been reviewed and marinated in linux-next. With the
  removal of this awkward special case new code (like SIGTRAP TRAP_PERF)
  that works across architectures should be easier to write and
  maintain"

* 'siginfo-si_trapno-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  signal: Rename SIL_PERF_EVENT SIL_FAULT_PERF_EVENT for consistency
  signal: Verify the alignment and size of siginfo_t
  signal: Remove the generic __ARCH_SI_TRAPNO support
  signal/alpha: si_trapno is only used with SIGFPE and SIGTRAP TRAP_UNK
  signal/sparc: si_trapno is only used with SIGILL ILL_ILLTRP
  arm64: Add compile-time asserts for siginfo_t offsets
  arm: Add compile-time asserts for siginfo_t offsets
  sparc64: Add compile-time asserts for siginfo_t offsets
2021-09-01 14:42:36 -07:00
Linus Torvalds 9e9fb7655e Core:
- Enable memcg accounting for various networking objects.
 
 BPF:
 
  - Introduce bpf timers.
 
  - Add perf link and opaque bpf_cookie which the program can read
    out again, to be used in libbpf-based USDT library.
 
  - Add bpf_task_pt_regs() helper to access user space pt_regs
    in kprobes, to help user space stack unwinding.
 
  - Add support for UNIX sockets for BPF sockmap.
 
  - Extend BPF iterator support for UNIX domain sockets.
 
  - Allow BPF TCP congestion control progs and bpf iterators to call
    bpf_setsockopt(), e.g. to switch to another congestion control
    algorithm.
 
 Protocols:
 
  - Support IOAM Pre-allocated Trace with IPv6.
 
  - Support Management Component Transport Protocol.
 
  - bridge: multicast: add vlan support.
 
  - netfilter: add hooks for the SRv6 lightweight tunnel driver.
 
  - tcp:
     - enable mid-stream window clamping (by user space or BPF)
     - allow data-less, empty-cookie SYN with TFO_SERVER_COOKIE_NOT_REQD
     - more accurate DSACK processing for RACK-TLP
 
  - mptcp:
     - add full mesh path manager option
     - add partial support for MP_FAIL
     - improve use of backup subflows
     - optimize option processing
 
  - af_unix: add OOB notification support.
 
  - ipv6: add IFLA_INET6_RA_MTU to expose MTU value advertised by
          the router.
 
  - mac80211: Target Wake Time support in AP mode.
 
  - can: j1939: extend UAPI to notify about RX status.
 
 Driver APIs:
 
  - Add page frag support in page pool API.
 
  - Many improvements to the DSA (distributed switch) APIs.
 
  - ethtool: extend IRQ coalesce uAPI with timer reset modes.
 
  - devlink: control which auxiliary devices are created.
 
  - Support CAN PHYs via the generic PHY subsystem.
 
  - Proper cross-chip support for tag_8021q.
 
  - Allow TX forwarding for the software bridge data path to be
    offloaded to capable devices.
 
 Drivers:
 
  - veth: more flexible channels number configuration.
 
  - openvswitch: introduce per-cpu upcall dispatch.
 
  - Add internet mix (IMIX) mode to pktgen.
 
  - Transparently handle XDP operations in the bonding driver.
 
  - Add LiteETH network driver.
 
  - Renesas (ravb):
    - support Gigabit Ethernet IP
 
  - NXP Ethernet switch (sja1105)
    - fast aging support
    - support for "H" switch topologies
    - traffic termination for ports under VLAN-aware bridge
 
  - Intel 1G Ethernet
     - support getcrosststamp() with PCIe PTM (Precision Time
       Measurement) for better time sync
     - support Credit-Based Shaper (CBS) offload, enabling HW traffic
       prioritization and bandwidth reservation
 
  - Broadcom Ethernet (bnxt)
     - support pulse-per-second output
     - support larger Rx rings
 
  - Mellanox Ethernet (mlx5)
     - support ethtool RSS contexts and MQPRIO channel mode
     - support LAG offload with bridging
     - support devlink rate limit API
     - support packet sampling on tunnels
 
  - Huawei Ethernet (hns3):
     - basic devlink support
     - add extended IRQ coalescing support
     - report extended link state
 
  - Netronome Ethernet (nfp):
     - add conntrack offload support
 
  - Broadcom WiFi (brcmfmac):
     - add WPA3 Personal with FT to supported cipher suites
     - support 43752 SDIO device
 
  - Intel WiFi (iwlwifi):
     - support scanning hidden 6GHz networks
     - support for a new hardware family (Bz)
 
  - Xen pv driver:
     - harden netfront against malicious backends
 
  - Qualcomm mobile
     - ipa: refactor power management and enable automatic suspend
     - mhi: move MBIM to WWAN subsystem interfaces
 
 Refactor:
 
  - Ambient BPF run context and cgroup storage cleanup.
 
  - Compat rework for ndo_ioctl.
 
 Old code removal:
 
  - prism54 remove the obsoleted driver, deprecated by the p54 driver.
 
  - wan: remove sbni/granch driver.
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmEukBYACgkQMUZtbf5S
 IrsyHA//TO8dw18NYts4n9LmlJT2naJ7yBUUSSXK/M+DtW0MQ9nnHhqzPm5uJdRl
 IgQTNJrW3dYzRwgqaWZqEwO1t5/FI+f87ND1Nsekg7x9tF66a6ov5WxU26TwwSba
 U+si/inQ/4chuQ+LxMQobqCDxaLE46I2dIoRl+YfndJ24DRzYSwAEYIPPbSdfyU+
 +/l+3s4GaxO4k/hLciPAiOniyxLoUNiGUTNh+2yqRBXelSRJRKVnl+V22ANFrxRW
 nTEiplfVKhlPU1e4iLuRtaxDDiePHhw9I3j/lMHhfeFU2P/gKJIvz4QpGV0CAZg2
 1VvDU32WEx1GQLXJbKm0KwoNRUq1QSjOyyFti+BO7ugGaYAR4gKhShOqlSYLzUtB
 tbtzQhSNLWOGqgmSJOztZb5kFDm2EdRSll5/lP2uyFlPkIsIp0QbscJVzNTnS74b
 Xz15ZOw41Z4TfWPEMWgfrx6Zkm7pPWkly+7WfUkPcHa1gftNz6tzXXxSXcXIBPdi
 yQ5JCzzxrM5573YHuk5YedwZpn6PiAt4A/muFGk9C6aXP60TQAOS/ppaUzZdnk4D
 NfOk9mj06WEULjYjPcKEuT3GGWE6kmjb8Pu0QZWKOchv7vr6oZly1EkVZqYlXELP
 AfhcrFeuufie8mqm0jdb4LnYaAnqyLzlb1J4Zxh9F+/IX7G3yoc=
 =JDGD
 -----END PGP SIGNATURE-----

Merge tag 'net-next-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core:

   - Enable memcg accounting for various networking objects.

  BPF:

   - Introduce bpf timers.

   - Add perf link and opaque bpf_cookie which the program can read out
     again, to be used in libbpf-based USDT library.

   - Add bpf_task_pt_regs() helper to access user space pt_regs in
     kprobes, to help user space stack unwinding.

   - Add support for UNIX sockets for BPF sockmap.

   - Extend BPF iterator support for UNIX domain sockets.

   - Allow BPF TCP congestion control progs and bpf iterators to call
     bpf_setsockopt(), e.g. to switch to another congestion control
     algorithm.

  Protocols:

   - Support IOAM Pre-allocated Trace with IPv6.

   - Support Management Component Transport Protocol.

   - bridge: multicast: add vlan support.

   - netfilter: add hooks for the SRv6 lightweight tunnel driver.

   - tcp:
       - enable mid-stream window clamping (by user space or BPF)
       - allow data-less, empty-cookie SYN with TFO_SERVER_COOKIE_NOT_REQD
       - more accurate DSACK processing for RACK-TLP

   - mptcp:
       - add full mesh path manager option
       - add partial support for MP_FAIL
       - improve use of backup subflows
       - optimize option processing

   - af_unix: add OOB notification support.

   - ipv6: add IFLA_INET6_RA_MTU to expose MTU value advertised by the
     router.

   - mac80211: Target Wake Time support in AP mode.

   - can: j1939: extend UAPI to notify about RX status.

  Driver APIs:

   - Add page frag support in page pool API.

   - Many improvements to the DSA (distributed switch) APIs.

   - ethtool: extend IRQ coalesce uAPI with timer reset modes.

   - devlink: control which auxiliary devices are created.

   - Support CAN PHYs via the generic PHY subsystem.

   - Proper cross-chip support for tag_8021q.

   - Allow TX forwarding for the software bridge data path to be
     offloaded to capable devices.

  Drivers:

   - veth: more flexible channels number configuration.

   - openvswitch: introduce per-cpu upcall dispatch.

   - Add internet mix (IMIX) mode to pktgen.

   - Transparently handle XDP operations in the bonding driver.

   - Add LiteETH network driver.

   - Renesas (ravb):
       - support Gigabit Ethernet IP

   - NXP Ethernet switch (sja1105):
       - fast aging support
       - support for "H" switch topologies
       - traffic termination for ports under VLAN-aware bridge

   - Intel 1G Ethernet
       - support getcrosststamp() with PCIe PTM (Precision Time
         Measurement) for better time sync
       - support Credit-Based Shaper (CBS) offload, enabling HW traffic
         prioritization and bandwidth reservation

   - Broadcom Ethernet (bnxt)
       - support pulse-per-second output
       - support larger Rx rings

   - Mellanox Ethernet (mlx5)
       - support ethtool RSS contexts and MQPRIO channel mode
       - support LAG offload with bridging
       - support devlink rate limit API
       - support packet sampling on tunnels

   - Huawei Ethernet (hns3):
       - basic devlink support
       - add extended IRQ coalescing support
       - report extended link state

   - Netronome Ethernet (nfp):
       - add conntrack offload support

   - Broadcom WiFi (brcmfmac):
       - add WPA3 Personal with FT to supported cipher suites
       - support 43752 SDIO device

   - Intel WiFi (iwlwifi):
       - support scanning hidden 6GHz networks
       - support for a new hardware family (Bz)

   - Xen pv driver:
       - harden netfront against malicious backends

   - Qualcomm mobile
       - ipa: refactor power management and enable automatic suspend
       - mhi: move MBIM to WWAN subsystem interfaces

  Refactor:

   - Ambient BPF run context and cgroup storage cleanup.

   - Compat rework for ndo_ioctl.

  Old code removal:

   - prism54 remove the obsoleted driver, deprecated by the p54 driver.

   - wan: remove sbni/granch driver"

* tag 'net-next-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1715 commits)
  net: Add depends on OF_NET for LiteX's LiteETH
  ipv6: seg6: remove duplicated include
  net: hns3: remove unnecessary spaces
  net: hns3: add some required spaces
  net: hns3: clean up a type mismatch warning
  net: hns3: refine function hns3_set_default_feature()
  ipv6: remove duplicated 'net/lwtunnel.h' include
  net: w5100: check return value after calling platform_get_resource()
  net/mlxbf_gige: Make use of devm_platform_ioremap_resourcexxx()
  net: mdio: mscc-miim: Make use of the helper function devm_platform_ioremap_resource()
  net: mdio-ipq4019: Make use of devm_platform_ioremap_resource()
  fou: remove sparse errors
  ipv4: fix endianness issue in inet_rtm_getroute_build_skb()
  octeontx2-af: Set proper errorcode for IPv4 checksum errors
  octeontx2-af: Fix static code analyzer reported issues
  octeontx2-af: Fix mailbox errors in nix_rss_flowkey_cfg
  octeontx2-af: Fix loop in free and unmap counter
  af_unix: fix potential NULL deref in unix_dgram_connect()
  dpaa2-eth: Replace strlcpy with strscpy
  octeontx2-af: Use NDC TX for transmit packet data
  ...
2021-08-31 16:43:06 -07:00
Linus Torvalds 86ac54e79f Merge branch 'for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
 "There is a long-standing subtle destroy_workqueue() bug where a
  workqueue can be destroyed while internal work items used for flushing
  are still in flight. Lai fixed it by assigning a flush color to the
  internal work items so that they are correctly waited for during
  destruction.

  Other than that, all are minor cleanups"

* 'for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Remove unused WORK_NO_COLOR
  workqueue: Assign a color to barrier work items
  workqueue: Mark barrier work with WORK_STRUCT_INACTIVE
  workqueue: Change the code of calculating work_flags in insert_wq_barrier()
  workqueue: Change arguement of pwq_dec_nr_in_flight()
  workqueue: Rename "delayed" (delayed by active management) to "inactive"
  workqueue: Replace deprecated CPU-hotplug functions.
  workqueue: Replace deprecated ida_simple_*() with ida_alloc()/ida_free()
  workqueue: Fix typo in comments
  workqueue: Fix possible memory leaks in wq_numa_init()
2021-08-31 15:53:20 -07:00
Linus Torvalds 69dc8010b8 Merge branch 'for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "Two cpuset behavior changes:

   - cpuset on cgroup2 is changed to enable memory migration based on
     nodemask by default.

   - A notification is generated when cpuset partition state changes.

  All other patches are minor fixes and cleanups"

* 'for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: Avoid compiler warnings with no subsystems
  cgroup/cpuset: Avoid memory migration when nodemasks match
  cgroup/cpuset: Enable memory migration for cpuset v2
  cgroup/cpuset: Enable event notification when partition state changes
  cgroup: cgroup-v1: clean up kernel-doc notation
  cgroup: Replace deprecated CPU-hotplug functions.
  cgroup/cpuset: Fix violation of cpuset locking rule
  cgroup/cpuset: Fix a partition bug with hotplug
  cgroup/cpuset: Miscellaneous code cleanup
  cgroup: remove cgroup_mount from comments
2021-08-31 15:49:04 -07:00
Linus Torvalds 5cbba60596 Power management updates for 5.15-rc1
- Address 3 PCI device power management issues (Rafael Wysocki).
 
  - Add Power Limit4 support for Alder Lake to the Intel RAPL power
    capping driver (Sumeet Pawnikar).
 
  - Add HWP guaranteed performance change notification support to
    the intel_pstate driver (Srinivas Pandruvada).
 
  - Replace deprecated CPU-hotplug functions in code related to power
    management (Sebastian Andrzej Siewior).
 
  - Update CPU PM notifiers to use raw spinlocks (Valentin Schneider).
 
  - Add support for 'required-opps' DT property to the generic power
    domains (genpd) framework and use this property for I2C on ARM64
    sc7180 (Rajendra Nayak).
 
  - Fix Kconfig issue related to genpd (Geert Uytterhoeven).
 
  - Increase energy calculation precision in the Energy Model (Lukasz
    Luba).
 
  - Fix kobject deletion in the exit code of the schedutil cpufreq
    governor (Kevin Hao).
 
  - Unmark some functions as kernel-doc in the PM core to avoid
    false-positive documentation build warnings (Randy Dunlap).
 
  - Check RTC features instead of ops in suspend_test Alexandre
    Belloni).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmEtIvYSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxnQ8QAK2QCyfFAdrPE3zn+XdTcVC8rpYhL0d2
 3YpbS2obZ4lOHsrNy7SrrU5NdgvCFPHl14Py5rxu4QZW8EF5km6SqJYj9MhrDsEm
 wJ/ZM3Zbaj98MAls7pulHxabMo8hvGfMSmcNOFfgh7yqCm5eu8gNnt/LbLaB9IvI
 v46ZGNvnpbtkWgk4Eaq+v3Y5/+4CoUfuAFn7K21SHNcgzDbhE3Ii6cam/rwPeCdn
 a+LcDLzeDQpnnYKukx7Zyk7Z+FblXWHWZ/ClLcR8JgYYSrbrXxSLtRdM5EtLje2J
 ZnqFKqmlMcq2OhTjBSoHSJjiOkxyz5eWWrEf7d1/oH19WSW+rv5VfZUXAkYtfGKo
 OJChuIomrLNqJhAwTnxG3fnVh2NMhLwDVEjkFgvej4shCZXhoVgWu39mWnbyTxV3
 57adiuVUv6AboH6Lyvi/yzV7sWmZbL/U/YVvt+Z9SEh0syy6YBb87bW3Mr1qnLeG
 QF4AmF553qzHwd9uxUoFfiUoBxH1GvNmOWtx6lYKhy0lm3AwvW1XHKqsDG4KbVJg
 zzok7J2yv1/1rd8dJ+8tCwyyC3WJjNjtq/ULOltugwQpqVK4PCwWAUmjJaTP3Gpp
 ZH/bGuLSQWxxiinwKCkFSxwbewJu+AvtoXYKBfk/Gf5O6GyUyryBbBMO9dsqP27A
 Zg3P4+ejS7RN
 =sx6G
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "These address some PCI device power management issues, add new
  hardware support to the RAPL power capping driver, add HWP guaranteed
  performance change notification support to the intel_pstate driver,
  replace deprecated CPU-hotplug functions in a few places, update CPU
  PM notifiers to use raw spinlocks, update the PM domains framework
  (new DT property support, Kconfig fix), do a couple of cleanups in
  code related to system sleep, and improve the energy model and the
  schedutil cpufreq governor.

  Specifics:

   - Address 3 PCI device power management issues (Rafael Wysocki).

   - Add Power Limit4 support for Alder Lake to the Intel RAPL power
     capping driver (Sumeet Pawnikar).

   - Add HWP guaranteed performance change notification support to the
     intel_pstate driver (Srinivas Pandruvada).

   - Replace deprecated CPU-hotplug functions in code related to power
     management (Sebastian Andrzej Siewior).

   - Update CPU PM notifiers to use raw spinlocks (Valentin Schneider).

   - Add support for 'required-opps' DT property to the generic power
     domains (genpd) framework and use this property for I2C on ARM64
     sc7180 (Rajendra Nayak).

   - Fix Kconfig issue related to genpd (Geert Uytterhoeven).

   - Increase energy calculation precision in the Energy Model (Lukasz
     Luba).

   - Fix kobject deletion in the exit code of the schedutil cpufreq
     governor (Kevin Hao).

   - Unmark some functions as kernel-doc in the PM core to avoid
     false-positive documentation build warnings (Randy Dunlap).

   - Check RTC features instead of ops in suspend_test Alexandre
     Belloni)"

* tag 'pm-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: domains: Fix domain attach for CONFIG_PM_OPP=n
  powercap: Add Power Limit4 support for Alder Lake SoC
  cpufreq: intel_pstate: Process HWP Guaranteed change notification
  thermal: intel: Allow processing of HWP interrupt
  notifier: Remove atomic_notifier_call_chain_robust()
  PM: cpu: Make notifier chain use a raw_spinlock_t
  PM: sleep: unmark 'state' functions as kernel-doc
  arm64: dts: sc7180: Add required-opps for i2c
  PM: domains: Add support for 'required-opps' to set default perf state
  opp: Don't print an error if required-opps is missing
  cpufreq: schedutil: Use kobject release() method to free sugov_tunables
  PM: EM: Increase energy calculation precision
  PM: sleep: check RTC features instead of ops in suspend_test
  PM: sleep: s2idle: Replace deprecated CPU-hotplug functions
  cpufreq: Replace deprecated CPU-hotplug functions
  powercap: intel_rapl: Replace deprecated CPU-hotplug functions
  PCI: PM: Enable PME if it can be signaled from D3cold
  PCI: PM: Avoid forcing PCI_D0 for wakeup reasons inconsistently
  PCI: Use pci_update_current_state() in pci_enable_device_flags()
2021-08-31 13:21:58 -07:00
Linus Torvalds 8e0cd9525c Merge tag 'audit-pr-20210830' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:
 "Two patches in the audit pull request for v5.15; one is trivial
  ("header protection") but the second is a real patch that fixes a
  refcounting problem.

  The refcount fix normally would have been sent up during the -rcX
  cycle, but since we merged it less than a week before v5.14 proper I
  felt it was better to wait for the merge window to open; the patch is
  marked with the usual -stable markings"

* tag 'audit-pr-20210830' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: move put_tree() to avoid trim_trees refcount underflow and UAF
  audit: add header protection to kernel/audit.h
2021-08-31 12:55:50 -07:00
Linus Torvalds e55f0c439a kernel.sys.v5.15
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYS3lRQAKCRCRxhvAZXjc
 oqH9AP999iWN7nOOr4QpnQZVMEbwYlZksdjJso0i2Nd87rNMWQEAgKYsnm00dlLm
 uV/X21a9W6RJYgOGP4+BY4DAyVezpgk=
 =vSMk
 -----END PGP SIGNATURE-----

Merge tag 'kernel.sys.v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull set_user()  update from Christian Brauner:
 "This contains a single fix to set_user() which aligns permission
  checks with the corresponding fork() codepath. No one involved in this
  could come up with a reason for the difference.

  A capable caller can already circumvent the check when they fork where
  the permission checks are already for the relevant capabilities in
  addition to also allowing to exceed nproc when it is the init user.

  So apply the same logic to set_user()"

* tag 'kernel.sys.v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  set_user: add capability check when rlimit(RLIMIT_NPROC) exceeds
2021-08-31 12:16:05 -07:00
Claire Chang f3c4b1341e swiotlb: use depends on for DMA_RESTRICTED_POOL
Use depends on instead of select for DMA_RESTRICTED_POOL; otherwise it
will make SWIOTLB user configurable and cause compile errors for some
arch (e.g. mips).

Fixes: 0b84e4f8b7 ("swiotlb: Add restricted DMA pool initialization")
Acked-by: Will Deacon <will@kernel.org>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Claire Chang <tientzu@chromium.org>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2021-08-31 14:48:14 -04:00
Linus Torvalds 8bda955776 New features:
- Support for server-side disconnect injection via debugfs
 - Protocol definitions for new RPC_AUTH_TLS authentication flavor
 
 Performance improvements:
 - Reduce page allocator traffic in the NFSD splice read actor
 - Reduce CPU utilization in svcrdma's Send completion handler
 
 Notable bug fixes:
 - Stabilize lockd operation when re-exporting NFS mounts
 - Fix the use of %.*s in NFSD tracepoints
 - Fix /proc/sys/fs/nfs/nsm_use_hostnames
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmEqq0AACgkQM2qzM29m
 f5dYig/5AaPN2BWYf4D1VkrAS3+zGS+3IN23WVgpbA54jgfjPEH+Aa00YhEQQa0j
 Y5u/jE5g/tWvenDefq5BmvdRfZMWCVc2JkngctOSflhaREUWK+HgCkH+5DQs6zUM
 rbX7qy0v6wJnEMSlwCKJ2AuZbYw7Bsg2nvOgEbb718/ent3umeoXEK09x3HTWLEp
 eVcMU5uicB5wRRPpROYG792oWzUScQ8kyiRCKJfQDoR7bINhBeVHObAIFMBo1UaH
 x9CMX4RlPYGmoMYUc+AqcOM7hizucHpXqM1r3oVjQ7FyI+pmDLuLL/3OTjtRUX7+
 nYLqNW/PijH9PjFe4BPjGHAUQfKiTIXANAe8VdjQj70D40jYkP+jQ9SPdV+pEgi4
 U4azfK3S+85/bRYYq/1alcLiP1+6dgcL++rVvnKESTH9NRgNoEw2WZHeKxXiYaxU
 p7oOC4XdnYDwcz/3QVWa0sK2kA5IJHzOsCQR7OilD09NAJ+AbJTAp0H3xFXTllzb
 AV2CAEBVZlP+pZYOehuVnKpZPa7YAWx92wRK2anbRUMZN3lF1wWBEOTd6KweIpTx
 l2GJSf3GWBqL1x9PjSet/cBusxYjTA+S1hE7KMrsNPhzbvpIgAZEtSqOfn9apDCV
 uAFIN2DSiHm3Tv0aFSJWo+CMyKkyktuiS8JFKaFdzCp9NtsBM2M=
 =TGkK
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd updates from Chuck Lever:
 "New features:

   - Support for server-side disconnect injection via debugfs

   - Protocol definitions for new RPC_AUTH_TLS authentication flavor

  Performance improvements:

   - Reduce page allocator traffic in the NFSD splice read actor

   - Reduce CPU utilization in svcrdma's Send completion handler

  Notable bug fixes:

   - Stabilize lockd operation when re-exporting NFS mounts

   - Fix the use of %.*s in NFSD tracepoints

   - Fix /proc/sys/fs/nfs/nsm_use_hostnames"

* tag 'nfsd-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (31 commits)
  nfsd: fix crash on LOCKT on reexported NFSv3
  nfs: don't allow reexport reclaims
  lockd: don't attempt blocking locks on nfs reexports
  nfs: don't atempt blocking locks on nfs reexports
  Keep read and write fds with each nlm_file
  lockd: update nlm_lookup_file reexport comment
  nlm: minor refactoring
  nlm: minor nlm_lookup_file argument change
  lockd: lockd server-side shouldn't set fl_ops
  SUNRPC: Add documentation for the fail_sunrpc/ directory
  SUNRPC: Server-side disconnect injection
  SUNRPC: Move client-side disconnect injection
  SUNRPC: Add a /sys/kernel/debug/fail_sunrpc/ directory
  svcrdma: xpt_bc_xprt is already clear in __svc_rdma_free()
  nfsd4: Fix forced-expiry locking
  rpc: fix gss_svc_init cleanup on failure
  SUNRPC: Add RPC_AUTH_TLS protocol numbers
  lockd: change the proc_handler for nsm_use_hostnames
  sysctl: introduce new proc handler proc_dobool
  SUNRPC: Fix a NULL pointer deref in trace_svc_stats_latency()
  ...
2021-08-31 10:57:06 -07:00
Linus Torvalds c547d89a9a for-5.15/io_uring-2021-08-30
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEs7tIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpoR2EACdOPj0tivXWufgFnQyQYPpX4/Qe3lNw608
 MLJ9/zshPFe5kx+SnHzT3UucHd2LO4C68pNVEHdHoi1gqMnMe9+Du82LJlo+Cx9i
 53yiWxNiY7er3t3lC2nvaPF0BPKiaUKaRzqugOTfdWmKLP3MyYyygEBrRDkaK1S1
 BjVq2ewmKTYf63gHkluHeRTav9KxcLWvSqgFUC8Y0mNhazTSdzGB6MAPFodpsuj1
 Vv8ytCiagp9Gi0AibvS2mZvV/WxQtFP7qBofbnG3KcgKgOU+XKTJCH+cU6E/3J9Q
 nPy1loh2TISOtYAz2scypAbwsK4FWeAHg1SaIj/RtUGKG7zpVU5u97CPZ8+UfHzu
 CuR3a1o36Cck8+ZdZIjtRZfvQGog0Dh5u4ZQ4dRwFQd6FiVxdO8LHkegqkV6PStc
 dVrHSo5kUE5hGT8ed1YFfuOSJDZ6w0/LleZGMdU4pRGGs8wqkerZlfLL4ustSfLk
 AcS+azmG2f3iI5iadnxOUNeOT2lmE84fvvLyH2krfsA3AtX0CtHXQcLYAguRAIwg
 gnvYf70JOya6Lb/hbXUi8d8h3uXxeFsoitz1QyasqAEoJoY05EW+l94NbiEzXwol
 mKrVfZCk+wuhw3npbVCqK7nkepvBWso1qTip//T0HDAaKaZnZcmhobUn5SwI+QPx
 fcgAo6iw8g==
 =WBzF
 -----END PGP SIGNATURE-----

Merge tag 'for-5.15/io_uring-2021-08-30' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:

 - cancellation cleanups (Hao, Pavel)

 - io-wq accounting cleanup (Hao)

 - io_uring submit locking fix (Hao)

 - io_uring link handling fixes (Hao)

 - fixed file improvements (wangyangbo, Pavel)

 - allow updates of linked timeouts like regular timeouts (Pavel)

 - IOPOLL fix (Pavel)

 - remove batched file get optimization (Pavel)

 - improve reference handling (Pavel)

 - IRQ task_work batching (Pavel)

 - allow pure fixed file, and add support for open/accept (Pavel)

 - GFP_ATOMIC RT kernel fix

 - multiple CQ ring waiter improvement

 - funnel IRQ completions through task_work

 - add support for limiting async workers explicitly

 - add different clocksource support for timeouts

 - io-wq wakeup race fix

 - lots of cleanups and improvement (Pavel et al)

* tag 'for-5.15/io_uring-2021-08-30' of git://git.kernel.dk/linux-block: (87 commits)
  io-wq: fix wakeup race when adding new work
  io-wq: wqe and worker locks no longer need to be IRQ safe
  io-wq: check max_worker limits if a worker transitions bound state
  io_uring: allow updating linked timeouts
  io_uring: keep ltimeouts in a list
  io_uring: support CLOCK_BOOTTIME/REALTIME for timeouts
  io-wq: provide a way to limit max number of workers
  io_uring: add build check for buf_index overflows
  io_uring: clarify io_req_task_cancel() locking
  io_uring: add task-refs-get helper
  io_uring: fix failed linkchain code logic
  io_uring: remove redundant req_set_fail()
  io_uring: don't free request to slab
  io_uring: accept directly into fixed file table
  io_uring: hand code io_accept() fd installing
  io_uring: openat directly into fixed fd table
  net: add accept helper not installing fd
  io_uring: fix io_try_cancel_userdata race for iowq
  io_uring: IRQ rw completion batching
  io_uring: batch task work locking
  ...
2021-08-30 19:22:52 -07:00
Linus Torvalds 9a1d6c9e3f for-5.15/drivers-2021-08-30
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEs6LsQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqnLD/9c8v7WTLjrDR6FLD8fHUmkwk9ss6OeyYJC
 Z62QOyk6BqNOu6FAwBax9wFaboXdUqOdpJU0PVQ7WJ5wBiCQ9DAZY6T+iwW0jE79
 +iOSqdXHVLAIyIM9GplzLH5AH3tx4445bX7fRWwWX1OgmSidkAhb25FusCvpcpHx
 1k+9dSLClLeHPR6jVT3k6tHv2RzPSw+/vYOggeWYA0YYPfoCx/Ft0uwO+PjKpvLQ
 Je5jASlLGYCXazswJBZgfjbroA97EuaLOmHHIHrwhkkFsbV6ewv6mlmanbMEs4fX
 Wh+axTt8so27g6gbw31EOcGsxTi0B37Jx9MOrSla6NdJoZkFE2sn6K+D5k4oeSrg
 QgYXL00U62eSgWmgSB0f0X081cQfI+FUMe5u5S368WdrgCPfaXl11zHw8nXw8gEW
 UvqR4zr3hQd4piXsIWl2bwZrmpPBCeB8iStLq3C92RLPFT6hJO3GM/ZmwTn+0HT0
 lMXzoEdkPywkKWi8aBbSgzXiGknNl8HAYnwMhcQjiHbYQOycGkI9pigJDNY9Ox1l
 fYHFSompmJ/XK8cIiU7QIglXEXJky5jQ89Ni0ryCstOaP20tPxWtkpOCgidXfNGz
 4lmQV8D5aBTUFs6ifPjXfiXUmDiU3SaxiFhAqaEkGII9BbkrNhlibB4LBAU+toi1
 Q0yGhGR/mg==
 =4uWF
 -----END PGP SIGNATURE-----

Merge tag 'for-5.15/drivers-2021-08-30' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
 "Sitting on top of the core block changes, here are the driver changes
  for the 5.15 merge window:

   - NVMe updates via Christoph:
       - suspend improvements for devices with an HMB (Keith Busch)
       - handle double completions more gacefull (Sagi Grimberg)
       - cleanup the selects for the nvme core code a bit (Sagi Grimberg)
       - don't update queue count when failing to set io queues (Ruozhu Li)
       - various nvmet connect fixes (Amit Engel)
       - cleanup lightnvm leftovers (Keith Busch, me)
       - small cleanups (Colin Ian King, Hou Pu)
       - add tracing for the Set Features command (Hou Pu)
       - CMB sysfs cleanups (Keith Busch)
       - add a mutex_destroy call (Keith Busch)

   - remove lightnvm subsystem. It's served its purpose and ultimately
     led to zoned nvme support, we no longer need it (Christoph)

   - revert floppy O_NDELAY fix (Denis)

   - nbd fixes (Hou, Pavel, Baokun)

   - nbd locking fixes (Tetsuo)

   - nbd device removal fixes (Christoph)

   - raid10 rcu warning fix (Xiao)

   - raid1 write behind fix (Guoqing)

   - rnbd fixes (Gioh, Md Haris)

   - misc fixes (Colin)"

* tag 'for-5.15/drivers-2021-08-30' of git://git.kernel.dk/linux-block: (42 commits)
  Revert "floppy: reintroduce O_NDELAY fix"
  raid1: ensure write behind bio has less than BIO_MAX_VECS sectors
  md/raid10: Remove unnecessary rcu_dereference in raid10_handle_discard
  nbd: remove nbd->destroy_complete
  nbd: only return usable devices from nbd_find_unused
  nbd: set nbd->index before releasing nbd_index_mutex
  nbd: prevent IDR lookups from finding partially initialized devices
  nbd: reset NBD to NULL when restarting in nbd_genl_connect
  nbd: add missing locking to the nbd_dev_add error path
  nvme: remove the unused NVME_NS_* enum
  nvme: remove nvm_ndev from ns
  nvme: Have NVME_FABRICS select NVME_CORE instead of transport drivers
  block: nbd: add sanity check for first_minor
  nvmet: check that host sqsize does not exceed ctrl MQES
  nvmet: avoid duplicate qid in connect cmd
  nvmet: pass back cntlid on successful completion
  nvme-rdma: don't update queue count when failing to set io queues
  nvme-tcp: don't update queue count when failing to set io queues
  nvme-tcp: pair send_mutex init with destroy
  nvme: allow user toggling hmb usage
  ...
2021-08-30 19:01:46 -07:00
Jakub Kicinski 19a31d7921 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
bpf-next 2021-08-31

We've added 116 non-merge commits during the last 17 day(s) which contain
a total of 126 files changed, 6813 insertions(+), 4027 deletions(-).

The main changes are:

1) Add opaque bpf_cookie to perf link which the program can read out again,
   to be used in libbpf-based USDT library, from Andrii Nakryiko.

2) Add bpf_task_pt_regs() helper to access userspace pt_regs, from Daniel Xu.

3) Add support for UNIX stream type sockets for BPF sockmap, from Jiang Wang.

4) Allow BPF TCP congestion control progs to call bpf_setsockopt() e.g. to switch
   to another congestion control algorithm during init, from Martin KaFai Lau.

5) Extend BPF iterator support for UNIX domain sockets, from Kuniyuki Iwashima.

6) Allow bpf_{set,get}sockopt() calls from setsockopt progs, from Prankur Gupta.

7) Add bpf_get_netns_cookie() helper for BPF_PROG_TYPE_{SOCK_OPS,CGROUP_SOCKOPT}
   progs, from Xu Liu and Stanislav Fomichev.

8) Support for __weak typed ksyms in libbpf, from Hao Luo.

9) Shrink struct cgroup_bpf by 504 bytes through refactoring, from Dave Marchevsky.

10) Fix a smatch complaint in verifier's narrow load handling, from Andrey Ignatov.

11) Fix BPF interpreter's tail call count limit, from Daniel Borkmann.

12) Big batch of improvements to BPF selftests, from Magnus Karlsson, Li Zhijian,
    Yucong Sun, Yonghong Song, Ilya Leoshkevich, Jussi Maki, Ilya Leoshkevich, others.

13) Another big batch to revamp XDP samples in order to give them consistent look
    and feel, from Kumar Kartikeya Dwivedi.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (116 commits)
  MAINTAINERS: Remove self from powerpc BPF JIT
  selftests/bpf: Fix potential unreleased lock
  samples: bpf: Fix uninitialized variable in xdp_redirect_cpu
  selftests/bpf: Reduce more flakyness in sockmap_listen
  bpf: Fix bpf-next builds without CONFIG_BPF_EVENTS
  bpf: selftests: Add dctcp fallback test
  bpf: selftests: Add connect_to_fd_opts to network_helpers
  bpf: selftests: Add sk_state to bpf_tcp_helpers.h
  bpf: tcp: Allow bpf-tcp-cc to call bpf_(get|set)sockopt
  selftests: xsk: Preface options with opt
  selftests: xsk: Make enums lower case
  selftests: xsk: Generate packets from specification
  selftests: xsk: Generate packet directly in umem
  selftests: xsk: Simplify cleanup of ifobjects
  selftests: xsk: Decrease sending speed
  selftests: xsk: Validate tx stats on tx thread
  selftests: xsk: Simplify packet validation in xsk tests
  selftests: xsk: Rename worker_* functions that are not thread entry points
  selftests: xsk: Disassociate umem size with packets sent
  selftests: xsk: Remove end-of-test packet
  ...
====================

Link: https://lore.kernel.org/r/20210830225618.11634-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-30 16:42:47 -07:00
Linus Torvalds 8596e589b7 Updates for timekeeping, timers and related drivers:
Core code:
 
   - Cure a couple of incorrectness issues in the posix CPU timer code to
     prevent that the tick dependency for NOHZ full is kept alive for no
     reason.
 
   - Avoid expensive double reprogramming of the clockevent device in
     hrtimer_start_range_ns().
 
   - Avoid pointless SMP function calls when the clock was set to avoid
     disturbing CPUs which do not have any affected timers queued.
 
   - Make the clocksource watchdog test work correctly when CONFIG_HZ is
     less than 100.
 
 Drivers:
 
   - Prefer the ARM architected timer over the Exynos timer which is way
     more expensive to access.
 
   - Add device tree bindings for new Ingenic SoCs
 
   - The usual improvements and cleanups all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmEsnxcTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoZAmEAC0R5+9RkWAOpx0JWC7dQxFuIZoUZD1
 8Inqs0ZFLX/LNrkjbBmZ/0XFvoU38+Eqd4Gqy3I748TgzMcB/0NHUUnXaugJE35J
 UzqdHkzhSReVinDHgRcIGMxkdj+JGomhlOM7RjTcdTixUWBfi1iXJBsit/tIWYq9
 R9r/cthpEQzFu7BCZCdAsgRBaLNinCH9cCP0qXNS3hDwHFtziszjBhIrElAaEK4J
 IqrW/tsoxOlX/yQ/5TI8+E9DKgrr4pf+uNVjMJC0QBDodJrXRrkez9lFg6zJJggw
 adtzSFgL/OrbwsuEKAREpkSTwWSdQGdtLy2i9fx16jmh78YiTilHYe2A1ZD5v3zd
 dxfHUexnsgXcn4Im9w+sxLGxf2RQ6SsfVgd+R0lyOKLBFltnmbQWcvVl/6ZUa4Cc
 je+yuh9DTr0ksUiCnm0spP8AMtsSaHKJUP+MqHbXgo83AutVIFXr4zyQimM1lkY1
 TPeXtbKlhd76jDgdcKU85tiyLsJrIImEJDzLtOEvSAk37yO6S9PHzLUMbg9Yp/vp
 Li4aUaMNmytCJTvKeSu6Wivzmyxqf4zSJus/fWLIJJjg/NSlNNhFDc0vkKPzaxM7
 mg2VIcPrQKzQ1ZAap1kTdj1JNDWuANlV6pjE+zwaJbMtdHhvELyWnqC6EA05MsVj
 Txw2VVFaQcqKXA==
 =eP5w
 -----END PGP SIGNATURE-----

Merge tag 'timers-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer updates from Thomas Gleixner:
 "Updates for timekeeping, timers and related drivers:

  Core code:

   - Cure a couple of correctness issues in the posix CPU timer code to
     prevent that the tick dependency for NOHZ full is kept alive for no
     reason.

   - Avoid expensive double reprogramming of the clockevent device in
     hrtimer_start_range_ns().

   - Avoid pointless SMP function calls when the clock was set to avoid
     disturbing CPUs which do not have any affected timers queued.

   - Make the clocksource watchdog test work correctly when CONFIG_HZ is
     less than 100.

  Drivers:

   - Prefer the ARM architected timer over the Exynos timer which is way
     more expensive to access.

   - Add device tree bindings for new Ingenic SoCs

   - The usual improvements and cleanups all over the place"

* tag 'timers-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
  clocksource: Make clocksource watchdog test safe for slow-HZ systems
  dt-bindings: timer: Add ABIs for new Ingenic SoCs
  clocksource/drivers/fttmr010: Pass around less pointers
  clocksource/drivers/mediatek: Optimize systimer irq clear flow on shutdown
  clocksource/drivers/ingenic: Use bitfield macro helpers
  clocksource/drivers/sh_cmt: Fix wrong setting if don't request IRQ for clock source channel
  dt-bindings: timer: convert rockchip,rk-timer.txt to YAML
  clocksource/drivers/exynos_mct: Mark MCT device as CLOCK_EVT_FEAT_PERCPU
  clocksource/drivers/exynos_mct: Prioritise Arm arch timer on arm64
  hrtimer: Unbreak hrtimer_force_reprogram()
  hrtimer: Use raw_cpu_ptr() in clock_was_set()
  hrtimer: Avoid more SMP function calls in clock_was_set()
  hrtimer: Avoid unnecessary SMP function calls in clock_was_set()
  hrtimer: Add bases argument to clock_was_set()
  time/timekeeping: Avoid invoking clock_was_set() twice
  timekeeping: Distangle resume and clock-was-set events
  timerfd: Provide timerfd_resume()
  hrtimer: Force clock_was_set() handling for the HIGHRES=n, NOHZ=y case
  hrtimer: Ensure timerfd notification for HIGHRES=n
  hrtimer: Consolidate reprogramming code
  ...
2021-08-30 15:31:33 -07:00
Linus Torvalds 7d6e3fa87e Updates to the interrupt core and driver subsystems:
Core changes:
 
    - The usual set of small fixes and improvements all over the place, but nothing
      outstanding
 
 MSI changes:
 
    - Further consolidation of the PCI/MSI interrupt chip code
 
    - Make MSI sysfs code independent of PCI/MSI and expose the MSI interrupts
      of platform devices in the same way as PCI exposes them.
 
 Driver changes:
 
    - Support for ARM GICv3 EPPI partitions
 
    - Treewide conversion to generic_handle_domain_irq() for all chained
      interrupt controllers
 
    - Conversion to bitmap_zalloc() throughout the irq chip drivers
 
    - The usual set of small fixes and improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmEsnpsTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoS+/EACQdpRkzl3IDIYqThxVZ8KQzp2rKKVn
 qisAQiWg/6koNJx/yYy62KNAUyKjCIObNtRnWi7OAOx6OvNtQTD2WOLAwkh3Pgw1
 8ePYYl55k+yCs8VoITsZM9jYeO+Tk878pU2A6R943zR+g6G7bskGJrxEyZ9TbzIe
 qKfusNKnRY9/jMQaRALUAAtA9VIVR867GqORX5X8hKz8yE2rqlpb4y+1CFba5BTV
 Vlxw7cIXvXBn7BKAom5diRqEGDNJEbX+56jJ7yDZshgLo7m11D7QLw72kmb6TNVC
 g7PchvFi4afpc1ifEAAp0tk4RiSIAQ91nS3n0+jLcLbodOjIkl14eY02ZCJGAP29
 uslyzUbmy1wgejG6CA63JtZ4MYdrf/OSMGuoN78qnOKYcIsWFzOvlJmBWWNW34qW
 LCaUF9QdJ/slXu6B4vIx30GfN9q4myml8bFUobE5q9mBRrEk4R0B7iyBvPu1xKYr
 ZEan67prI5VEu+afJGpp4r294m4HNVkMLfl3nYmE5+y4MoLeMNKDY3IPTvI9iP4G
 kaFgoPvQo23WnuclNYpJ+CaA4aRASlB2nTY+oAXIYfehbey9EW5vq4/EK864ek6w
 oyUTepxxNhE81tG2jpQbf2tR4COsEHy986clxqPP4AvsZXcbypCw8O2FcflpQbHO
 5DLEAfTmp7cziQ==
 =qyll
 -----END PGP SIGNATURE-----

Merge tag 'irq-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq updates from Thomas Gleixner:
 "Updates to the interrupt core and driver subsystems:

  Core changes:

   - The usual set of small fixes and improvements all over the place,
     but nothing stands out

  MSI changes:

   - Further consolidation of the PCI/MSI interrupt chip code

   - Make MSI sysfs code independent of PCI/MSI and expose the MSI
     interrupts of platform devices in the same way as PCI exposes them.

  Driver changes:

   - Support for ARM GICv3 EPPI partitions

   - Treewide conversion to generic_handle_domain_irq() for all chained
     interrupt controllers

   - Conversion to bitmap_zalloc() throughout the irq chip drivers

   - The usual set of small fixes and improvements"

* tag 'irq-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
  platform-msi: Add ABI to show msi_irqs of platform devices
  genirq/msi: Move MSI sysfs handling from PCI to MSI core
  genirq/cpuhotplug: Demote debug printk to KERN_DEBUG
  irqchip/qcom-pdc: Trim unused levels of the interrupt hierarchy
  irqdomain: Export irq_domain_disconnect_hierarchy()
  irqchip/gic-v3: Fix priority comparison when non-secure priorities are used
  irqchip/apple-aic: Fix irq_disable from within irq handlers
  pinctrl/rockchip: drop the gpio related codes
  gpio/rockchip: drop irq_gc_lock/irq_gc_unlock for irq set type
  gpio/rockchip: support next version gpio controller
  gpio/rockchip: use struct rockchip_gpio_regs for gpio controller
  gpio/rockchip: add driver for rockchip gpio
  dt-bindings: gpio: change items restriction of clock for rockchip,gpio-bank
  pinctrl/rockchip: add pinctrl device to gpio bank struct
  pinctrl/rockchip: separate struct rockchip_pin_bank to a head file
  pinctrl/rockchip: always enable clock for gpio controller
  genirq: Fix kernel doc indentation
  EDAC/altera: Convert to generic_handle_domain_irq()
  powerpc: Bulk conversion to generic_handle_domain_irq()
  nios2: Bulk conversion to generic_handle_domain_irq()
  ...
2021-08-30 14:38:37 -07:00
Linus Torvalds e5e726f7bb Updates for locking and atomics:
The regular pile:
 
   - A few improvements to the mutex code
 
   - Documentation updates for atomics to clarify the difference between
     cmpxchg() and try_cmpxchg() and to explain the forward progress
     expectations.
 
   - Simplification of the atomics fallback generator
 
   - The addition of arch_atomic_long*() variants and generic arch_*()
     bitops based on them.
 
   - Add the missing might_sleep() invocations to the down*() operations of
     semaphores.
 
 The PREEMPT_RT locking core:
 
   - Scheduler updates to support the state preserving mechanism for
     'sleeping' spin- and rwlocks on RT. This mechanism is carefully
     preserving the state of the task when blocking on a 'sleeping' spin- or
     rwlock and takes regular wake-ups targeted at the same task into
     account. The preserved or updated (via a regular wakeup) state is
     restored when the lock has been acquired.
 
   - Restructuring of the rtmutex code so it can be utilized and extended
     for the RT specific lock variants.
 
   - Restructuring of the ww_mutex code to allow sharing of the ww_mutex
     specific functionality for rtmutex based ww_mutexes.
 
   - Header file disentangling to allow substitution of the regular lock
     implementations with the PREEMPT_RT variants without creating an
     unmaintainable #ifdef mess.
 
   - Shared base code for the PREEMPT_RT specific rw_semaphore and rwlock
     implementations. Contrary to the regular rw_semaphores and rwlocks the
     PREEMPT_RT implementation is writer unfair because it is infeasible to
     do priority inheritance on multiple readers. Experience over the years
     has shown that real-time workloads are not the typical workloads which
     are sensitive to writer starvation. The alternative solution would be
     to allow only a single reader which has been tried and discarded as it
     is a major bottleneck especially for mmap_sem. Aside of that many of
     the writer starvation critical usage sites have been converted to a
     writer side mutex/spinlock and RCU read side protections in the past
     decade so that the issue is less prominent than it used to be.
 
   - The actual rtmutex based lock substitutions for PREEMPT_RT enabled
     kernels which affect mutex, ww_mutex, rw_semaphore, spinlock_t and
     rwlock_t. The spin/rw_lock*() functions disable migration across the
     critical section to preserve the existing semantics vs. per CPU
     variables.
 
   - Rework of the futex REQUEUE_PI mechanism to handle the case of early
     wake-ups which interleave with a re-queue operation to prevent the
     situation that a task would be blocked on both the rtmutex associated
     to the outer futex and the rtmutex based hash bucket spinlock.
 
     While this situation cannot happen on !RT enabled kernels the changes
     make the underlying concurrency problems easier to understand in
     general. As a result the difference between !RT and RT kernels is
     reduced to the handling of waiting for the critical section. !RT
     kernels simply spin-wait as before and RT kernels utilize rcu_wait().
 
   - The substitution of local_lock for PREEMPT_RT with a spinlock which
     protects the critical section while staying preemptible. The CPU
     locality is established by disabling migration.
 
   The underlying concepts of this code have been in use in PREEMPT_RT for
   way more than a decade. The code has been refactored several times over
   the years and this final incarnation has been optimized once again to be
   as non-intrusive as possible, i.e. the RT specific parts are mostly
   isolated.
 
   It has been extensively tested in the 5.14-rt patch series and it has
   been verified that !RT kernels are not affected by these changes.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmEsnuMTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoaeWD/wLNMoAZXslS0prfr64ANjRgLXIqMFA
 r6xgioiwxxaxbmZ/GNPraoLC//ENo6mwobuUovq8yKljv2oBu6AmlUkBwrmMBc8Q
 nnm7jjGM3bZ1REup7rWERnjdOZfdGVSL5CUAAfthyC744XmXaepwrrrqfXG22GxJ
 QwLXBTAwXFVDxKfUjDKzEo5zgLNHRvHbzc0DpTYYn6WcuDJOmlyWnhfDTu2mNG9Z
 rqjqy+OgOUEUprQDgitk5hedfeic2kPm1mxxZrXkpkuPef5be2inQq2siC7GxR4g
 0AKeUsMFgFmSqiD4iJTALJ+8WXkgMnD9VgooeWHk4OaqZfaGzi/iwRSnrlnf7+OV
 GTmrsmX+TX/Wz2BDjB+3zylQnYqYh3quE5w4UO6uUyJXfdhlnvsjVc8bEajDFjeM
 yUapaWxdAri7k2n+vjXQthAngxtYPgXtFbZPoOl109JcDcG6jJsCdM5TdenegaRs
 WeUh05JqrH8+qI+Nwzc4rO+PmKHQ8on2wKdgLp11dviiPOf8OguH65nDQSGZ/fGv
 7cnD9A1/MUd0sdrvc52AqkIYxh+Rp9GnCs1xA82JsTXgAPcXqAWjjR2JFPHL4neV
 eW2upZekl8lMR7hkfcQbhe4MVjQIjff3iFOkQXittxMzfzFdi0tly8xB8AzpTHOx
 h91MycvmMR2zRw==
 =IEqE
 -----END PGP SIGNATURE-----

Merge tag 'locking-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking and atomics updates from Thomas Gleixner:
 "The regular pile:

   - A few improvements to the mutex code

   - Documentation updates for atomics to clarify the difference between
     cmpxchg() and try_cmpxchg() and to explain the forward progress
     expectations.

   - Simplification of the atomics fallback generator

   - The addition of arch_atomic_long*() variants and generic arch_*()
     bitops based on them.

   - Add the missing might_sleep() invocations to the down*() operations
     of semaphores.

  The PREEMPT_RT locking core:

   - Scheduler updates to support the state preserving mechanism for
     'sleeping' spin- and rwlocks on RT.

     This mechanism is carefully preserving the state of the task when
     blocking on a 'sleeping' spin- or rwlock and takes regular wake-ups
     targeted at the same task into account. The preserved or updated
     (via a regular wakeup) state is restored when the lock has been
     acquired.

   - Restructuring of the rtmutex code so it can be utilized and
     extended for the RT specific lock variants.

   - Restructuring of the ww_mutex code to allow sharing of the ww_mutex
     specific functionality for rtmutex based ww_mutexes.

   - Header file disentangling to allow substitution of the regular lock
     implementations with the PREEMPT_RT variants without creating an
     unmaintainable #ifdef mess.

   - Shared base code for the PREEMPT_RT specific rw_semaphore and
     rwlock implementations.

     Contrary to the regular rw_semaphores and rwlocks the PREEMPT_RT
     implementation is writer unfair because it is infeasible to do
     priority inheritance on multiple readers. Experience over the years
     has shown that real-time workloads are not the typical workloads
     which are sensitive to writer starvation.

     The alternative solution would be to allow only a single reader
     which has been tried and discarded as it is a major bottleneck
     especially for mmap_sem. Aside of that many of the writer
     starvation critical usage sites have been converted to a writer
     side mutex/spinlock and RCU read side protections in the past
     decade so that the issue is less prominent than it used to be.

   - The actual rtmutex based lock substitutions for PREEMPT_RT enabled
     kernels which affect mutex, ww_mutex, rw_semaphore, spinlock_t and
     rwlock_t. The spin/rw_lock*() functions disable migration across
     the critical section to preserve the existing semantics vs per-CPU
     variables.

   - Rework of the futex REQUEUE_PI mechanism to handle the case of
     early wake-ups which interleave with a re-queue operation to
     prevent the situation that a task would be blocked on both the
     rtmutex associated to the outer futex and the rtmutex based hash
     bucket spinlock.

     While this situation cannot happen on !RT enabled kernels the
     changes make the underlying concurrency problems easier to
     understand in general. As a result the difference between !RT and
     RT kernels is reduced to the handling of waiting for the critical
     section. !RT kernels simply spin-wait as before and RT kernels
     utilize rcu_wait().

   - The substitution of local_lock for PREEMPT_RT with a spinlock which
     protects the critical section while staying preemptible. The CPU
     locality is established by disabling migration.

  The underlying concepts of this code have been in use in PREEMPT_RT for
  way more than a decade. The code has been refactored several times over
  the years and this final incarnation has been optimized once again to be
  as non-intrusive as possible, i.e. the RT specific parts are mostly
  isolated.

  It has been extensively tested in the 5.14-rt patch series and it has
  been verified that !RT kernels are not affected by these changes"

* tag 'locking-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (92 commits)
  locking/rtmutex: Return success on deadlock for ww_mutex waiters
  locking/rtmutex: Prevent spurious EDEADLK return caused by ww_mutexes
  locking/rtmutex: Dequeue waiter on ww_mutex deadlock
  locking/rtmutex: Dont dereference waiter lockless
  locking/semaphore: Add might_sleep() to down_*() family
  locking/ww_mutex: Initialize waiter.ww_ctx properly
  static_call: Update API documentation
  locking/local_lock: Add PREEMPT_RT support
  locking/spinlock/rt: Prepare for RT local_lock
  locking/rtmutex: Add adaptive spinwait mechanism
  locking/rtmutex: Implement equal priority lock stealing
  preempt: Adjust PREEMPT_LOCK_OFFSET for RT
  locking/rtmutex: Prevent lockdep false positive with PI futexes
  futex: Prevent requeue_pi() lock nesting issue on RT
  futex: Simplify handle_early_requeue_pi_wakeup()
  futex: Reorder sanity checks in futex_requeue()
  futex: Clarify comment in futex_requeue()
  futex: Restructure futex_requeue()
  futex: Correct the number of requeued waiters for PI
  futex: Remove bogus condition for requeue PI
  ...
2021-08-30 14:26:36 -07:00
Linus Torvalds 08403e2174 SMP core updates:
- Replace get/put_online_cpus() in various places. The final removal will
     happen shortly before v5.15-rc1 when the rest of the patches have been
     merged.
 
   - Add debug code to help the analysis of CPU hotplug failures
 
   - A set of kernel doc updates
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmEsnvcTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoeydD/9b1eDTA/bLabrByKYCxOYOT3VvHPlB
 ik7obOtgR3xFrvINqIsLGfGIASCUFqGZuLz801+ZJSwtrhVGzZiBztyU5ZvDwTVT
 fwn+Trkis2RxbWh3T0qM+GbXJofsdbSQiO6gd/Nfn5hmSeXY/RZ118TIodCl0My9
 IcYAt4u9U0E4LEfIwOMhJCesXKgrTU9mcCpSrnfPt2q6zAMMlNQE6Ty0uzGWDKaA
 ejp7i7/K6SDwyPadWICriVNmE3WyiKOc8vCep96CHAIQgSOz5O+OXmeCJistRv+a
 cu5yxYeMCy+sYQnixfmC6VcCdWv677/d1CQRrG2ze9kHT7i/8uoJFewp6uZvbR6g
 KAufsZfYS7EaEqNWUVLiAT3cxtcjJx0lb5EL1QlaolQWNtEptXYjEd/CNuvGUt+h
 YzccIVtlqrBXjsrxkmhubZZNp35QwPhdAeMspF/xJxBztQhZCwUzD4L6DlVkH7j4
 hN62ezVzoWpmfbfyDB57DJF3sPOtCaeiyV9ZNKq9975/6BRWRSmrTPI1S2PGhXHU
 6NMKvt/Ackn6u1CjeNxq4jDuJbnMpiFXnEGG8md8ePJiF7mdAoG3EeBR6qgmkSR9
 MaR8C4hVLoDUSfXB80ef7iaIVYDdabmYCZ9JfEzmyk/j5Yt2Z1x0Mvwy63PeE+0q
 Zrf5KUtrGGANyg==
 =yqLB
 -----END PGP SIGNATURE-----

Merge tag 'smp-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull SMP core updates from Thomas Gleixner:

 - Replace get/put_online_cpus() in various places. The final removal
   will happen shortly before v5.15-rc1 when the rest of the patches
   have been merged.

 - Add debug code to help the analysis of CPU hotplug failures

 - A set of kernel doc updates

* tag 'smp-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  mm: Replace deprecated CPU-hotplug functions.
  md/raid5: Replace deprecated CPU-hotplug functions.
  Documentation: Replace deprecated CPU-hotplug functions.
  smp: Fix all kernel-doc warnings
  cpu/hotplug: Add debug printks for hotplug callback failures
  cpu/hotplug: Use DEVICE_ATTR_*() macro
  cpu/hotplug: Eliminate all kernel-doc warnings
  cpu/hotplug: Fix kernel doc warnings for __cpuhp_setup_state_cpuslocked()
  cpu/hotplug: Fix comment typo
  smpboot: Replace deprecated CPU-hotplug functions.
2021-08-30 14:10:07 -07:00
Linus Torvalds 4a2b88eb02 Perf events changes for v5.15 are:
- Add support for Intel Sapphire Rapids server CPU uncore events
  - Allow the AMD uncore driver to be built as a module
  - Misc cleanups and fixes
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmEsrWgRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iPbg/+LTi1ki2kswWF5Fmoo0IWZ1GiCkWD0rSm
 SqCJR3u/MHd7QxsfPXFuhABt7mqaC50epfim0MNC1J11wRoNy6GzM/9nICdez9CK
 c2c3jAwoEfWE7o5PTSjXolH0FFXVYwQ9WBRJTBCaZjuXUW7TOk7h9o8fXoF8SssU
 VkP9z0YKanhN480v4k7/KR20ktY6GPFFu5cxhC0wyygZXIoG6ku+nmDjvN1ipo54
 JQYQbCTz+dj/C0uZehbEoTZWx7cajAxlbq+Iyor3ND30YHLeRxWNEcq/Jzn9Szqa
 LXnIi3mg4/mHc8mtZjDHXazpMxGYI02hkBACftPb2gizR4DOwtWlw1A2oHjbfmvM
 oF29kcZmFU/Na2O5JTf0IpV4LLkt/OlrTsTcd5ROYWgmri3UV18MhaKz0R0vQNhI
 8Xx6TJVA9fFHCIONg+4yxzDkYlxHJ9Tg8yFb06M6NWOnud03xYv4FGOzi0laoums
 8XbYZUnMh8aVkXz05CYUadknu/ajMOSqAZZAstng3unazrutSCMkZ+Pzs4X+yqq5
 Zz2Tb26oi6KepLD0eQNTmo1pRwuWC/IBqVF5aKH4e4qgyta/VxWob3Qd+I7ONaCl
 6HpbaWz01Nw8U2wE4dQB0wKwIGlLE8bRQTS2QFuqaEHu0yZsQGt8zMwVWB+f93Gb
 viglPyeA/y8=
 =QrPT
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 perf event updates from Ingo Molnar:

 - Add support for Intel Sapphire Rapids server CPU uncore events

 - Allow the AMD uncore driver to be built as a module

 - Misc cleanups and fixes

* tag 'perf-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  perf/x86/amd/ibs: Add bitfield definitions in new <asm/amd-ibs.h> header
  perf/amd/uncore: Allow the driver to be built as a module
  x86/cpu: Add get_llc_id() helper function
  perf/amd/uncore: Clean up header use, use <linux/ include paths instead of <asm/
  perf/amd/uncore: Simplify code, use free_percpu()'s built-in check for NULL
  perf/hw_breakpoint: Replace deprecated CPU-hotplug functions
  perf/x86/intel: Replace deprecated CPU-hotplug functions
  perf/x86: Remove unused assignment to pointer 'e'
  perf/x86/intel/uncore: Fix IIO cleanup mapping procedure for SNR/ICX
  perf/x86/intel/uncore: Support IMC free-running counters on Sapphire Rapids server
  perf/x86/intel/uncore: Support IIO free-running counters on Sapphire Rapids server
  perf/x86/intel/uncore: Factor out snr_uncore_mmio_map()
  perf/x86/intel/uncore: Add alias PMU name
  perf/x86/intel/uncore: Add Sapphire Rapids server MDF support
  perf/x86/intel/uncore: Add Sapphire Rapids server M3UPI support
  perf/x86/intel/uncore: Add Sapphire Rapids server UPI support
  perf/x86/intel/uncore: Add Sapphire Rapids server M2M support
  perf/x86/intel/uncore: Add Sapphire Rapids server IMC support
  perf/x86/intel/uncore: Add Sapphire Rapids server PCU support
  perf/x86/intel/uncore: Add Sapphire Rapids server M2PCIe support
  ...
2021-08-30 13:50:20 -07:00
Linus Torvalds 5d3c0db459 Scheduler changes for v5.15 are:
- The biggest change in this cycle is scheduler support for asymmetric
   scheduling affinity, to support the execution of legacy 32-bit tasks on
   AArch32 systems that also have 64-bit-only CPUs.
 
   Architectures can fill in this functionality by defining their
   own task_cpu_possible_mask(p). When this is done, the scheduler will
   make sure the task will only be scheduled on CPUs that support it.
 
   (The actual arm64 specific changes are not part of this tree.)
 
   For other architectures there will be no change in functionality.
 
 - Add cgroup SCHED_IDLE support
 
 - Increase node-distance flexibility & delay determining it until a CPU
   is brought online. (This enables platforms where node distance isn't
   final until the CPU is only.)
 
 - Deadline scheduler enhancements & fixes
 
 - Misc fixes & cleanups.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmEsrDgRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gMxBAAmzXPnDm1pDBBUaEwc+DynNGHNxZcBO5E
 CaNyfywp4GMA+OC3JzUgDg1B9uvKQRdBGtv6SZ8OcyhJMfmkEvjt5/wYUrcdtQVP
 TA2lt80/Is8LQMnvcz7X0gmsLt+fXWQTF8ik1KT4wsi/k03Xw8BH11zHct6sV2QN
 NNQ+7BEjqU1HA1UXJFiaoGtWF0gdh29VyE5dSzfAis79L0XUQadS512LJKin/AK0
 wYz8E+L7QIrjhfX9FQdOrR6da4TK6jAXyEY6a9dpaMHnFdtxuwhT4/BPtovNTeeY
 yxEZm3qSZbpghWHsMEa6Z4GIeLE6aNi3wcHt10fgdZDdotSRsNZuF6gi4A8nhRC+
 6wm+fCcFGEIBCL6eE/16Wms6YMdFfuiEAgtJGNy7GGyfH3/mS6u8eylXbLZncYXn
 DFHY+xUvmVZSzoPzcnYXEy4FB3kywNL7WBFxyhdXf5/EvWmmtHi4K3jVQ8jaqvhL
 MDk3NX9Hd0ariff3zUltWhMY5ouj6bIbBZmWWnD3s1xQT68VvE563cq0qH15dlnr
 j5M71eNRWvoOdZKzflgjRZzmdQtsZQ51tiMA6W6ZRfwYkHjb70qiia0r5GFf41X1
 MYelmcaA8+RjKrQ5etxzzDjoXl0xDXiZric6gRQHjG1Y1Zm2rVaoD+vkJGD5TQJ0
 2XTOGQgAxh4=
 =VdGE
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

 - The biggest change in this cycle is scheduler support for asymmetric
   scheduling affinity, to support the execution of legacy 32-bit tasks
   on AArch32 systems that also have 64-bit-only CPUs.

   Architectures can fill in this functionality by defining their own
   task_cpu_possible_mask(p). When this is done, the scheduler will make
   sure the task will only be scheduled on CPUs that support it.

   (The actual arm64 specific changes are not part of this tree.)

   For other architectures there will be no change in functionality.

 - Add cgroup SCHED_IDLE support

 - Increase node-distance flexibility & delay determining it until a CPU
   is brought online. (This enables platforms where node distance isn't
   final until the CPU is only.)

 - Deadline scheduler enhancements & fixes

 - Misc fixes & cleanups.

* tag 'sched-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
  eventfd: Make signal recursion protection a task bit
  sched/fair: Mark tg_is_idle() an inline in the !CONFIG_FAIR_GROUP_SCHED case
  sched: Introduce dl_task_check_affinity() to check proposed affinity
  sched: Allow task CPU affinity to be restricted on asymmetric systems
  sched: Split the guts of sched_setaffinity() into a helper function
  sched: Introduce task_struct::user_cpus_ptr to track requested affinity
  sched: Reject CPU affinity changes based on task_cpu_possible_mask()
  cpuset: Cleanup cpuset_cpus_allowed_fallback() use in select_fallback_rq()
  cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus()
  cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1
  sched: Introduce task_cpu_possible_mask() to limit fallback rq selection
  sched: Cgroup SCHED_IDLE support
  sched/topology: Skip updating masks for non-online nodes
  sched: Replace deprecated CPU-hotplug functions.
  sched: Skip priority checks with SCHED_FLAG_KEEP_PARAMS
  sched: Fix UCLAMP_FLAG_IDLE setting
  sched/deadline: Fix missing clock update in migrate_task_rq_dl()
  sched/fair: Avoid a second scan of target in select_idle_cpu
  sched/fair: Use prev instead of new target as recent_used_cpu
  sched: Don't report SCHED_FLAG_SUGOV in sched_getattr()
  ...
2021-08-30 13:42:10 -07:00
Linus Torvalds c7a5238ef6 s390 updates for 5.15 merge window
- Improve ftrace code patching so that stop_machine is not required anymore.
   This requires a small common code patch acked by Steven Rostedt:
   https://lore.kernel.org/linux-s390/20210730220741.4da6fdf6@oasis.local.home/
 
 - Enable KCSAN for s390. This comes with a small common code change to fix a
   compile warning. Acked by Marco Elver:
   https://lore.kernel.org/r/20210729142811.1309391-1-hca@linux.ibm.com
 
 - Add KFENCE support for s390. This also comes with a minimal x86 patch from
   Marco Elver who said also this can be carried via the s390 tree:
   https://lore.kernel.org/linux-s390/YQJdarx6XSUQ1tFZ@elver.google.com/
 
 - More changes to prepare the decompressor for relocation.
 
 - Enable DAT also for CPU restart path.
 
 - Final set of register asm removal patches; leaving only three locations where
   needed and sane.
 
 - Add NNPA, Vector-Packed-Decimal-Enhancement Facility 2, PCI MIO support to
   hwcaps flags.
 
 - Cleanup hwcaps implementation.
 
 - Add new instructions to in-kernel disassembler.
 
 - Various QDIO cleanups.
 
 - Add SCLP debug feature.
 
 - Various other cleanups and improvements all over the place.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEECMNfWEw3SLnmiLkZIg7DeRspbsIFAmEs1GsACgkQIg7DeRsp
 bsJ9+A/9FApCECNPgu6jOX4Ee+no+LxpCPUF8rvt56TFTLv7+Dhm7fJl0xQ9utsZ
 FyLMDAr1/FKdm2wBW23QZH4vEIt1bd6e/03DwwK+6IjHKZHRIfB8eGJMsLj/TDzm
 K6/+FI7qXjvpNXxgkCqXf5yESi/y5Dgr+16kTBhPZj5awRiwe5puPamji3uiQ45V
 r4MdGCCC9BnTZvtPpUrr8ImnUqHJ4/TMo1YYdykLbZFuAvvYUyZ5YG5kh0pMa8JZ
 DGJpfLQfy7ZNscIzdVhZtfzzESVtS6/AOeBzDMO1pbM1CGXtvpJJP0Wjlr/PGwoW
 fvuMHpqTlDi+TfNZiPP5lwsFC89xSd6gtZH7vAuI8kFCXgW3RMjABF6h/mzpH1WO
 jXyaSEZROc/83gxPMYyOYiqrKyAFPbpZ/Rnav2bvGQGneqx7RvmpF3GgA9WEo1PW
 rMDoEbLstJuHk0E2uEV+OnQd5F7MHNonzpYfp/7pyQ+PL8w2GExV9yngVc/f3TqG
 HYLC9rc3K6DkxZappcJm0qTb7lDTMFI7xK3g9RiqPQBJE1v1MYE/rai48nW69ypE
 bRNL76AjyXKo+zKR2wlhJVMY1I1+DarMopHhZj6fzQT5te1LLsv8OuTU2gkt6dIq
 YoSYOYvModf3HbKnJul2tszQG9yl+vpE9MiCyBQSsxIYXCriq/c=
 =WDRh
 -----END PGP SIGNATURE-----

Merge tag 's390-5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 updates from Heiko Carstens:

 - Improve ftrace code patching so that stop_machine is not required
   anymore. This requires a small common code patch acked by Steven
   Rostedt:

     https://lore.kernel.org/linux-s390/20210730220741.4da6fdf6@oasis.local.home/

 - Enable KCSAN for s390. This comes with a small common code change to
   fix a compile warning. Acked by Marco Elver:

     https://lore.kernel.org/r/20210729142811.1309391-1-hca@linux.ibm.com

 - Add KFENCE support for s390. This also comes with a minimal x86 patch
   from Marco Elver who said also this can be carried via the s390 tree:

     https://lore.kernel.org/linux-s390/YQJdarx6XSUQ1tFZ@elver.google.com/

 - More changes to prepare the decompressor for relocation.

 - Enable DAT also for CPU restart path.

 - Final set of register asm removal patches; leaving only three
   locations where needed and sane.

 - Add NNPA, Vector-Packed-Decimal-Enhancement Facility 2, PCI MIO
   support to hwcaps flags.

 - Cleanup hwcaps implementation.

 - Add new instructions to in-kernel disassembler.

 - Various QDIO cleanups.

 - Add SCLP debug feature.

 - Various other cleanups and improvements all over the place.

* tag 's390-5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (105 commits)
  s390: remove SCHED_CORE from defconfigs
  s390/smp: do not use nodat_stack for secondary CPU start
  s390/smp: enable DAT before CPU restart callback is called
  s390: update defconfigs
  s390/ap: fix state machine hang after failure to enable irq
  KVM: s390: generate kvm hypercall functions
  s390/sclp: add tracing of SCLP interactions
  s390/debug: add early tracing support
  s390/debug: fix debug area life cycle
  s390/debug: keep debug data on resize
  s390/diag: make restart_part2 a local label
  s390/mm,pageattr: fix walk_pte_level() early exit
  s390: fix typo in linker script
  s390: remove do_signal() prototype and do_notify_resume() function
  s390/crypto: fix all kernel-doc warnings in vfio_ap_ops.c
  s390/pci: improve DMA translation init and exit
  s390/pci: simplify CLP List PCI handling
  s390/pci: handle FH state mismatch only on disable
  s390/pci: fix misleading rc in clp_set_pci_fn()
  s390/boot: factor out offset_vmlinux_info() function
  ...
2021-08-30 13:07:15 -07:00
Linus Torvalds 44a7d44411 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
 "Algorithms:

   - Add AES-NI/AVX/x86_64 implementation of SM4.

  Drivers:

   - Add Arm SMCCC TRNG based driver"

[ And obviously a lot of random fixes and updates  - Linus]

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (84 commits)
  crypto: sha512 - remove imaginary and mystifying clearing of variables
  crypto: aesni - xts_crypt() return if walk.nbytes is 0
  padata: Remove repeated verbose license text
  crypto: ccp - Add support for new CCP/PSP device ID
  crypto: x86/sm4 - add AES-NI/AVX2/x86_64 implementation
  crypto: x86/sm4 - export reusable AESNI/AVX functions
  crypto: rmd320 - remove rmd320 in Makefile
  crypto: skcipher - in_irq() cleanup
  crypto: hisilicon - check _PS0 and _PR0 method
  crypto: hisilicon - change parameter passing of debugfs function
  crypto: hisilicon - support runtime PM for accelerator device
  crypto: hisilicon - add runtime PM ops
  crypto: hisilicon - using 'debugfs_create_file' instead of 'debugfs_create_regset32'
  crypto: tcrypt - add GCM/CCM mode test for SM4 algorithm
  crypto: testmgr - Add GCM/CCM mode test of SM4 algorithm
  crypto: tcrypt - Fix missing return value check
  crypto: hisilicon/sec - modify the hardware endian configuration
  crypto: hisilicon/sec - fix the abnormal exiting process
  crypto: qat - store vf.compatible flag
  crypto: qat - do not export adf_iov_putmsg()
  ...
2021-08-30 12:57:10 -07:00
Linus Torvalds 4ca4256453 Merge branch 'core-rcu.2021.08.28a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU updates from Paul McKenney:
 "RCU changes for this cycle were:

   - Documentation updates

   - Miscellaneous fixes

   - Offloaded-callbacks updates

   - Updates to the nolibc library

   - Tasks-RCU updates

   - In-kernel torture-test updates

   - Torture-test scripting, perhaps most notably the pinning of
     torture-test guest OSes so as to force differences in memory
     latency. For example, in a two-socket system, a four-CPU guest OS
     will have one pair of its CPUs pinned to threads in a single core
     on one socket and the other pair pinned to threads in a single core
     on the other socket. This approach proved able to force race
     conditions that earlier testing missed. Some of these race
     conditions are still being tracked down"

* 'core-rcu.2021.08.28a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (61 commits)
  torture: Replace deprecated CPU-hotplug functions.
  rcu: Replace deprecated CPU-hotplug functions
  rcu: Print human-readable message for schedule() in RCU reader
  rcu: Explain why rcu_all_qs() is a stub in preemptible TREE RCU
  rcu: Use per_cpu_ptr to get the pointer of per_cpu variable
  rcu: Remove useless "ret" update in rcu_gp_fqs_loop()
  rcu: Mark accesses in tree_stall.h
  rcu: Make rcu_gp_init() and rcu_gp_fqs_loop noinline to conserve stack
  rcu: Mark lockless ->qsmask read in rcu_check_boost_fail()
  srcutiny: Mark read-side data races
  rcu: Start timing stall repetitions after warning complete
  rcu: Do not disable GP stall detection in rcu_cpu_stall_reset()
  rcu/tree: Handle VM stoppage in stall detection
  rculist: Unify documentation about missing list_empty_rcu()
  rcu: Mark accesses to ->rcu_read_lock_nesting
  rcu: Weaken ->dynticks accesses and updates
  rcu: Remove special bit at the bottom of the ->dynticks counter
  rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock
  rcu: Fix to include first blocked task in stall warning
  torture: Make kvm-test-1-run-qemu.sh check for reboot loops
  ...
2021-08-30 12:48:01 -07:00
Kees Cook d20d30ebb1 cgroup: Avoid compiler warnings with no subsystems
As done before in commit cb4a316752 ("cgroup: use bitmask to filter
for_each_subsys"), avoid compiler warnings for the pathological case of
having no subsystems (i.e. CGROUP_SUBSYS_COUNT == 0). This condition is
hit for the arm multi_v7_defconfig config under -Wzero-length-bounds:

In file included from ./arch/arm/include/generated/asm/rwonce.h:1,
                 from include/linux/compiler.h:264,
                 from include/uapi/linux/swab.h:6,
                 from include/linux/swab.h:5,
                 from arch/arm/include/asm/opcodes.h:86,
                 from arch/arm/include/asm/bug.h:7,
                 from include/linux/bug.h:5,
                 from include/linux/thread_info.h:13,
                 from include/asm-generic/current.h:5,
                 from ./arch/arm/include/generated/asm/current.h:1,
                 from include/linux/sched.h:12,
                 from include/linux/cgroup.h:12,
                 from kernel/cgroup/cgroup-internal.h:5,
                 from kernel/cgroup/cgroup.c:31:
kernel/cgroup/cgroup.c: In function 'of_css':
kernel/cgroup/cgroup.c:651:42: warning: array subscript '<unknown>' is outside the bounds of an
interior zero-length array 'struct cgroup_subsys_state *[0]' [-Wzero-length-bounds]
  651 |   return rcu_dereference_raw(cgrp->subsys[cft->ss->id]);

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-08-30 07:36:43 -10:00
Rafael J. Wysocki fe583359dd Merge branches 'pm-pci', 'pm-sleep', 'pm-domains' and 'powercap'
* pm-pci:
  PCI: PM: Enable PME if it can be signaled from D3cold
  PCI: PM: Avoid forcing PCI_D0 for wakeup reasons inconsistently
  PCI: Use pci_update_current_state() in pci_enable_device_flags()

* pm-sleep:
  PM: sleep: unmark 'state' functions as kernel-doc
  PM: sleep: check RTC features instead of ops in suspend_test
  PM: sleep: s2idle: Replace deprecated CPU-hotplug functions

* pm-domains:
  PM: domains: Fix domain attach for CONFIG_PM_OPP=n
  arm64: dts: sc7180: Add required-opps for i2c
  PM: domains: Add support for 'required-opps' to set default perf state
  opp: Don't print an error if required-opps is missing

* powercap:
  powercap: Add Power Limit4 support for Alder Lake SoC
  powercap: intel_rapl: Replace deprecated CPU-hotplug functions
2021-08-30 19:25:42 +02:00
Rafael J. Wysocki 88e9c0bf1c Merge branches 'pm-cpufreq', 'pm-cpu' and 'pm-em'
* pm-cpufreq:
  cpufreq: intel_pstate: Process HWP Guaranteed change notification
  thermal: intel: Allow processing of HWP interrupt
  cpufreq: schedutil: Use kobject release() method to free sugov_tunables
  cpufreq: Replace deprecated CPU-hotplug functions

* pm-cpu:
  notifier: Remove atomic_notifier_call_chain_robust()
  PM: cpu: Make notifier chain use a raw_spinlock_t

* pm-em:
  PM: EM: Increase energy calculation precision
2021-08-30 19:25:13 +02:00
Linus Torvalds 3513431926 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmEmOoUACgkQnJ2qBz9k
 QNm8XggA6cbU79idj2HgDG5Wj4X3cd1+NlpPP2HytvTomwqYHJmHPAg0b44AsfHT
 ToHeYhIqb78MbLHn8fl4g5+O01raqihbUyIcRNcgdeQcle7Phuv/hDlsr8EOhfea
 mec4vottwaoiNn+4o7ciUAIri0stbVL8VPpnOy0X7iCmX7OC2vUvhQQGwvyfmXZm
 +F/8jppfvfsxl73SbWhG+sgD83D4ASZ+BGwC1Fx9tHitOVTJzg+CR2C1J4seaK6E
 r8+u/K6YKSrCGgA2dOVgzYyp2vHPGbHS4Z3H1HDXS+mtqxHJYqPR54kRT1BqCEQX
 Nm4CP7+u6a+86HLpBZi1NhvYrgLTig==
 =LJaZ
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify updates from Jan Kara:
 "fsnotify speedups when notification actually isn't used and support
  for identifying processes which caused fanotify events through pidfd
  instead of normal pid"

* tag 'fsnotify_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fsnotify: optimize the case of no marks of any type
  fsnotify: count all objects with attached connectors
  fsnotify: count s_fsnotify_inode_refs for attached connectors
  fsnotify: replace igrab() with ihold() on attach connector
  fanotify: add pidfd support to the fanotify API
  fanotify: introduce a generic info record copying helper
  fanotify: minor cosmetic adjustments to fid labels
  kernel/pid.c: implement additional checks upon pidfd_create() parameters
  kernel/pid.c: remove static qualifier from pidfd_create()
2021-08-30 10:04:31 -07:00
Petr Mladek c985aafb60 Merge branch 'rework/printk_safe-removal' into for-linus 2021-08-30 16:36:10 +02:00
Petr Mladek 715d3edb79 Merge branch 'rework/fixup-for-5.15' into for-linus 2021-08-30 16:33:04 +02:00
Petr Mladek baa99c9267 Merge branch 'for-5.15-verbose-console' into for-linus 2021-08-30 14:56:28 +02:00
Thomas Gleixner 47fb0cfdb7 irqchip updates for Linux 5.15
API updates:
 
 - Treewide conversion to generic_handle_domain_irq() for anything
   that looks like a chained interrupt controller
 
 - Update the irqdomain documentation
 
 - Use of bitmap_zalloc() throughout the tree
 
 New functionalities:
 
 - Support for GICv3 EPPI partitions
 
 Fixes:
 
 - Qualcomm PDC hierarchy fixes
 
 - Yet another priority decoding fix for the GICv3 pseudo-NMIs
 
 - Fix the apple-aic driver irq_eoi() callback to always unmask
   the interrupt
 
 - Properly handle edge interrupts on loongson-pch-pic
 
 - Let the mtk-sysirq driver advertise IRQCHIP_SKIP_SET_WAKE
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmEqIhgPHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpDjPkP/Rtp6WNZ1QUfJWmHovnh/Wc6ob1DXcBwi9nX
 hy4miIJ1SWuez9G49RlQAiXZoB28B6KKCKKmouiqu7ke7WUhifS0K1ej188wjxRX
 dqRG+m9yBAqKSr0lyWLB5VVCc8XBz4oZTc28n585gHiXfAPv7u0EzW+zNrnloLU5
 NrAj6ppGFUzVT0VxRqcurbymE6OwRWjc3D+z/PhtHZ4SFOhft95CXgsdvMklqyLj
 wwiuZ0Dhj5EruSP/Z7DzbbXnMNmte3HC2/cUNPYkho4/rk+2gVnYv5kVdfPHKQCY
 Wjti/kvuPC3hdTvdw8g7VQfP63R3clZhcQ8s+myoeX5LWzyAHpoxAtdsbX7oVsgs
 aKyrFhddEFVuiFizYyweS89pL0kCkTob8/zlGeuhRiVRTZ3+kG7Zf2UTTnN1ZdLw
 2lMolghiGk4LYJfr83+CDZyYP/VGHDCthfrmd//l39P2wJhkuCDbbeKaElLGWvUt
 abnPf0buCRqMAJe7vh0GHCx7290nEh2IqyHR4AYRVhRaN7bfAXdZH9Xp6ZGT25Fz
 uORgUbGAyhd5Ics/7twE4qeOkfJ6fxwgXsOlx90EfgVYyDJ1sBHNx8Buo2z8Bl/2
 rwCsW49kU7yX/wp11sJctR72RuLKC23dxS6z7aSWkRc6k3u+8xl2eeLIN59FNrKZ
 ToTdbXEQ
 =bdm2
 -----END PGP SIGNATURE-----

Merge tag 'irqchip-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core

Pull irqchip updates from Marc Zyngier:

- API updates:

  - Treewide conversion to generic_handle_domain_irq() for anything
    that looks like a chained interrupt controller

  - Update the irqdomain documentation

  - Use of bitmap_zalloc() throughout the tree

- New functionalities:

  - Support for GICv3 EPPI partitions

- Fixes:

  - Qualcomm PDC hierarchy fixes

  - Yet another priority decoding fix for the GICv3 pseudo-NMIs

  - Fix the apple-aic driver irq_eoi() callback to always unmask
    the interrupt

  - Properly handle edge interrupts on loongson-pch-pic

  - Let the mtk-sysirq driver advertise IRQCHIP_SKIP_SET_WAKE

Link: https://lore.kernel.org/r/20210828121013.2647964-1-maz@kernel.org
2021-08-29 21:19:50 +02:00
Linus Torvalds 537b57bd5a - Have get_push_task() check whether current has migration disabled and
thus avoid useless invocations of the migration thread
 
 - Rework initialization flow so that all rq->core's are initialized,
 even of CPUs which have not been onlined yet, so that iterating over
 them all works as expected
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmErbMkACgkQEsHwGGHe
 VUqHnA//XleDtzYa1oswLB2e/jfUU9OUIijLYWGxNxo+zwJ/cBGNZWIUd6EIN8G9
 JXvuQEjKsHu68MsT/Ctwrs2C3TnROgIwX1l2p4PhZzGeUGoOhC1Ec11/9oobGN3l
 wqPYXsAAW3/tIm1KjkKEh2vfDomB51cjUYjn7cbEJHWX93m40dD3dHnerAqrfL3d
 /yZID5ANVCyMVgV9LDFXFA8ct0vMsM1lt8WsLI6s8zQI7uzmsP4PwIeMqmYl2bEZ
 nnRZnBRxTwZROyf5G+/8X+7mj9aiQ0T01D6+xOMtwN6IoSrw/05sArTB//mB8iOf
 xile4RJtzpJ+ukw/lKyNY+1nRM21NOnbPyhnmOxBom49U4ZkkEspwHOY0gubHzxN
 ka36t5ADSdbz8QQqO/f+wEgEHpHeFOOmQU6eMeHxZQ1ZRR3VWkkHnOyuXjhz8vH4
 Kd8C0M/1p6oecHzJdpLFJxOFFvSd1awFQImJJ8UxY8OttPscnqbcy2SzHSIF4JtY
 hSacqzdI/Ppk7/ZCp5x1+PnLL9xtXMQhyyUUPCyqpoLNoPtFIzIZdOe9CrTX+FaM
 vf13LnQPG/K84OsXWFvqyISJwpcb9GP0kec2ZeiFPmdo/nkzuTCgZcOg5IaNSER0
 //ItKZOfH2JK2PG8W3G0UOwLzhpsRkD3qIg48I2JM72ADJIIBUI=
 =ZUE4
 -----END PGP SIGNATURE-----

Merge tag 'sched_urgent_for_v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Borislav Petkov:

 - Have get_push_task() check whether current has migration disabled and
   thus avoid useless invocations of the migration thread

 - Rework initialization flow so that all rq->core's are initialized,
   even of CPUs which have not been onlined yet, so that iterating over
   them all works as expected

* tag 'sched_urgent_for_v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix get_push_task() vs migrate_disable()
  sched: Fix Core-wide rq->lock for uninitialized CPUs
2021-08-29 10:54:14 -07:00
Paul E. McKenney d25a025201 clocksource: Make clocksource watchdog test safe for slow-HZ systems
The clocksource watchdog test sets a local JIFFIES_SHIFT macro and assumes
that HZ is >= 100. For smaller HZ values this shift value is too large and
causes undefined behaviour.

Move the HZ-based definitions of JIFFIES_SHIFT from kernel/time/jiffies.c
to kernel/time/tick-internal.h so the clocksource watchdog test can utilize
them, which makes it work correctly with all HZ values.

[ tglx: Resolved conflicts and massaged changelog ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/20210812000133.GA402890@paulmck-ThinkPad-P17-Gen-1/
2021-08-28 17:01:32 +02:00
YueHaibing bc17bed5fd printk/index: Fix -Wunused-function warning
If CONFIG_MODULES is n, we got this:

kernel/printk/index.c:146:13: warning: ‘pi_remove_file’ defined but not used [-Wunused-function]

Move it inside #ifdef block to fix this warning.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20210804130105.18732-1-yuehaibing@huawei.com
2021-08-27 16:42:44 +02:00
Peter Zijlstra a055fcc132 locking/rtmutex: Return success on deadlock for ww_mutex waiters
ww_mutexes can legitimately cause a deadlock situation in the lock graph
which is resolved afterwards by the wait/wound mechanics. The rtmutex chain
walk can detect such a deadlock and returns EDEADLK which in turn skips the
wait/wound mechanism and returns EDEADLK to the caller. That's wrong
because both lock chains might get EDEADLK or the wrong waiter would back
out.

Detect that situation and return 'success' in case that the waiter which
initiated the chain walk is a ww_mutex with context. This allows the
wait/wound mechanics to resolve the situation according to the rules.

[ tglx: Split it apart and added changelog ]

Reported-by: Sebastian Siewior <bigeasy@linutronix.de>
Fixes: add461325e ("locking/rtmutex: Extend the rtmutex core to support ww_mutex")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/YSeWjCHoK4v5OcOt@hirez.programming.kicks-ass.net
2021-08-27 14:28:49 +02:00
Peter Zijlstra 6467822b8c locking/rtmutex: Prevent spurious EDEADLK return caused by ww_mutexes
rtmutex based ww_mutexes can legitimately create a cycle in the lock graph
which can be observed by a blocker which didn't cause the problem:

   P1: A, ww_A, ww_B
   P2: ww_B, ww_A
   P3: A

P3 might therefore be trapped in the ww_mutex induced cycle and run into
the lock depth limitation of rt_mutex_adjust_prio_chain() which returns
-EDEADLK to the caller.

Disable the deadlock detection walk when the chain walk observes a
ww_mutex to prevent this looping.

[ tglx: Split it apart and added changelog ]

Reported-by: Sebastian Siewior <bigeasy@linutronix.de>
Fixes: add461325e ("locking/rtmutex: Extend the rtmutex core to support ww_mutex")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/YSeWjCHoK4v5OcOt@hirez.programming.kicks-ass.net
2021-08-27 14:28:49 +02:00
Cai Huoqing cedcf527d5 padata: Remove repeated verbose license text
remove it because SPDX-License-Identifier is already used

Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-08-27 16:30:18 +08:00
Jakub Kicinski 97c78d0af5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/wwan/mhi_wwan_mbim.c - drop the extra arg.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-26 17:57:57 -07:00
Eric W. Biederman d21918e5a9 signal/seccomp: Dump core when there is only one live thread
Replace get_nr_threads with atomic_read(&current->signal->live) as
that is a more accurate number that is decremented sooner.

Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/87lf6z6qbd.fsf_-_@disp2133
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-08-26 18:06:41 -05:00
Linus Torvalds 8a2cb8bd06 Networking fixes for 5.14(-rc8?), including fixes from can and bpf.
Current release - regressions:
 
  - stmmac: revert "stmmac: align RX buffers"
 
  - usb: asix: ax88772: move embedded PHY detection as early as possible
 
  - usb: asix: do not call phy_disconnect() for ax88178
 
  - Revert "net: really fix the build...", from Kalle to fix QCA6390
 
 Current release - new code bugs:
 
  - phy: mediatek: add the missing suspend/resume callbacks
 
 Previous releases - regressions:
 
  - qrtr: fix another OOB Read in qrtr_endpoint_post
 
  - stmmac: dwmac-rk: fix unbalanced pm_runtime_enable warnings
 
 Previous releases - always broken:
 
  - inet: use siphash in exception handling
 
  - ip_gre: add validation for csum_start
 
  - bpf: fix ringbuf helper function compatibility
 
  - rtnetlink: return correct error on changing device netns
 
  - e1000e: do not try to recover the NVM checksum on Tiger Lake
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmEn4nAACgkQMUZtbf5S
 IrtNnBAAnF6dxSdVOMZe0pScj4YLp3Vwfxi3sFTQ/9iUf3hbwyEZTntUdJ9xQjBd
 V8f+V7gorvPCEszYxZKAgqwZdEuOhNZdPzmEveug9Ln8AdV84RT5Pvh0PpY2Tzop
 jloh58+3vnNYJKUlrCavwKcG5eF+g/hZdgDMzp5hqFAqY1W4liZAR+u3LKYHggy2
 jAFk8/gRIzOHOAB0g4JuXwTUDhOxIKscUyJbvd8z/9X5MZLqnKvz8+tFIvU2ipJ+
 2P6Q7VmF57v8sDBII7tvpFqG1pR2X5JjgNasH3J1O1ttR268OlZkNwH/09vZe6Ih
 WZcebfcQWEOqv8HTFr992d9jHVHHnN8hlJkD1Co0yBJTsDbGfWhR3ngnKGvZ14is
 5RNjHgmHEvmCnIKaZkBI2pPP6HQBmxFinP12wldVa/Na0bpqjZpDs8YFZ11H74ST
 DP4CXR6YKrIRWCiIxT2NDbupIZwGVzRtzNAfjbjTTkN7wgRbVtcR2xkPxV4fiEcO
 DJ1cE//1fpj9m9W+Ln4evRfDmbCEMsyJjozlTub4cKqCiE6ywTuSa4OZ21/nI4/k
 LS0CT/VF9uRU9QSElHNPuMDIIKMnJYJobdZzXh9wmniG+MQw46XC69QOfO631Ly2
 BayvXf/5EvxLl0xY5Ub5K6PmECtZ/QLJCa+nxIWi/Btus9F6RfA=
 =ilTg
 -----END PGP SIGNATURE-----

Merge tag 'net-5.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Networking fixes, including fixes from can and bpf.

  Closing three hw-dependent regressions. Any fixes of note are in the
  'old code' category. Nothing blocking release from our perspective.

  Current release - regressions:

   - stmmac: revert "stmmac: align RX buffers"

   - usb: asix: ax88772: move embedded PHY detection as early as
     possible

   - usb: asix: do not call phy_disconnect() for ax88178

   - Revert "net: really fix the build...", from Kalle to fix QCA6390

  Current release - new code bugs:

   - phy: mediatek: add the missing suspend/resume callbacks

  Previous releases - regressions:

   - qrtr: fix another OOB Read in qrtr_endpoint_post

   - stmmac: dwmac-rk: fix unbalanced pm_runtime_enable warnings

  Previous releases - always broken:

   - inet: use siphash in exception handling

   - ip_gre: add validation for csum_start

   - bpf: fix ringbuf helper function compatibility

   - rtnetlink: return correct error on changing device netns

   - e1000e: do not try to recover the NVM checksum on Tiger Lake"

* tag 'net-5.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (43 commits)
  Revert "net: really fix the build..."
  net: hns3: fix get wrong pfc_en when query PFC configuration
  net: hns3: fix GRO configuration error after reset
  net: hns3: change the method of getting cmd index in debugfs
  net: hns3: fix duplicate node in VLAN list
  net: hns3: fix speed unknown issue in bond 4
  net: hns3: add waiting time before cmdq memory is released
  net: hns3: clear hardware resource when loading driver
  net: fix NULL pointer reference in cipso_v4_doi_free
  rtnetlink: Return correct error on changing device netns
  net: dsa: hellcreek: Adjust schedule look ahead window
  net: dsa: hellcreek: Fix incorrect setting of GCL
  cxgb4: dont touch blocked freelist bitmap after free
  ipv4: use siphash instead of Jenkins in fnhe_hashfun()
  ipv6: use siphash in rt6_exception_hash()
  can: usb: esd_usb2: esd_usb2_rx_event(): fix the interchange of the CAN RX and TX error counters
  net: usb: asix: ax88772: fix boolconv.cocci warnings
  net/sched: ets: fix crash when flipping from 'strict' to 'quantum'
  qede: Fix memset corruption
  net: stmmac: fix kernel panic due to NULL pointer dereference of buf->xdp
  ...
2021-08-26 13:20:22 -07:00
Sebastian Andrzej Siewior e681dcbaa4 sched: Fix get_push_task() vs migrate_disable()
push_rt_task() attempts to move the currently running task away if the
next runnable task has migration disabled and therefore is pinned on the
current CPU.

The current task is retrieved via get_push_task() which only checks for
nr_cpus_allowed == 1, but does not check whether the task has migration
disabled and therefore cannot be moved either. The consequence is a
pointless invocation of the migration thread which correctly observes
that the task cannot be moved.

Return NULL if the task has migration disabled and cannot be moved to
another CPU.

Fixes: a7c81556ec ("sched: Fix migrate_disable() vs rt/dl balancing")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210826133738.yiotqbtdaxzjsnfj@linutronix.de
2021-08-26 19:02:00 +02:00
Eric W. Biederman 307d522f5e signal/seccomp: Refactor seccomp signal and coredump generation
Factor out force_sig_seccomp from the seccomp signal generation and
place it in kernel/signal.c.  The function force_sig_seccomp takes a
parameter force_coredump to indicate that the sigaction field should
be reset to SIGDFL so that a coredump will be generated when the
signal is delivered.

force_sig_seccomp is then used to replace both seccomp_send_sigsys
and seccomp_init_siginfo.

force_sig_info_to_task gains an extra parameter to force using
the default signal action.

With this change seccomp is no longer a special case and there
becomes exactly one place do_coredump is called from.

Further it no longer becomes necessary for __seccomp_filter
to call do_group_exit.

Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/87r1gr6qc4.fsf_-_@disp2133
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-08-26 10:30:12 -05:00
Ingo Molnar 366e7ad6ba sched/fair: Mark tg_is_idle() an inline in the !CONFIG_FAIR_GROUP_SCHED case
It's not actually used in the !CONFIG_FAIR_GROUP_SCHED case:

  kernel/sched/fair.c:488:12: warning: ‘tg_is_idle’ defined but not used [-Wunused-function]

Keep around a placeholder nevertheless, for API completeness. Mark it inline,
so the compiler doesn't think it must be used.

Fixes: 304000390f88: ("sched: Cgroup SCHED_IDLE support")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Josh Don <joshdon@google.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
2021-08-26 10:49:24 +02:00
Sebastian Andrzej Siewior ffec09f9c7 perf/hw_breakpoint: Replace deprecated CPU-hotplug functions
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().

Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210803141621.780504-12-bigeasy@linutronix.de
2021-08-26 09:14:36 +02:00
Daniel Xu eb529c5b10 bpf: Fix bpf-next builds without CONFIG_BPF_EVENTS
This commit fixes linker errors along the lines of:

    s390-linux-ld: task_iter.c:(.init.text+0xa4): undefined reference to `btf_task_struct_ids'`

Fix by defining btf_task_struct_ids unconditionally in kernel/bpf/btf.c
since there exists code that unconditionally uses btf_task_struct_ids.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/05d94748d9f4b3eecedc4fddd6875418a396e23c.1629942444.git.dxu@dxuuu.xyz
2021-08-25 19:41:39 -07:00
Martin KaFai Lau eb18b49ea7 bpf: tcp: Allow bpf-tcp-cc to call bpf_(get|set)sockopt
This patch allows the bpf-tcp-cc to call bpf_setsockopt.  One use
case is to allow a bpf-tcp-cc switching to another cc during init().
For example, when the tcp flow is not ecn ready, the bpf_dctcp
can switch to another cc by calling setsockopt(TCP_CONGESTION).

During setsockopt(TCP_CONGESTION), the new tcp-cc's init() will be
called and this could cause a recursion but it is stopped by the
current trampoline's logic (in the prog->active counter).

While retiring a bpf-tcp-cc (e.g. in tcp_v[46]_destroy_sock()),
the tcp stack calls bpf-tcp-cc's release().  To avoid the retiring
bpf-tcp-cc making further changes to the sk, bpf_setsockopt is not
available to the bpf-tcp-cc's release().  This will avoid release()
making setsockopt() call that will potentially allocate new resources.

Although the bpf-tcp-cc already has a more powerful way to read tcp_sock
from the PTR_TO_BTF_ID, it is usually expected that bpf_getsockopt and
bpf_setsockopt are available together.  Thus, bpf_getsockopt() is also
added to all tcp_congestion_ops except release().

When the old bpf-tcp-cc is calling setsockopt(TCP_CONGESTION)
to switch to a new cc, the old bpf-tcp-cc will be released by
bpf_struct_ops_put().  Thus, this patch also puts the bpf_struct_ops_map
after a rcu grace period because the trampoline's image cannot be freed
while the old bpf-tcp-cc is still running.

bpf-tcp-cc can only access icsk_ca_priv as SCALAR.  All kernel's
tcp-cc is also accessing the icsk_ca_priv as SCALAR.   The size
of icsk_ca_priv has already been raised a few times to avoid
extra kmalloc and memory referencing.  The only exception is the
kernel's tcp_cdg.c that stores a kmalloc()-ed pointer in icsk_ca_priv.
To avoid the old bpf-tcp-cc accidentally overriding this tcp_cdg's pointer
value stored in icsk_ca_priv after switching and without over-complicating
the bpf's verifier for this one exception in tcp_cdg, this patch does not
allow switching to tcp_cdg.  If there is a need, bpf_tcp_cdg can be
implemented and then use the bpf_sk_storage as the extended storage.

bpf_sk_setsockopt proto has only been recently added and used
in bpf-sockopt and bpf-iter-tcp, so impose the tcp_cdg limitation in the
same proto instead of adding a new proto specifically for bpf-tcp-cc.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210824173007.3976921-1-kafai@fb.com
2021-08-25 17:40:35 -07:00
Daniel Xu dd6e10fbd9 bpf: Add bpf_task_pt_regs() helper
The motivation behind this helper is to access userspace pt_regs in a
kprobe handler.

uprobe's ctx is the userspace pt_regs. kprobe's ctx is the kernelspace
pt_regs. bpf_task_pt_regs() allows accessing userspace pt_regs in a
kprobe handler. The final case (kernelspace pt_regs in uprobe) is
pretty rare (usermode helper) so I think that can be solved later if
necessary.

More concretely, this helper is useful in doing BPF-based DWARF stack
unwinding. Currently the kernel can only do framepointer based stack
unwinds for userspace code. This is because the DWARF state machines are
too fragile to be computed in kernelspace [0]. The idea behind
DWARF-based stack unwinds w/ BPF is to copy a chunk of the userspace
stack (while in prog context) and send it up to userspace for unwinding
(probably with libunwind) [1]. This would effectively enable profiling
applications with -fomit-frame-pointer using kprobes and uprobes.

[0]: https://lkml.org/lkml/2012/2/10/356
[1]: https://github.com/danobi/bpf-dwarf-walk

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/e2718ced2d51ef4268590ab8562962438ab82815.1629772842.git.dxu@dxuuu.xyz
2021-08-25 10:37:05 -07:00
Daniel Xu a396eda551 bpf: Extend bpf_base_func_proto helpers with bpf_get_current_task_btf()
bpf_get_current_task() is already supported so it's natural to also
include the _btf() variant for btf-powered helpers.

This is required for non-tracing progs to use bpf_task_pt_regs() in the
next commit.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/f99870ed5f834c9803d73b3476f8272b1bb987c0.1629772842.git.dxu@dxuuu.xyz
2021-08-25 10:37:05 -07:00
Daniel Xu 33c5cb3601 bpf: Consolidate task_struct BTF_ID declarations
No need to have it defined 5 times. Once is enough.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/6dcefa5bed26fe1226f26683f36819bb53ec19a2.1629772842.git.dxu@dxuuu.xyz
2021-08-25 10:37:05 -07:00
Linus Torvalds 62add98208 Merge branch 'for-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull ucount fixes from Eric Biederman:
 "This branch fixes a regression that made it impossible to increase
  rlimits that had been converted to the ucount infrastructure, and also
  fixes a reference counting bug where the reference was not incremented
  soon enough.

  The fixes are trivial and the bugs have been encountered in the wild,
  and the fixes have been tested"

* 'for-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  ucounts: Increase ucounts reference counter before the security hook
  ucounts: Fix regression preventing increasing of rlimits in init_user_ns
2021-08-25 09:56:10 -07:00
Nicolas Saenz Julienne 9f72daf7ed cgroup/cpuset: Avoid memory migration when nodemasks match
With the introduction of ee9707e859 ("cgroup/cpuset: Enable memory
migration for cpuset v2") attaching a process to a different cgroup will
trigger a memory migration regardless of whether it's really needed.
Memory migration is an expensive operation, so bypass it if the
nodemasks passed to cpuset_migrate_mm() are equal.

Note that we're not only avoiding the migration work itself, but also a
call to lru_cache_disable(), which triggers and flushes an LRU drain
work on every online CPU.

Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-08-25 06:51:51 -10:00
Thomas Gleixner 37e8abff2b locking/rtmutex: Dequeue waiter on ww_mutex deadlock
The rt_mutex based ww_mutex variant queues the new waiter first in the
lock's rbtree before evaluating the ww_mutex specific conditions which
might decide that the waiter should back out. This check and conditional
exit happens before the waiter is enqueued into the PI chain.

The failure handling at the call site assumes that the waiter, if it is the
top most waiter on the lock, is queued in the PI chain and then proceeds to
adjust the unmodified PI chain, which results in RB tree corruption.

Dequeue the waiter from the lock waiter list in the ww_mutex error exit
path to prevent this.

Fixes: add461325e ("locking/rtmutex: Extend the rtmutex core to support ww_mutex")
Reported-by: Sebastian Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210825102454.042280541@linutronix.de
2021-08-25 15:42:33 +02:00
Thomas Gleixner c3123c4314 locking/rtmutex: Dont dereference waiter lockless
The new rt_mutex_spin_on_onwer() loop checks whether the spinning waiter is
still the top waiter on the lock by utilizing rt_mutex_top_waiter(), which
is broken because that function contains a sanity check which dereferences
the top waiter pointer to check whether the waiter belongs to the
lock. That's wrong in the lockless spinwait case:

 CPU 0							CPU 1
 rt_mutex_lock(lock)					rt_mutex_lock(lock);
   queue(waiter0)
   waiter0 == rt_mutex_top_waiter(lock)
   rt_mutex_spin_on_onwer(lock, waiter0) {		queue(waiter1)
   					 		waiter1 == rt_mutex_top_waiter(lock)
   							...
     top_waiter = rt_mutex_top_waiter(lock)
       leftmost = rb_first_cached(&lock->waiters);
							-> signal
							dequeue(waiter1)
							destroy(waiter1)
       w = rb_entry(leftmost, ....)
       BUG_ON(w->lock != lock)	 <- UAF

The BUG_ON() is correct for the case where the caller holds lock->wait_lock
which guarantees that the leftmost waiter entry cannot vanish. For the
lockless spinwait case it's broken.

Create a new helper function which avoids the pointer dereference and just
compares the leftmost entry pointer with current's waiter pointer to
validate that currrent is still elegible for spinning.

Fixes: 992caf7f17 ("locking/rtmutex: Add adaptive spinwait mechanism")
Reported-by: Sebastian Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210825102453.981720644@linutronix.de
2021-08-25 15:42:32 +02:00
Richard Guy Briggs 67d69e9d1a audit: move put_tree() to avoid trim_trees refcount underflow and UAF
AUDIT_TRIM is expected to be idempotent, but multiple executions resulted
in a refcount underflow and use-after-free.

git bisect fingered commit fb041bb7c0	("locking/refcount: Consolidate
implementations of refcount_t") but this patch with its more thorough
checking that wasn't in the x86 assembly code merely exposed a previously
existing tree refcount imbalance in the case of tree trimming code that
was refactored with prune_one() to remove a tree introduced in
commit 8432c70062 ("audit: Simplify locking around untag_chunk()")

Move the put_tree() to cover only the prune_one() case.

Passes audit-testsuite and 3 passes of "auditctl -t" with at least one
directory watch.

Cc: Jan Kara <jack@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Seiji Nishikawa <snishika@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 8432c70062 ("audit: Simplify locking around untag_chunk()")
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
[PM: reformatted/cleaned-up the commit description]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2021-08-24 18:52:36 -04:00
Andrey Ignatov d7af7e497f bpf: Fix possible out of bound write in narrow load handling
Fix a verifier bug found by smatch static checker in [0].

This problem has never been seen in prod to my best knowledge. Fixing it
still seems to be a good idea since it's hard to say for sure whether
it's possible or not to have a scenario where a combination of
convert_ctx_access() and a narrow load would lead to an out of bound
write.

When narrow load is handled, one or two new instructions are added to
insn_buf array, but before it was only checked that

	cnt >= ARRAY_SIZE(insn_buf)

And it's safe to add a new instruction to insn_buf[cnt++] only once. The
second try will lead to out of bound write. And this is what can happen
if `shift` is set.

Fix it by making sure that if the BPF_RSH instruction has to be added in
addition to BPF_AND then there is enough space for two more instructions
in insn_buf.

The full report [0] is below:

kernel/bpf/verifier.c:12304 convert_ctx_accesses() warn: offset 'cnt' incremented past end of array
kernel/bpf/verifier.c:12311 convert_ctx_accesses() warn: offset 'cnt' incremented past end of array

kernel/bpf/verifier.c
    12282
    12283 			insn->off = off & ~(size_default - 1);
    12284 			insn->code = BPF_LDX | BPF_MEM | size_code;
    12285 		}
    12286
    12287 		target_size = 0;
    12288 		cnt = convert_ctx_access(type, insn, insn_buf, env->prog,
    12289 					 &target_size);
    12290 		if (cnt == 0 || cnt >= ARRAY_SIZE(insn_buf) ||
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
Bounds check.

    12291 		    (ctx_field_size && !target_size)) {
    12292 			verbose(env, "bpf verifier is misconfigured\n");
    12293 			return -EINVAL;
    12294 		}
    12295
    12296 		if (is_narrower_load && size < target_size) {
    12297 			u8 shift = bpf_ctx_narrow_access_offset(
    12298 				off, size, size_default) * 8;
    12299 			if (ctx_field_size <= 4) {
    12300 				if (shift)
    12301 					insn_buf[cnt++] = BPF_ALU32_IMM(BPF_RSH,
                                                         ^^^^^
increment beyond end of array

    12302 									insn->dst_reg,
    12303 									shift);
--> 12304 				insn_buf[cnt++] = BPF_ALU32_IMM(BPF_AND, insn->dst_reg,
                                                 ^^^^^
out of bounds write

    12305 								(1 << size * 8) - 1);
    12306 			} else {
    12307 				if (shift)
    12308 					insn_buf[cnt++] = BPF_ALU64_IMM(BPF_RSH,
    12309 									insn->dst_reg,
    12310 									shift);
    12311 				insn_buf[cnt++] = BPF_ALU64_IMM(BPF_AND, insn->dst_reg,
                                        ^^^^^^^^^^^^^^^
Same.

    12312 								(1ULL << size * 8) - 1);
    12313 			}
    12314 		}
    12315
    12316 		new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
    12317 		if (!new_prog)
    12318 			return -ENOMEM;
    12319
    12320 		delta += cnt - 1;
    12321
    12322 		/* keep walking new program and skip insns we just inserted */
    12323 		env->prog = new_prog;
    12324 		insn      = new_prog->insnsi + i + delta;
    12325 	}
    12326
    12327 	return 0;
    12328 }

[0] https://lore.kernel.org/bpf/20210817050843.GA21456@kili/

v1->v2:
- clarify that problem was only seen by static checker but not in prod;

Fixes: 46f53a65d2 ("bpf: Allow narrow loads with offset > 0")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210820163935.1902398-1-rdna@fb.com
2021-08-24 14:32:26 -07:00
Barry Song 2f170814bd genirq/msi: Move MSI sysfs handling from PCI to MSI core
Move PCI's MSI sysfs code to the irq core so that other busses such as
platform can reuse it.

Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210813035628.6844-2-21cnbao@gmail.com
2021-08-24 09:16:20 +02:00
Lee Jones 88ffe2d0a5 genirq/cpuhotplug: Demote debug printk to KERN_DEBUG
This sort of information is only generally useful when debugging.
No need to have these sprinkled through the kernel log otherwise.

Real world problem:

  During pre-release testing these have an affect on performance on
  real products.  To the point where so much logging builds up, that
  it sets off the watchdog(s) on some high profile consumer devices.

Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210816134817.1503661-1-lee.jones@linaro.org
2021-08-24 09:16:20 +02:00
Dave Marchevsky 6fc88c354f bpf: Migrate cgroup_bpf to internal cgroup_bpf_attach_type enum
Add an enum (cgroup_bpf_attach_type) containing only valid cgroup_bpf
attach types and a function to map bpf_attach_type values to the new
enum. Inspired by netns_bpf_attach_type.

Then, migrate cgroup_bpf to use cgroup_bpf_attach_type wherever
possible.  Functionality is unchanged as attach_type_to_prog_type
switches in bpf/syscall.c were preventing non-cgroup programs from
making use of the invalid cgroup_bpf array slots.

As a result struct cgroup_bpf uses 504 fewer bytes relative to when its
arrays were sized using MAX_BPF_ATTACH_TYPE.

bpf_cgroup_storage is notably not migrated as struct
bpf_cgroup_storage_key is part of uapi and contains a bpf_attach_type
member which is not meant to be opaque. Similarly, bpf_cgroup_link
continues to report its bpf_attach_type member to userspace via fdinfo
and bpf_link_info.

To ease disambiguation, bpf_attach_type variables are renamed from
'type' to 'atype' when changed to cgroup_bpf_attach_type.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210819092420.1984861-2-davemarchevsky@fb.com
2021-08-23 17:50:24 -07:00
Alexey Gladkov bbb6d0f3e1 ucounts: Increase ucounts reference counter before the security hook
We need to increment the ucounts reference counter befor security_prepare_creds()
because this function may fail and abort_creds() will try to decrement
this reference.

[   96.465056][ T8641] FAULT_INJECTION: forcing a failure.
[   96.465056][ T8641] name fail_page_alloc, interval 1, probability 0, space 0, times 0
[   96.478453][ T8641] CPU: 1 PID: 8641 Comm: syz-executor668 Not tainted 5.14.0-rc6-syzkaller #0
[   96.487215][ T8641] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[   96.497254][ T8641] Call Trace:
[   96.500517][ T8641]  dump_stack_lvl+0x1d3/0x29f
[   96.505758][ T8641]  ? show_regs_print_info+0x12/0x12
[   96.510944][ T8641]  ? log_buf_vmcoreinfo_setup+0x498/0x498
[   96.516652][ T8641]  should_fail+0x384/0x4b0
[   96.521141][ T8641]  prepare_alloc_pages+0x1d1/0x5a0
[   96.526236][ T8641]  __alloc_pages+0x14d/0x5f0
[   96.530808][ T8641]  ? __rmqueue_pcplist+0x2030/0x2030
[   96.536073][ T8641]  ? lockdep_hardirqs_on_prepare+0x3e2/0x750
[   96.542056][ T8641]  ? alloc_pages+0x3f3/0x500
[   96.546635][ T8641]  allocate_slab+0xf1/0x540
[   96.551120][ T8641]  ___slab_alloc+0x1cf/0x350
[   96.555689][ T8641]  ? kzalloc+0x1d/0x30
[   96.559740][ T8641]  __kmalloc+0x2e7/0x390
[   96.563980][ T8641]  ? kzalloc+0x1d/0x30
[   96.568029][ T8641]  kzalloc+0x1d/0x30
[   96.571903][ T8641]  security_prepare_creds+0x46/0x220
[   96.577174][ T8641]  prepare_creds+0x411/0x640
[   96.581747][ T8641]  __sys_setfsuid+0xe2/0x3a0
[   96.586333][ T8641]  do_syscall_64+0x3d/0xb0
[   96.590739][ T8641]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   96.596611][ T8641] RIP: 0033:0x445a69
[   96.600483][ T8641] Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
[   96.620152][ T8641] RSP: 002b:00007f1054173318 EFLAGS: 00000246 ORIG_RAX: 000000000000007a
[   96.628543][ T8641] RAX: ffffffffffffffda RBX: 00000000004ca4c8 RCX: 0000000000445a69
[   96.636600][ T8641] RDX: 0000000000000010 RSI: 00007f10541732f0 RDI: 0000000000000000
[   96.644550][ T8641] RBP: 00000000004ca4c0 R08: 0000000000000001 R09: 0000000000000000
[   96.652500][ T8641] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004ca4cc
[   96.660631][ T8641] R13: 00007fffffe0b62f R14: 00007f1054173400 R15: 0000000000022000

Fixes: 905ae01c4a ("Add a reference to ucounts for each cred")
Reported-by: syzbot+01985d7909f9468f013c@syzkaller.appspotmail.com
Signed-off-by: Alexey Gladkov <legion@kernel.org>
Link: https://lkml.kernel.org/r/97433b1742c3331f02ad92de5a4f07d673c90613.1629735352.git.legion@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2021-08-23 16:13:04 -05:00
Eric W. Biederman 5ddf994fa2 ucounts: Fix regression preventing increasing of rlimits in init_user_ns
"Ma, XinjianX" <xinjianx.ma@intel.com> reported:

> When lkp team run kernel selftests, we found after these series of patches, testcase mqueue: mq_perf_tests
> in kselftest failed with following message.
>
> # selftests: mqueue: mq_perf_tests
> #
> # Initial system state:
> #       Using queue path:                       /mq_perf_tests
> #       RLIMIT_MSGQUEUE(soft):                  819200
> #       RLIMIT_MSGQUEUE(hard):                  819200
> #       Maximum Message Size:                   8192
> #       Maximum Queue Size:                     10
> #       Nice value:                             0
> #
> # Adjusted system state for testing:
> #       RLIMIT_MSGQUEUE(soft):                  (unlimited)
> #       RLIMIT_MSGQUEUE(hard):                  (unlimited)
> #       Maximum Message Size:                   16777216
> #       Maximum Queue Size:                     65530
> #       Nice value:                             -20
> #       Continuous mode:                        (disabled)
> #       CPUs to pin:                            3
> # ./mq_perf_tests: mq_open() at 296: Too many open files
> not ok 2 selftests: mqueue: mq_perf_tests # exit=1
> ```
>
> Test env:
> rootfs: debian-10
> gcc version: 9

After investigation the problem turned out to be that ucount_max for
the rlimits in init_user_ns was being set to the initial rlimit value.
The practical problem is that ucount_max provides a limit that
applications inside the user namespace can not exceed.  Which means in
practice that rlimits that have been converted to use the ucount
infrastructure were not able to exceend their initial rlimits.

Solve this by setting the relevant values of ucount_max to
RLIM_INIFINITY.  A limit in init_user_ns is pointless so the code
should allow the values to grow as large as possible without riscking
an underflow or an overflow.

As the ltp test case was a bit of a pain I have reproduced the rlimit failure
and tested the fix with the following little C program:
> #include <stdio.h>
> #include <fcntl.h>
> #include <sys/stat.h>
> #include <mqueue.h>
> #include <sys/time.h>
> #include <sys/resource.h>
> #include <errno.h>
> #include <string.h>
> #include <stdlib.h>
> #include <limits.h>
> #include <unistd.h>
>
> int main(int argc, char **argv)
> {
> 	struct mq_attr mq_attr;
> 	struct rlimit rlim;
> 	mqd_t mqd;
> 	int ret;
>
> 	ret = getrlimit(RLIMIT_MSGQUEUE, &rlim);
> 	if (ret != 0) {
> 		fprintf(stderr, "getrlimit(RLIMIT_MSGQUEUE) failed: %s\n", strerror(errno));
> 		exit(EXIT_FAILURE);
> 	}
> 	printf("RLIMIT_MSGQUEUE %lu %lu\n",
> 	       rlim.rlim_cur, rlim.rlim_max);
> 	rlim.rlim_cur = RLIM_INFINITY;
> 	rlim.rlim_max = RLIM_INFINITY;
> 	ret = setrlimit(RLIMIT_MSGQUEUE, &rlim);
> 	if (ret != 0) {
> 		fprintf(stderr, "setrlimit(RLIMIT_MSGQUEUE, RLIM_INFINITY) failed: %s\n", strerror(errno));
> 		exit(EXIT_FAILURE);
> 	}
>
> 	memset(&mq_attr, 0, sizeof(struct mq_attr));
> 	mq_attr.mq_maxmsg = 65536 - 1;
> 	mq_attr.mq_msgsize = 16*1024*1024 - 1;
>
> 	mqd = mq_open("/mq_rlimit_test", O_RDONLY|O_CREAT, 0600, &mq_attr);
> 	if (mqd == (mqd_t)-1) {
> 		fprintf(stderr, "mq_open failed: %s\n", strerror(errno));
> 		exit(EXIT_FAILURE);
> 	}
> 	ret = mq_close(mqd);
> 	if (ret) {
> 		fprintf(stderr, "mq_close failed; %s\n", strerror(errno));
> 		exit(EXIT_FAILURE);
> 	}
>
> 	return EXIT_SUCCESS;
> }

Fixes: 6e52a9f053 ("Reimplement RLIMIT_MSGQUEUE on top of ucounts")
Fixes: d7c9e99aee ("Reimplement RLIMIT_MEMLOCK on top of ucounts")
Fixes: d646969055 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
Fixes: 21d1c5e386 ("Reimplement RLIMIT_NPROC on top of ucounts")
Reported-by: kernel test robot lkp@intel.com
Acked-by: Alexey Gladkov <legion@kernel.org>
Link: https://lkml.kernel.org/r/87eeajswfc.fsf_-_@disp2133
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-08-23 16:10:42 -05:00
Daniel Borkmann 5b029a32cf bpf: Fix ringbuf helper function compatibility
Commit 457f44363a ("bpf: Implement BPF ring buffer and verifier support
for it") extended check_map_func_compatibility() by enforcing map -> helper
function match, but not helper -> map type match.

Due to this all of the bpf_ringbuf_*() helper functions could be used with
a wrong map type such as array or hash map, leading to invalid access due
to type confusion.

Also, both BPF_FUNC_ringbuf_{submit,discard} have ARG_PTR_TO_ALLOC_MEM as
argument and not a BPF map. Therefore, their check_map_func_compatibility()
presence is incorrect since it's only for map type checking.

Fixes: 457f44363a ("bpf: Implement BPF ring buffer and verifier support for it")
Reported-by: Ryota Shiga (Flatt Security)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-08-23 23:09:10 +02:00
Hao Xu f552a27afe io_uring: remove files pointer in cancellation functions
When doing cancellation, we use a parameter to indicate where it's from
do_exit or exec. So a boolean value is good enough for this, remove the
struct files* as it is not necessary.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
[axboe: fixup io_uring_files_cancel for !CONFIG_IO_URING]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:37 -06:00
Rafael J. Wysocki 43dde64bb1 Merge back cpufreq changes for v5.15. 2021-08-23 13:48:40 +02:00
Maulik Shah 131d326ba9 irqdomain: Export irq_domain_disconnect_hierarchy()
Export irq_domain_disconnect_hierarchy() so irqchip module drivers
can use it.

Signed-off-by: Maulik Shah <mkshah@codeaurora.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/1629705880-27877-2-git-send-email-mkshah@codeaurora.org
2021-08-23 09:24:57 +01:00
Tzvetomir Stoyanov (VMware) 7491e2c442 tracing: Add a probe that attaches to trace events
A new dynamic event is introduced: event probe. The event is attached
to an existing tracepoint and uses its fields as arguments. The user
can specify custom format string of the new event, select what tracepoint
arguments will be printed and how to print them.
An event probe is created by writing configuration string in
'dynamic_events' ftrace file:
 e[:[SNAME/]ENAME] SYSTEM/EVENT [FETCHARGS]	- Set an event probe
 -:SNAME/ENAME					- Delete an event probe

Where:
 SNAME	- System name, if omitted 'eprobes' is used.
 ENAME	- Name of the new event in SNAME, if omitted the SYSTEM_EVENT is used.
 SYSTEM	- Name of the system, where the tracepoint is defined, mandatory.
 EVENT	- Name of the tracepoint event in SYSTEM, mandatory.
 FETCHARGS - Arguments:
  <name>=$<field>[:TYPE] - Fetch given filed of the tracepoint and print
			   it as given TYPE with given name. Supported
			   types are:
	                    (u8/u16/u32/u64/s8/s16/s32/s64), basic type
        	            (x8/x16/x32/x64), hexadecimal types
			    "string", "ustring" and bitfield.

Example, attach an event probe on openat system call and print name of the
file that will be opened:
 echo "e:esys/eopen syscalls/sys_enter_openat file=\$filename:string" >> dynamic_events
A new dynamic event is created in events/esys/eopen/ directory. It
can be deleted with:
 echo "-:esys/eopen" >> dynamic_events

Filters, triggers and histograms can be attached to the new event, it can
be matched in synthetic events. There is one limitation - an event probe
can not be attached to kprobe, uprobe or another event probe.

Link: https://lkml.kernel.org/r/20210812145805.2292326-1-tz.stoyanov@gmail.com
Link: https://lkml.kernel.org/r/20210819152825.142428383@goodmis.org

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Co-developed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-20 14:18:40 -04:00
Xiaoming Ni 99409b935c locking/semaphore: Add might_sleep() to down_*() family
Semaphore is sleeping lock. Add might_sleep() to down*() family
(with exception of down_trylock()) to detect atomic context sleep.

Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210809021215.19991-1-nixiaoming@huawei.com
2021-08-20 12:33:17 +02:00
Will Deacon 234b8ab647 sched: Introduce dl_task_check_affinity() to check proposed affinity
In preparation for restricting the affinity of a task during execve()
on arm64, introduce a new dl_task_check_affinity() helper function to
give an indication as to whether the restricted mask is admissible for
a deadline task.

Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lore.kernel.org/r/20210730112443.23245-10-will@kernel.org
2021-08-20 12:33:00 +02:00
Will Deacon 07ec77a1d4 sched: Allow task CPU affinity to be restricted on asymmetric systems
Asymmetric systems may not offer the same level of userspace ISA support
across all CPUs, meaning that some applications cannot be executed by
some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do
not feature support for 32-bit applications on both clusters.

Although userspace can carefully manage the affinity masks for such
tasks, one place where it is particularly problematic is execve()
because the CPU on which the execve() is occurring may be incompatible
with the new application image. In such a situation, it is desirable to
restrict the affinity mask of the task and ensure that the new image is
entered on a compatible CPU. From userspace's point of view, this looks
the same as if the incompatible CPUs have been hotplugged off in the
task's affinity mask. Similarly, if a subsequent execve() reverts to
a compatible image, then the old affinity is restored if it is still
valid.

In preparation for restricting the affinity mask for compat tasks on
arm64 systems without uniform support for 32-bit applications, introduce
{force,relax}_compatible_cpus_allowed_ptr(), which respectively restrict
and restore the affinity mask for a task based on the compatible CPUs.

Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20210730112443.23245-9-will@kernel.org
2021-08-20 12:33:00 +02:00
Will Deacon db3b02ae89 sched: Split the guts of sched_setaffinity() into a helper function
In preparation for replaying user affinity requests using a saved mask,
split sched_setaffinity() up so that the initial task lookup and
security checks are only performed when the request is coming directly
from userspace.

Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com>
Link: https://lore.kernel.org/r/20210730112443.23245-8-will@kernel.org
2021-08-20 12:33:00 +02:00
Will Deacon b90ca8badb sched: Introduce task_struct::user_cpus_ptr to track requested affinity
In preparation for saving and restoring the user-requested CPU affinity
mask of a task, add a new cpumask_t pointer to 'struct task_struct'.

If the pointer is non-NULL, then the mask is copied across fork() and
freed on task exit.

Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com>
Link: https://lore.kernel.org/r/20210730112443.23245-7-will@kernel.org
2021-08-20 12:33:00 +02:00
Will Deacon 234a503e67 sched: Reject CPU affinity changes based on task_cpu_possible_mask()
Reject explicit requests to change the affinity mask of a task via
set_cpus_allowed_ptr() if the requested mask is not a subset of the
mask returned by task_cpu_possible_mask(). This ensures that the
'cpus_mask' for a given task cannot contain CPUs which are incapable of
executing it, except in cases where the affinity is forced.

Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com>
Reviewed-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20210730112443.23245-6-will@kernel.org
2021-08-20 12:32:59 +02:00
Will Deacon 97c0054dbe cpuset: Cleanup cpuset_cpus_allowed_fallback() use in select_fallback_rq()
select_fallback_rq() only needs to recheck for an allowed CPU if the
affinity mask of the task has changed since the last check.

Return a 'bool' from cpuset_cpus_allowed_fallback() to indicate whether
the affinity mask was updated, and use this to elide the allowed check
when the mask has been left alone.

No functional change.

Suggested-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lore.kernel.org/r/20210730112443.23245-5-will@kernel.org
2021-08-20 12:32:59 +02:00
Will Deacon 431c69fac0 cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus()
Asymmetric systems may not offer the same level of userspace ISA support
across all CPUs, meaning that some applications cannot be executed by
some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do
not feature support for 32-bit applications on both clusters.

Modify guarantee_online_cpus() to take task_cpu_possible_mask() into
account when trying to find a suitable set of online CPUs for a given
task. This will avoid passing an invalid mask to set_cpus_allowed_ptr()
during ->attach() and will subsequently allow the cpuset hierarchy to be
taken into account when forcefully overriding the affinity mask for a
task which requires migration to a compatible CPU.

Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com>
Link: https://lkml.kernel.org/r/20210730112443.23245-4-will@kernel.org
2021-08-20 12:32:59 +02:00
Will Deacon d4b96fb92a cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1
If the scheduler cannot find an allowed CPU for a task,
cpuset_cpus_allowed_fallback() will widen the affinity to cpu_possible_mask
if cgroup v1 is in use.

In preparation for allowing architectures to provide their own fallback
mask, just return early if we're either using cgroup v1 or we're using
cgroup v2 with a mask that contains invalid CPUs. This will allow
select_fallback_rq() to figure out the mask by itself.

Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Quentin Perret <qperret@google.com>
Link: https://lkml.kernel.org/r/20210730112443.23245-3-will@kernel.org
2021-08-20 12:32:58 +02:00
Will Deacon 9ae606bc74 sched: Introduce task_cpu_possible_mask() to limit fallback rq selection
Asymmetric systems may not offer the same level of userspace ISA support
across all CPUs, meaning that some applications cannot be executed by
some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do
not feature support for 32-bit applications on both clusters.

On such a system, we must take care not to migrate a task to an
unsupported CPU when forcefully moving tasks in select_fallback_rq()
in response to a CPU hot-unplug operation.

Introduce a task_cpu_possible_mask() hook which, given a task argument,
allows an architecture to return a cpumask of CPUs that are capable of
executing that task. The default implementation returns the
cpu_possible_mask, since sane machines do not suffer from per-cpu ISA
limitations that affect scheduling. The new mask is used when selecting
the fallback runqueue as a last resort before forcing a migration to the
first active CPU.

Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com>
Reviewed-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20210730112443.23245-2-will@kernel.org
2021-08-20 12:32:58 +02:00
Josh Don 304000390f sched: Cgroup SCHED_IDLE support
This extends SCHED_IDLE to cgroups.

Interface: cgroup/cpu.idle.
 0: default behavior
 1: SCHED_IDLE

Extending SCHED_IDLE to cgroups means that we incorporate the existing
aspects of SCHED_IDLE; a SCHED_IDLE cgroup will count all of its
descendant threads towards the idle_h_nr_running count of all of its
ancestor cgroups. Thus, sched_idle_rq() will work properly.
Additionally, SCHED_IDLE cgroups are configured with minimum weight.

There are two key differences between the per-task and per-cgroup
SCHED_IDLE interface:

  - The cgroup interface allows tasks within a SCHED_IDLE hierarchy to
    maintain their relative weights. The entity that is "idle" is the
    cgroup, not the tasks themselves.

  - Since the idle entity is the cgroup, our SCHED_IDLE wakeup preemption
    decision is not made by comparing the current task with the woken
    task, but rather by comparing their matching sched_entity.

A typical use-case for this is a user that creates an idle and a
non-idle subtree. The non-idle subtree will dominate competition vs
the idle subtree, but the idle subtree will still be high priority vs
other users on the system. The latter is accomplished via comparing
matching sched_entity in the waken preemption path (this could also be
improved by making the sched_idle_rq() decision dependent on the
perspective of a specific task).

For now, we maintain the existing SCHED_IDLE semantics. Future patches
may make improvements that extend how we treat SCHED_IDLE entities.

The per-task_group idle field is an integer that currently only holds
either a 0 or a 1. This is explicitly typed as an integer to allow for
further extensions to this API. For example, a negative value may
indicate a highly latency-sensitive cgroup that should be preferred
for preemption/placement/etc.

Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20210730020019.1487127-2-joshdon@google.com
2021-08-20 12:32:58 +02:00
Valentin Schneider 0083242c93 sched/topology: Skip updating masks for non-online nodes
The scheduler currently expects NUMA node distances to be stable from
init onwards, and as a consequence builds the related data structures
once-and-for-all at init (see sched_init_numa()).

Unfortunately, on some architectures node distance is unreliable for
offline nodes and may very well change upon onlining.

Skip over offline nodes during sched_init_numa(). Track nodes that have
been onlined at least once, and trigger a build of a node's NUMA masks
when it is first onlined post-init.

Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210818074333.48645-1-srikar@linux.vnet.ibm.com
2021-08-20 12:32:57 +02:00
Peter Zijlstra 3c474b3239 sched: Fix Core-wide rq->lock for uninitialized CPUs
Eugene tripped over the case where rq_lock(), as called in a
for_each_possible_cpu() loop came apart because rq->core hadn't been
setup yet.

This is a somewhat unusual, but valid case.

Rework things such that rq->core is initialized to point at itself. IOW
initialize each CPU as a single threaded Core. CPU online will then join
the new CPU (thread) to an existing Core where needed.

For completeness sake, have CPU offline fully undo the state so as to
not presume the topology will match the next time it comes online.

Fixes: 9edeaea1bc ("sched: Core-wide rq->lock")
Reported-by: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josh Don <joshdon@google.com>
Tested-by: Eugene Syromiatnikov <esyr@redhat.com>
Link: https://lkml.kernel.org/r/YR473ZGeKqMs6kw+@hirez.programming.kicks-ass.net
2021-08-20 12:32:53 +02:00
Sebastian Andrzej Siewior b857174e68 locking/ww_mutex: Initialize waiter.ww_ctx properly
The consolidation of the debug code for mutex waiter intialization sets
waiter::ww_ctx to a poison value unconditionally. For regular mutexes this
is intended to catch the case where waiter_ww_ctx is dereferenced
accidentally.

For ww_mutex the poison value has to be overwritten either with a context
pointer or NULL for ww_mutexes without context.

The rework broke this as it made the store conditional on the context
pointer instead of the argument which signals whether ww_mutex code should
be compiled in or optiized out. As a result waiter::ww_ctx ends up with the
poison pointer for contextless ww_mutexes which causes a later dereference of
the poison pointer because it is != NULL.

Use the build argument instead so for ww_mutex the poison value is always
overwritten.

Fixes: c0afb0ffc0 ("locking/ww_mutex: Gather mutex_waiter initialization")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210819193030.zpwrpvvrmy7xxxiy@linutronix.de
2021-08-20 12:15:54 +02:00
Jakub Kicinski f444fea789 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/ptp/Kconfig:
  55c8fca1da ("ptp_pch: Restore dependency on PCI")
  e5f3155267 ("ethernet: fix PTP_1588_CLOCK dependencies")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-19 18:09:18 -07:00
Prankur Gupta 2c531639de bpf: Add support for {set|get} socket options from setsockopt BPF
Add logic to call bpf_setsockopt() and bpf_getsockopt() from setsockopt BPF
programs. An example use case is when the user sets the IPV6_TCLASS socket
option, we would also like to change the tcp-cc for that socket.

We don't have any use case for calling bpf_setsockopt() from supposedly read-
only sys_getsockopt(), so it is made available to BPF_CGROUP_SETSOCKOPT only
at this point.

Signed-off-by: Prankur Gupta <prankgup@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210817224221.3257826-2-prankgup@fb.com
2021-08-20 01:04:52 +02:00
Stanislav Fomichev 44779a4b85 bpf: Use kvmalloc for map keys in syscalls
Same as previous patch but for the keys. memdup_bpfptr is renamed
to kvmemdup_bpfptr (and converted to kvmalloc).

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210818235216.1159202-2-sdf@google.com
2021-08-20 00:09:49 +02:00
Stanislav Fomichev f0dce1d9b7 bpf: Use kvmalloc for map values in syscall
Use kvmalloc/kvfree for temporary value when manipulating a map via
syscall. kmalloc might not be sufficient for percpu maps where the value
is big (and further multiplied by hundreds of CPUs).

Can be reproduced with netcnt test on qemu with "-smp 255".

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210818235216.1159202-1-sdf@google.com
2021-08-20 00:09:38 +02:00