OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Martin KaFai Lau	8e7ae2518f	bpf: Sanitize the bpf_struct_ops tcp-cc name The bpf_struct_ops tcp-cc name should be sanitized in order to avoid problematic chars (e.g. whitespaces). This patch reuses the bpf_obj_name_cpy() for accepting the same set of characters in order to keep a consistent bpf programming experience. A "size" param is added. Also, the strlen is returned on success so that the caller (like the bpf_tcp_ca here) can error out on empty name. The existing callers of the bpf_obj_name_cpy() only need to change the testing statement to "if (err < 0)". For all these existing callers, the err will be overwritten later, so no extra change is needed for the new strlen return value. v3: - reverse xmas tree style v2: - Save the orig_src to avoid "end - size" (Andrii) Fixes: `0baf26b0fc` ("bpf: tcp: Support tcp_congestion_ops in bpf") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200314010209.1131542-1-kafai@fb.com	2020-03-17 20:40:19 +01:00
Greg Kroah-Hartman	5f3a481324	Merge branch 'for-5.7-console-exit' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk into tty-next We need the console patches in here as well for futher work from Andy. * 'for-5.7-console-exit' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk: console: Introduce ->exit() callback console: Don't notify user space when unregister non-listed console console: Avoid positive return code from unregister_console() console: Drop misleading comment console: Use for_each_console() helper in unregister_console() console: Drop double check for console_drivers being non-NULL console: Don't perform test for CON_BRL flag Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-03-17 15:26:43 +01:00
Thomas Hellstrom	17c4a2ae15	dma-mapping: Fix dma_pgprot() for unencrypted coherent pages When dma_mmap_coherent() sets up a mapping to unencrypted coherent memory under SEV encryption and sometimes under SME encryption, it will actually set up an encrypted mapping rather than an unencrypted, causing devices that DMAs from that memory to read encrypted contents. Fix this. When force_dma_unencrypted() returns true, the linear kernel map of the coherent pages have had the encryption bit explicitly cleared and the page content is unencrypted. Make sure that any additional PTEs we set up to these pages also have the encryption bit cleared by having dma_pgprot() return a protection with the encryption bit cleared in this case. Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lkml.kernel.org/r/20200304114527.3636-3-thomas_os@shipmail.org	2020-03-17 11:52:58 +01:00
Daniel Xu	38aca3071c	cgroupfs: Support user xattrs This patch turns on xattr support for cgroupfs. This is useful for letting non-root owners of delegated subtrees attach metadata to cgroups. One use case is for subtree owners to tell a userspace out of memory killer to bias away from killing specific subtrees. Tests: [/sys/fs/cgroup]# for i in $(seq 0 130); \ do setfattr workload.slice -n user.name$i -v wow; done setfattr: workload.slice: No space left on device setfattr: workload.slice: No space left on device setfattr: workload.slice: No space left on device [/sys/fs/cgroup]# for i in $(seq 0 130); \ do setfattr workload.slice --remove user.name$i; done setfattr: workload.slice: No such attribute setfattr: workload.slice: No such attribute setfattr: workload.slice: No such attribute [/sys/fs/cgroup]# for i in $(seq 0 130); \ do setfattr workload.slice -n user.name$i -v wow; done setfattr: workload.slice: No space left on device setfattr: workload.slice: No space left on device setfattr: workload.slice: No space left on device `seq 0 130` is inclusive, and 131 - 128 = 3, which is the number of errors we expect to see. [/data]# cat testxattr.c #include <sys/types.h> #include <sys/xattr.h> #include <stdio.h> #include <stdlib.h> int main() { char name[256]; char *buf = malloc(64 << 10); if (!buf) { perror("malloc"); return 1; } for (int i = 0; i < 4; ++i) { snprintf(name, 256, "user.bigone%d", i); if (setxattr("/sys/fs/cgroup/system.slice", name, buf, 64 << 10, 0)) { printf("setxattr failed on iteration=%d\n", i); return 1; } } return 0; } [/data]# ./a.out setxattr failed on iteration=2 [/data]# ./a.out setxattr failed on iteration=0 [/sys/fs/cgroup]# setfattr -x user.bigone0 system.slice/ [/sys/fs/cgroup]# setfattr -x user.bigone1 system.slice/ [/data]# ./a.out setxattr failed on iteration=2 Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Acked-by: Chris Down <chris@chrisdown.name> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2020-03-16 15:53:47 -04:00
Christoph Hellwig	999a5d1203	dma-direct: provide a arch_dma_clear_uncached hook This allows the arch code to reset the page tables to cached access when freeing a dma coherent allocation that was set to uncached using arch_dma_set_uncached. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com>	2020-03-16 10:48:12 +01:00
Christoph Hellwig	fa7e2247c5	dma-direct: make uncached_kernel_address more general Rename the symbol to arch_dma_set_uncached, and pass a size to it as well as allow an error return. That will allow reusing this hook for in-place pagetable remapping. As the in-place remap doesn't always require an explicit cache flush, also detangle ARCH_HAS_DMA_PREP_COHERENT from ARCH_HAS_DMA_SET_UNCACHED. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com>	2020-03-16 10:48:09 +01:00
Christoph Hellwig	3d0fc341c4	dma-direct: consolidate the error handling in dma_direct_alloc_pages Use a goto label to merge two error return cases. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com>	2020-03-16 10:48:06 +01:00
Kevin Grandemange	286c21de32	dma-coherent: fix integer overflow in the reserved-memory dma allocation pageno is an int and the PAGE_SHIFT shift is done on an int, overflowing if the memory is bigger than 2G This can be reproduced using for example a reserved-memory of 4G reserved-memory { #address-cells = <2>; #size-cells = <2>; ranges; reserved_dma: buffer@0 { compatible = "shared-dma-pool"; no-map; reg = <0x5 0x00000000 0x1 0x0>; }; }; Signed-off-by: Kevin Grandemange <kevin.grandemange@allegrodvt.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-03-16 10:43:02 +01:00
Linus Torvalds	34d5a4b336	Fix for yet another subtle futex issue. The futex code used ihold() to prevent inodes from vanishing, but ihold() does not guarantee inode persistence. Replace the inode pointer with a per boot, machine wide, unique inode identifier. The second commit fixes the breakage of the hash mechanism whihc causes a 100% performance regression. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl5uQJsTHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoYp4EAC5fr/AyRaIn/AEIZUmoyK6ELUaknfH Z788avxDB/t5GkzC9A2dMpybYi78tzLSAEfB8jYgwbrqExapVtiqvjGZ1RIi3HoN f/DWLnOb2s+yYQ3BQlHu4RdKONEzCqBwKFpElGRv3JzCY8Qeh5cQBzdqzvOEFmYw P7DJVtJRZ2dud7AzJ+xk6KuNIKCT2F7Djmtop6nq1EVw0J/2oYOVgQu76APBj7cj 32srLmpP4xcQiJmWLC5ksXKiZrMPnyNfwXhHFufNvJ2Re6+Wf8mmglqG/5DmA+Ns Sq3L7D7yXwtWQZ8Po1qnWhPDZVXQbWzHyTn4YAMJAK7yoO7mut8jgECt+A8vf4L+ hsc41c6THfdCQQ9gmxLL+c08nZGlmvIC4/1RsihNZ3kd2o4k6Ah9xFp8lBFcpjWd 7tuhakNqJvUOvB34t2AYqzMFbZ/FJG+QNGyIW0bTUn4YIgRPxI/zsdMxqGVBZ4oN 0iuy1kPLGbGAnLU9thkiVMmAyaPesuiB6f+mmzobEUgGI35GrCJi6a4YaTG1sqFn Gl8oPzcU2n+DWbVBfJrVFHJye7oi78kCw6wpNLBCJQp8NP8doAH0Sgspglg52E/p G4GGLz0vGauHBC5wQ3WYiGLImWbzC1dwKdcNE7dhuTgXbhz8ChVlOSU9Fu4+pGpq 6URL6DVTLwDZPg== =e2iB -----END PGP SIGNATURE----- Merge tag 'locking-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull futex fix from Thomas Gleixner: "Fix for yet another subtle futex issue. The futex code used ihold() to prevent inodes from vanishing, but ihold() does not guarantee inode persistence. Replace the inode pointer with a per boot, machine wide, unique inode identifier. The second commit fixes the breakage of the hash mechanism which causes a 100% performance regression" * tag 'locking-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: futex: Unbreak futex hashing futex: Fix inode life-time issue	2020-03-15 12:55:52 -07:00
Linus Torvalds	ffe6da91b0	A single fix adding the missing time namespace adjustment in sys/sysinfo which caused sys/sysinfo to be inconsistent with /proc/uptime when read from a task inside a time namespace. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl5uRM8THHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYodEtD/9G/4q0bnI9Oyghb1vSFK/cXTUq+gFZ oRpBbcs7GlN8Nj6I51hE7iWqPz+wvKuQQkuTvjewg+7yO+19IecD+SHxu2NQ7PCz W0PHRCqX32coPVeya9fxc5mdqtrG32uFaOfiL3UVGTwmkwfapfOOQgDPDFCf9vcu bBP8YpvQOkU/bqH5sXBCO3u34n9NK6dpQLjcnPSYF+recSUiJVa17F3LVHFcCuXH 5Ck4lY2W+xARIBwapZzz5rey4U8SIMEaHANkSmS11jpg5WB3jUOX80z6zGus15sN haoaxYSDjUwwKUOPrqglL/Dq4DkCVYyPSzyYM3IRaTV/LpH1Fh/e6i6kIjpXtCth d19rcklU8o2TIFw2qCK9V2Z7j71BOF10EYnf0MWNmAF5q2v/D6isQfwDz6sbU4mU TLPCOiAgrGbWAS87ywBCdLjom4sejuL9tEf+O9i0YuqLp95BE2F4zjY/m19r5UNm vCX1XEbga68Gxi+Oy39hhEanSAo6MhK6PEGH8KabAdkC4/ioyE5PzsccQA2lWwQp cvpEMhGyDSzi7BcCw3uwp7DipBMpoIQu54OZ7cSoVeM9qNcj9dh4Y/Z0mF8v+Ibf RucI7P68iloPMw8vWJV8gOm1xl2B7iCKrIs4TVxqf3J+ZzprWmtUPCo75vKo2vlf l4lvX4dTRQB6Fw== =Is++ -----END PGP SIGNATURE----- Merge tag 'timers-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Thomas Gleixner: "A single fix adding the missing time namespace adjustment in sys/sysinfo which caused sys/sysinfo to be inconsistent with /proc/uptime when read from a task inside a time namespace" * tag 'timers-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sys/sysinfo: Respect boottime inside time namespace	2020-03-15 12:48:21 -07:00
Eric Biggers	fba616a49f	PM / hibernate: Remove unnecessary compat ioctl overrides Since the SNAPSHOT_GET_IMAGE_SIZE, SNAPSHOT_AVAIL_SWAP_SIZE, and SNAPSHOT_ALLOC_SWAP_PAGE ioctls produce an loff_t result, and loff_t is always 64-bit even in the compat case, there's no reason to have the special compat handling for these ioctls. Just remove this unneeded code so that these ioctls call into snapshot_ioctl() directly, doing just the compat_ptr() conversion on the argument. (This unnecessary code was also causing a KMSAN false positive.) Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2020-03-14 11:55:08 +01:00
David S. Miller	44ef976ab3	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2020-03-13 The following pull-request contains BPF updates for your net-next tree. We've added 86 non-merge commits during the last 12 day(s) which contain a total of 107 files changed, 5771 insertions(+), 1700 deletions(-). The main changes are: 1) Add modify_return attach type which allows to attach to a function via BPF trampoline and is run after the fentry and before the fexit programs and can pass a return code to the original caller, from KP Singh. 2) Generalize BPF's kallsyms handling and add BPF trampoline and dispatcher objects to be visible in /proc/kallsyms so they can be annotated in stack traces, from Jiri Olsa. 3) Extend BPF sockmap to allow for UDP next to existing TCP support in order in order to enable this for BPF based socket dispatch, from Lorenz Bauer. 4) Introduce a new bpftool 'prog profile' command which attaches to existing BPF programs via fentry and fexit hooks and reads out hardware counters during that period, from Song Liu. Example usage: bpftool prog profile id 337 duration 3 cycles instructions llc_misses 4228 run_cnt 3403698 cycles (84.08%) 3525294 instructions # 1.04 insn per cycle (84.05%) 13 llc_misses # 3.69 LLC misses per million isns (83.50%) 5) Batch of improvements to libbpf, bpftool and BPF selftests. Also addition of a new bpf_link abstraction to keep in particular BPF tracing programs attached even when the applicaion owning them exits, from Andrii Nakryiko. 6) New bpf_get_current_pid_tgid() helper for tracing to perform PID filtering and which returns the PID as seen by the init namespace, from Carlos Neira. 7) Refactor of RISC-V JIT code to move out common pieces and addition of a new RV32G BPF JIT compiler, from Luke Nelson. 8) Add gso_size context member to __sk_buff in order to be able to know whether a given skb is GSO or not, from Willem de Bruijn. 9) Add a new bpf_xdp_output() helper which reuses XDP's existing perf RB output implementation but can be called from tracepoint programs, from Eelco Chaudron. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2020-03-13 20:52:03 -07:00
Jules Irenge	dcce11d545	bpf: Add missing annotations for __bpf_prog_enter() and __bpf_prog_exit() Sparse reports a warning at __bpf_prog_enter() and __bpf_prog_exit() warning: context imbalance in __bpf_prog_enter() - wrong count at exit warning: context imbalance in __bpf_prog_exit() - unexpected unlock The root cause is the missing annotation at __bpf_prog_enter() and __bpf_prog_exit() Add the missing __acquires(RCU) annotation Add the missing __releases(RCU) annotation Signed-off-by: Jules Irenge <jbi.octave@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200311010908.42366-2-jbi.octave@gmail.com	2020-03-13 20:55:07 +01:00
Jiri Olsa	7ac88eba18	bpf: Remove bpf_image tree Now that we have all the objects (bpf_prog, bpf_trampoline, bpf_dispatcher) linked in bpf_tree, there's no need to have separate bpf_image tree for images. Reverting the bpf_image tree together with struct bpf_image, because it's no longer needed. Also removing bpf_image_alloc function and adding the original bpf_jit_alloc_exec_page interface instead. The kernel_text_address function can now rely only on is_bpf_text_address, because it checks the bpf_tree that contains all the objects. Keeping bpf_image_ksym_add and bpf_image_ksym_del because they are useful wrappers with perf's ksymbol interface calls. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200312195610.346362-13-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-03-13 12:49:52 -07:00
Jiri Olsa	517b75e44c	bpf: Add dispatchers to kallsyms Adding dispatchers to kallsyms. It's displayed as bpf_dispatcher_<NAME> where NAME is the name of dispatcher. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200312195610.346362-12-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-03-13 12:49:52 -07:00
Jiri Olsa	a108f7dcfa	bpf: Add trampolines to kallsyms Adding trampolines to kallsyms. It's displayed as bpf_trampoline_<ID> [bpf] where ID is the BTF id of the trampoline function. Adding bpf_image_ksym_add/del functions that setup the start/end values and call KSYMBOL perf events handlers. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200312195610.346362-11-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-03-13 12:49:52 -07:00
Jiri Olsa	dba122fb5e	bpf: Add bpf_ksym_add/del functions Separating /proc/kallsyms add/del code and adding bpf_ksym_add/del functions for that. Moving bpf_prog_ksym_node_add/del functions to __bpf_ksym_add/del and changing their argument to 'struct bpf_ksym' object. This way we can call them for other bpf objects types like trampoline and dispatcher. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200312195610.346362-10-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-03-13 12:49:52 -07:00
Jiri Olsa	cbd76f8d5a	bpf: Add prog flag to struct bpf_ksym object Adding 'prog' bool flag to 'struct bpf_ksym' to mark that this object belongs to bpf_prog object. This change allows having bpf_prog objects together with other types (trampolines and dispatchers) in the single bpf_tree. It's used when searching for bpf_prog exception tables by the bpf_prog_ksym_find function, where we need to get the bpf_prog pointer. >From now we can safely add bpf_ksym support for trampoline or dispatcher objects, because we can differentiate them from bpf_prog objects. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200312195610.346362-9-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-03-13 12:49:52 -07:00
Jiri Olsa	eda0c92902	bpf: Add bpf_ksym_find function Adding bpf_ksym_find function that is used bpf bpf address lookup functions: __bpf_address_lookup is_bpf_text_address while keeping bpf_prog_kallsyms_find to be used only for lookup of bpf_prog objects (will happen in following changes). Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200312195610.346362-8-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-03-13 12:49:51 -07:00
Jiri Olsa	ca4424c920	bpf: Move ksym_tnode to bpf_ksym Moving ksym_tnode list node to 'struct bpf_ksym' object, so the symbol itself can be chained and used in other objects like bpf_trampoline and bpf_dispatcher. We need bpf_ksym object to be linked both in bpf_kallsyms via lnode for /proc/kallsyms and in bpf_tree via tnode for bpf address lookup functions like __bpf_address_lookup or bpf_prog_kallsyms_find. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200312195610.346362-7-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-03-13 12:49:51 -07:00
Jiri Olsa	ecb60d1c67	bpf: Move lnode list node to struct bpf_ksym Adding lnode list node to 'struct bpf_ksym' object, so the struct bpf_ksym itself can be chained and used in other objects like bpf_trampoline and bpf_dispatcher. Changing iterator to bpf_ksym in bpf_get_kallsym function. The ksym->start is holding the prog->bpf_func value, so it's ok to use it as value in bpf_get_kallsym. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200312195610.346362-6-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-03-13 12:49:51 -07:00
Jiri Olsa	bfea9a8574	bpf: Add name to struct bpf_ksym Adding name to 'struct bpf_ksym' object to carry the name of the symbol for bpf_prog, bpf_trampoline, bpf_dispatcher objects. The current benefit is that name is now generated only when the symbol is added to the list, so we don't need to generate it every time it's accessed. The future benefit is that we will have all the bpf objects symbols represented by struct bpf_ksym. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200312195610.346362-5-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-03-13 12:49:51 -07:00
Jiri Olsa	535911c80a	bpf: Add struct bpf_ksym Adding 'struct bpf_ksym' object that will carry the kallsym information for bpf symbol. Adding the start and end address to begin with. It will be used by bpf_prog, bpf_trampoline, bpf_dispatcher objects. The symbol_start/symbol_end values were originally used to sort bpf_prog objects. For the address displayed in /proc/kallsyms we are using prog->bpf_func value. I'm using the bpf_func value for program symbol start instead of the symbol_start, because it makes no difference for sorting bpf_prog objects and we can use it directly as an address to display it in /proc/kallsyms. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200312195610.346362-4-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-03-13 12:49:51 -07:00
Andrii Nakryiko	9886866836	bpf: Abstract away entire bpf_link clean up procedure Instead of requiring users to do three steps for cleaning up bpf_link, its anon_inode file, and unused fd, abstract that away into bpf_link_cleanup() helper. bpf_link_defunct() is removed, as it shouldn't be needed as an individual operation anymore. v1->v2: - keep bpf_link_cleanup() static for now (Daniel). Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200313002128.2028680-1-andriin@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-03-13 12:49:51 -07:00
David S. Miller	242a6df688	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Alexei Starovoitov says: ==================== pull-request: bpf 2020-03-12 The following pull-request contains BPF updates for your net tree. We've added 12 non-merge commits during the last 8 day(s) which contain a total of 12 files changed, 161 insertions(+), 15 deletions(-). The main changes are: 1) Andrii fixed two bugs in cgroup-bpf. 2) John fixed sockmap. 3) Luke fixed x32 jit. 4) Martin fixed two issues in struct_ops. 5) Yonghong fixed bpf_send_signal. 6) Yoshiki fixed BTF enum. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2020-03-13 11:13:45 -07:00
David S. Miller	1d34357931	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Minor overlapping changes, nothing serious. Signed-off-by: David S. Miller <davem@davemloft.net>	2020-03-12 22:34:48 -07:00
Eelco Chaudron	d831ee84bf	bpf: Add bpf_xdp_output() helper Introduce new helper that reuses existing xdp perf_event output implementation, but can be called from raw_tracepoint programs that receive 'struct xdp_buff *' as a tracepoint argument. Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/158348514556.2239.11050972434793741444.stgit@xdp-tutorial	2020-03-12 17:47:38 -07:00
Carlos Neira	b4490c5c4e	bpf: Added new helper bpf_get_ns_current_pid_tgid New bpf helper bpf_get_ns_current_pid_tgid, This helper will return pid and tgid from current task which namespace matches dev_t and inode number provided, this will allows us to instrument a process inside a container. Signed-off-by: Carlos Neira <cneirabustos@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200304204157.58695-3-cneirabustos@gmail.com	2020-03-12 17:33:11 -07:00
Linus Torvalds	1b51f69461	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from David Miller: "It looks like a decent sized set of fixes, but a lot of these are one liner off-by-one and similar type changes: 1) Fix netlink header pointer to calcular bad attribute offset reported to user. From Pablo Neira Ayuso. 2) Don't double clear PHY interrupts when ->did_interrupt is set, from Heiner Kallweit. 3) Add missing validation of various (devlink, nl802154, fib, etc.) attributes, from Jakub Kicinski. 4) Missing pos increments in various netfilter seq_next ops, from Vasily Averin. 5) Missing break in of_mdiobus_register() loop, from Dajun Jin. 6) Don't double bump tx_dropped in veth driver, from Jiang Lidong. 7) Work around FMAN erratum A050385, from Madalin Bucur. 8) Make sure ARP header is pulled early enough in bonding driver, from Eric Dumazet. 9) Do a cond_resched() during multicast processing of ipvlan and macvlan, from Mahesh Bandewar. 10) Don't attach cgroups to unrelated sockets when in interrupt context, from Shakeel Butt. 11) Fix tpacket ring state management when encountering unknown GSO types. From Willem de Bruijn. 12) Fix MDIO bus PHY resume by checking mdio_bus_phy_may_suspend() only in the suspend context. From Heiner Kallweit" git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (112 commits) net: systemport: fix index check to avoid an array out of bounds access tc-testing: add ETS scheduler to tdc build configuration net: phy: fix MDIO bus PM PHY resuming net: hns3: clear port base VLAN when unload PF net: hns3: fix RMW issue for VLAN filter switch net: hns3: fix VF VLAN table entries inconsistent issue net: hns3: fix "tc qdisc del" failed issue taprio: Fix sending packets without dequeueing them net: mvmdio: avoid error message for optional IRQ net: dsa: mv88e6xxx: Add missing mask of ATU occupancy register net: memcg: fix lockdep splat in inet_csk_accept() s390/qeth: implement smarter resizing of the RX buffer pool s390/qeth: refactor buffer pool code s390/qeth: use page pointers to manage RX buffer pool seg6: fix SRv6 L2 tunnels to use IANA-assigned protocol number net: dsa: Don't instantiate phylink for CPU/DSA ports unless needed net/packet: tpacket_rcv: do not increment ring index on drop sxgbe: Fix off by one in samsung driver strncpy size arg net: caif: Add lockdep expression to RCU traversal primitive MAINTAINERS: remove Sathya Perla as Emulex NIC maintainer ...	2020-03-12 16:19:19 -07:00
Tejun Heo	e7b20d9796	cgroup: Restructure release_agent_path handling cgrp->root->release_agent_path is protected by both cgroup_mutex and release_agent_path_lock and readers can hold either one. The dual-locking scheme was introduced while breaking a locking dependency issue around cgroup_mutex but doesn't make sense anymore given that the only remaining reader which uses cgroup_mutex is cgroup1_releaes_agent(). This patch updates cgroup1_release_agent() to use release_agent_path_lock so that release_agent_path is always protected only by release_agent_path_lock. While at it, convert strlen() based empty string checks to direct tests on the first character as suggested by Linus. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org>	2020-03-12 16:44:35 -04:00
Tejun Heo	a09833f7cd	Merge branch 'for-5.6-fixes' into for-5.7	2020-03-12 16:44:18 -04:00
Chris Wilson	00d5d15b06	workqueue: Mark up unlocked access to wq->first_flusher [ 7329.671518] BUG: KCSAN: data-race in flush_workqueue / flush_workqueue [ 7329.671549] [ 7329.671572] write to 0xffff8881f65fb250 of 8 bytes by task 37173 on cpu 2: [ 7329.671607] flush_workqueue+0x3bc/0x9b0 (kernel/workqueue.c:2844) [ 7329.672527] [ 7329.672540] read to 0xffff8881f65fb250 of 8 bytes by task 37175 on cpu 0: [ 7329.672571] flush_workqueue+0x28d/0x9b0 (kernel/workqueue.c:2835) Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2020-03-12 14:26:50 -04:00
Richard Guy Briggs	1320a4052e	audit: trigger accompanying records when no rules present When there are no audit rules registered, mandatory records (config, etc.) are missing their accompanying records (syscall, proctitle, etc.). This is due to audit context dummy set on syscall entry based on absence of rules that signals that no other records are to be printed. Clear the dummy bit if any record is generated. The proctitle context and dummy checks are pointless since the proctitle record will not be printed if no syscall records are printed. Please see upstream github issue https://github.com/linux-audit/audit-kernel/issues/120 Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2020-03-12 10:42:51 -04:00
Linus Torvalds	addcb1d0ee	for-linus-2020-03-10 -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXmebNQAKCRCRxhvAZXjc ohidAP4y7sujHKMe87Qd6RFQ+aPTB1cGVgBSyMV5DuvbTW0R9QEA/bWSUtye5+Ln WRDkXapGM2l36s02xspgokaAhYiFoAE= =a6eO -----END PGP SIGNATURE----- Merge tag 'for-linus-2020-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull thread fix from Christian Brauner: "This contains a single fix for a regression which was introduced when we introduced the ability to select a specific pid at process creation time. When this feature is requested, the error value will be set to -EPERM after exiting the pid allocation loop. This caused EPERM to be returned when e.g. the init process/child subreaper of the pid namespace has already died where we used to return ENOMEM before. The first patch here simply fixes the regression by unconditionally setting the return value back to ENOMEM again once we've successfully allocated the requested pid number. This should be easy to backport to v5.5. The second patch adds a comment explaining that we must keep returning ENOMEM since we've been doing it for a long time and have explicitly documented this behavior for userspace. This seemed worthwhile because we now have at least two separate example where people tried to change the return value to something other than ENOMEM (The first version of the regression fix did that too and the commit message links to an earlier patch that tried to do the same.). I have a simple regression test to make sure we catch this regression in the future but since that introduces a whole new selftest subdir and test files I'll keep this for v5.7" * tag 'for-linus-2020-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: pid: make ENOMEM return value more obvious pid: Fix error return value in some cases	2020-03-11 10:00:41 -07:00
Linus Torvalds	36feb99630	Have ftrace lookup_rec() return a consistent record otherwise it can break live patching. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXmj5ExQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6quQGAQDO35RBAQDGmpxnSCQPNwrzqokM8p8d 1e1xshwOVnwqgAEA7csC4u1n5Z8ncIl5Pd8ygt4nXeqw4AenHLeZIdhfegc= =+AeW -----END PGP SIGNATURE----- Merge tag 'trace-v5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull ftrace fix from Steven Rostedt: "Have ftrace lookup_rec() return a consistent record otherwise it can break live patching" * tag 'trace-v5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ftrace: Return the first found result in lookup_rec()	2020-03-11 09:54:59 -07:00
Artem Savkov	d9815bff6b	ftrace: Return the first found result in lookup_rec() It appears that ip ranges can overlap so. In that case lookup_rec() returns whatever results it got last even if it found nothing in last searched page. This breaks an obscure livepatch late module patching usecase: - load livepatch - load the patched module - unload livepatch - try to load livepatch again To fix this return from lookup_rec() as soon as it found the record containing searched-for ip. This used to be this way prior lookup_rec() introduction. Link: http://lkml.kernel.org/r/20200306174317.21699-1-asavkov@redhat.com Cc: stable@vger.kernel.org Fixes: `7e16f581a8` ("ftrace: Separate out functionality from ftrace_location_range()") Signed-off-by: Artem Savkov <asavkov@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-03-11 10:37:12 -04:00
Andrii Nakryiko	babf316409	bpf: Add bpf_link_new_file that doesn't install FD Add bpf_link_new_file() API for cases when we need to ensure anon_inode is successfully created before we proceed with expensive BPF program attachment procedure, which will require equally (if not more so) expensive and potentially failing compensation detachment procedure just because anon_inode creation failed. This API allows to simplify code by ensuring first that anon_inode is created and after BPF program is attached proceed with fd_install() that can't fail. After anon_inode file is created, link can't be just kfree()'d anymore, because its destruction will be performed by deferred file_operations->release call. For this, bpf_link API required specifying two separate operations: release() and dealloc(), former performing detachment only, while the latter frees memory used by bpf_link itself. dealloc() needs to be specified, because struct bpf_link is frequently embedded into link type-specific container struct (e.g., struct bpf_raw_tp_link), so bpf_link itself doesn't know how to properly free the memory. In case when anon_inode file was successfully created, but subsequent BPF attachment failed, bpf_link needs to be marked as "defunct", so that file's release() callback will perform only memory deallocation, but no detachment. Convert raw tracepoint and tracing attachment to new API and eliminate detachment from error handling path. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200309231051.1270337-1-andriin@fb.com	2020-03-11 14:02:48 +01:00
Shakeel Butt	e876ecc67d	cgroup: memcg: net: do not associate sock with unrelated cgroup We are testing network memory accounting in our setup and noticed inconsistent network memory usage and often unrelated cgroups network usage correlates with testing workload. On further inspection, it seems like mem_cgroup_sk_alloc() and cgroup_sk_alloc() are broken in irq context specially for cgroup v1. mem_cgroup_sk_alloc() and cgroup_sk_alloc() can be called in irq context and kind of assumes that this can only happen from sk_clone_lock() and the source sock object has already associated cgroup. However in cgroup v1, where network memory accounting is opt-in, the source sock can be unassociated with any cgroup and the new cloned sock can get associated with unrelated interrupted cgroup. Cgroup v2 can also suffer if the source sock object was created by process in the root cgroup or if sk_alloc() is called in irq context. The fix is to just do nothing in interrupt. WARNING: Please note that about half of the TCP sockets are allocated from the IRQ context, so, memory used by such sockets will not be accouted by the memcg. The stack trace of mem_cgroup_sk_alloc() from IRQ-context: CPU: 70 PID: 12720 Comm: ssh Tainted: 5.6.0-smp-DEV #1 Hardware name: ... Call Trace: <IRQ> dump_stack+0x57/0x75 mem_cgroup_sk_alloc+0xe9/0xf0 sk_clone_lock+0x2a7/0x420 inet_csk_clone_lock+0x1b/0x110 tcp_create_openreq_child+0x23/0x3b0 tcp_v6_syn_recv_sock+0x88/0x730 tcp_check_req+0x429/0x560 tcp_v6_rcv+0x72d/0xa40 ip6_protocol_deliver_rcu+0xc9/0x400 ip6_input+0x44/0xd0 ? ip6_protocol_deliver_rcu+0x400/0x400 ip6_rcv_finish+0x71/0x80 ipv6_rcv+0x5b/0xe0 ? ip6_sublist_rcv+0x2e0/0x2e0 process_backlog+0x108/0x1e0 net_rx_action+0x26b/0x460 __do_softirq+0x104/0x2a6 do_softirq_own_stack+0x2a/0x40 </IRQ> do_softirq.part.19+0x40/0x50 __local_bh_enable_ip+0x51/0x60 ip6_finish_output2+0x23d/0x520 ? ip6table_mangle_hook+0x55/0x160 __ip6_finish_output+0xa1/0x100 ip6_finish_output+0x30/0xd0 ip6_output+0x73/0x120 ? __ip6_finish_output+0x100/0x100 ip6_xmit+0x2e3/0x600 ? ipv6_anycast_cleanup+0x50/0x50 ? inet6_csk_route_socket+0x136/0x1e0 ? skb_free_head+0x1e/0x30 inet6_csk_xmit+0x95/0xf0 __tcp_transmit_skb+0x5b4/0xb20 __tcp_send_ack.part.60+0xa3/0x110 tcp_send_ack+0x1d/0x20 tcp_rcv_state_process+0xe64/0xe80 ? tcp_v6_connect+0x5d1/0x5f0 tcp_v6_do_rcv+0x1b1/0x3f0 ? tcp_v6_do_rcv+0x1b1/0x3f0 __release_sock+0x7f/0xd0 release_sock+0x30/0xa0 __inet_stream_connect+0x1c3/0x3b0 ? prepare_to_wait+0xb0/0xb0 inet_stream_connect+0x3b/0x60 __sys_connect+0x101/0x120 ? __sys_getsockopt+0x11b/0x140 __x64_sys_connect+0x1a/0x20 do_syscall_64+0x51/0x200 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The stack trace of mem_cgroup_sk_alloc() from IRQ-context: Fixes: `2d75807383` ("mm: memcontrol: consolidate cgroup socket tracking") Fixes: `d979a39d72` ("cgroup: duplicate cgroup reference when cloning sockets") Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-03-10 15:33:05 -07:00
Linus Torvalds	e941484541	Merge branch 'for-5.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - cgroup.procs listing related fixes. It didn't interlock properly with exiting tasks leaving a short window where a cgroup has empty cgroup.procs but still can't be removed and misbehaved on short reads. - psi_show() crash fix on 32bit ino archs - Empty release_agent handling fix * 'for-5.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup1: don't call release_agent when it is "" cgroup: fix psi_show() crash on 32bit ino archs cgroup: Iterate tasks that did not finish do_exit() cgroup: cgroup_procs_next should increase position index cgroup-v1: cgroup_pidlist_next should update position index	2020-03-10 15:05:45 -07:00
Linus Torvalds	2c1aca4bd3	Merge branch 'for-5.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixes from Tejun Heo: "Workqueue has been incorrectly round-robining per-cpu work items. Hillf's patch fixes that. The other patch documents memory-ordering properties of workqueue operations" * 'for-5.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: don't use wq_select_unbound_cpu() for bound works workqueue: Document (some) memory-ordering properties of {queue,schedule}_work()	2020-03-10 14:48:22 -07:00
Yoshiki Komachi	da6c7faeb1	bpf/btf: Fix BTF verification of enum members in struct/union btf_enum_check_member() was currently sure to recognize the size of "enum" type members in struct/union as the size of "int" even if its size was packed. This patch fixes BTF enum verification to use the correct size of member in BPF programs. Fixes: `179cde8cef` ("bpf: btf: Check members of struct/union") Signed-off-by: Yoshiki Komachi <komachi.yoshiki@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/1583825550-18606-2-git-send-email-komachi.yoshiki@gmail.com	2020-03-10 10:00:41 -07:00
Hillf Danton	aa202f1f56	workqueue: don't use wq_select_unbound_cpu() for bound works wq_select_unbound_cpu() is designed for unbound workqueues only, but it's wrongly called when using a bound workqueue too. Fixing this ensures work queued to a bound workqueue with cpu=WORK_CPU_UNBOUND always runs on the local CPU. Before, that would happen only if wq_unbound_cpumask happened to include it (likely almost always the case), or was empty, or we got lucky with forced round-robin placement. So restricting /sys/devices/virtual/workqueue/cpumask to a small subset of a machine's CPUs would cause some bound work items to run unexpectedly there. Fixes: `ef55718044` ("workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs") Cc: stable@vger.kernel.org # v4.5+ Signed-off-by: Hillf Danton <hdanton@sina.com> [dj: massage changelog] Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Tejun Heo <tj@kernel.org>	2020-03-10 10:30:51 -04:00
Greg Kroah-Hartman	cb05c6c82f	Merge 5.6-rc5 into tty-next We need the vt fixes in here and it resolves a merge issue with drivers/tty/vt/selection.c Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-03-10 10:02:49 +01:00
Andrii Nakryiko	1d8006abaa	bpf: Fix cgroup ref leak in cgroup_bpf_inherit on out-of-memory There is no compensating cgroup_bpf_put() for each ancestor cgroup in cgroup_bpf_inherit(). If compute_effective_progs returns error, those cgroups won't be freed ever. Fix it by putting them in cleanup code path. Fixes: `e10360f815` ("bpf: cgroup: prevent out-of-order release of cgroup bpf") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Roman Gushchin <guro@fb.com> Link: https://lore.kernel.org/bpf/20200309224017.1063297-1-andriin@fb.com	2020-03-09 19:58:54 -07:00
Andrii Nakryiko	62039c30c1	bpf: Initialize storage pointers to NULL to prevent freeing garbage pointer Local storage array isn't initialized, so if cgroup storage allocation fails for BPF_CGROUP_STORAGE_SHARED, error handling code will attempt to free uninitialized pointer for BPF_CGROUP_STORAGE_PERCPU storage type. Avoid this by always initializing storage pointers to NULLs. Fixes: `8bad74f984` ("bpf: extend cgroup bpf core to allow multiple cgroup storage types") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200309222756.1018737-1-andriin@fb.com	2020-03-09 19:56:48 -07:00
Christian Brauner	10dab84caf	pid: make ENOMEM return value more obvious The alloc_pid() codepath used to be simpler. With the introducation of the ability to choose specific pids in `49cb2fc42c` ("fork: extend clone3() to support setting a PID") it got more complex. It hasn't been super obvious that ENOMEM is returned when the pid namespace init process/child subreaper of the pid namespace has died. As can be seen from multiple attempts to improve this see e.g. [1] and most recently [2]. We regressed returning ENOMEM in [3] and [2] restored it. Let's add a comment on top explaining that this is historic and documented behavior and cannot easily be changed. [1]: `35f71bc0a0` ("fork: report pid reservation failure properly") [2]: `b26ebfe12f` ("pid: Fix error return value in some cases") [3]: `49cb2fc42c` ("fork: extend clone3() to support setting a PID") Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2020-03-09 23:40:05 +01:00
Thomas Gleixner	8d67743653	futex: Unbreak futex hashing The recent futex inode life time fix changed the ordering of the futex key union struct members, but forgot to adjust the hash function accordingly, As a result the hashing omits the leading 64bit and even hashes beyond the futex key causing a bad hash distribution which led to a ~100% performance regression. Hand in the futex key pointer instead of a random struct member and make the size calculation based of the struct offset. Fixes: `8019ad13ef` ("futex: Fix inode life-time issue") Reported-by: Rong Chen <rong.a.chen@intel.com> Decoded-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Rong Chen <rong.a.chen@intel.com> Link: https://lkml.kernel.org/r/87h7yy90ve.fsf@nanos.tec.linutronix.de	2020-03-09 22:33:09 +01:00
Corey Minyard	b26ebfe12f	pid: Fix error return value in some cases Recent changes to alloc_pid() allow the pid number to be specified on the command line. If set_tid_size is set, then the code scanning the levels will hard-set retval to -EPERM, overriding it's previous -ENOMEM value. After the code scanning the levels, there are error returns that do not set retval, assuming it is still set to -ENOMEM. So set retval back to -ENOMEM after scanning the levels. Fixes: `49cb2fc42c` ("fork: extend clone3() to support setting a PID") Signed-off-by: Corey Minyard <cminyard@mvista.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Adrian Reber <areber@redhat.com> Cc: <stable@vger.kernel.org> # 5.5 Link: https://lore.kernel.org/r/20200306172314.12232-1-minyard@acm.org [christian.brauner@ubuntu.com: fixup commit message] Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2020-03-08 14:22:58 +01:00
Alexander Sverdlin	87f2d1c662	genirq/irqdomain: Check pointer in irq_domain_alloc_irqs_hierarchy() irq_domain_alloc_irqs_hierarchy() has 3 call sites in the compilation unit but only one of them checks for the pointer which is being dereferenced inside the called function. Move the check into the function. This allows for catching the error instead of the following crash: Unable to handle kernel NULL pointer dereference at virtual address 00000000 PC is at 0x0 LR is at gpiochip_hierarchy_irq_domain_alloc+0x11f/0x140 ... [<c06c23ff>] (gpiochip_hierarchy_irq_domain_alloc) [<c0462a89>] (__irq_domain_alloc_irqs) [<c0462dad>] (irq_create_fwspec_mapping) [<c06c2251>] (gpiochip_to_irq) [<c06c1c9b>] (gpiod_to_irq) [<bf973073>] (gpio_irqs_init [gpio_irqs]) [<bf974048>] (gpio_irqs_exit+0xecc/0xe84 [gpio_irqs]) Code: bad PC value Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200306174720.82604-1-alexander.sverdlin@nokia.com	2020-03-08 13:03:41 +01:00
Thomas Gleixner	acd26bcf36	genirq: Provide interrupt injection mechanism Error injection mechanisms need a half ways safe way to inject interrupts as invoking generic_handle_irq() or the actual device interrupt handler directly from e.g. a debugfs write is not guaranteed to be safe. On x86 generic_handle_irq() is unsafe due to the hardware trainwreck which is the base of x86 interrupt delivery and affinity management. Move the irq debugfs injection code into a separate function which can be used by error injection code as well. The implementation prevents at least that state is corrupted, but it cannot close a very tiny race window on x86 which might result in a stale and not serviced device interrupt under very unlikely circumstances. This is explicitly for debugging and testing and not for production use or abuse in random driver code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lkml.kernel.org/r/20200306130623.990928309@linutronix.de	2020-03-08 11:06:42 +01:00
Thomas Gleixner	da90921acc	genirq: Sanitize state handling in check_irq_resend() The code sets IRQS_REPLAY unconditionally whether the resend happens or not. That doesn't have bad side effects right now, but inconsistent state is always a latent source of problems. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lkml.kernel.org/r/20200306130623.882129117@linutronix.de	2020-03-08 11:06:41 +01:00
Thomas Gleixner	1f85b1f5e1	genirq: Add return value to check_irq_resend() In preparation for an interrupt injection interface which can be used safely by error injection mechanisms. e.g. PCI-E/ AER, add a return value to check_irq_resend() so errors can be propagated to the caller. Split out the software resend code so the ugly #ifdef in check_irq_resend() goes away and the whole thing becomes readable. Fix up the caller in debugfs. The caller in irq_startup() does not care about the return value as this is unconditionally invoked for all interrupts and the resend is best effort anyway. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lkml.kernel.org/r/20200306130623.775200917@linutronix.de	2020-03-08 11:06:41 +01:00
Thomas Gleixner	c16816acd0	genirq: Add protection against unsafe usage of generic_handle_irq() In general calling generic_handle_irq() with interrupts disabled from non interrupt context is harmless. For some interrupt controllers like the x86 trainwrecks this is outright dangerous as it might corrupt state if an interrupt affinity change is pending. Add infrastructure which allows to mark interrupts as unsafe and catch such usage in generic_handle_irq(). Reported-by: sathyanarayanan.kuppuswamy@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lkml.kernel.org/r/20200306130623.590923677@linutronix.de	2020-03-08 11:06:40 +01:00
Thomas Gleixner	a740a423c3	genirq/debugfs: Add missing sanity checks to interrupt injection Interrupts cannot be injected when the interrupt is not activated and when a replay is already in progress. Fixes: `536e2e34bd` ("genirq/debugfs: Triggering of interrupts from userspace") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200306130623.500019114@linutronix.de	2020-03-08 11:06:40 +01:00
luanshi	b513df6780	irqdomain: Fix function documentation of __irq_domain_alloc_fwnode() The function got renamed at some point, but the kernel-doc was not updated. Signed-off-by: luanshi <zhangliguang@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/1583200125-58806-1-git-send-email-zhangliguang@linux.alibaba.com	2020-03-08 11:02:24 +01:00
Linus Torvalds	5dfcc13902	block-5.6-2020-03-07 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl5j8hwQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpnjID/4/XVrqtVNUzVoVOtkOyxyesBrJVMHEQEpJ PZssv835IStw0ENhxQJfGjPaIFc9Ff6PMkeN5KRAlMoEc+NkrJShF3owGf+6Bps7 rxpblPxaw+CJFa31YBDZVjMCvbVkDm40G5SsJh+xzdIjlWz7MppkkMPdrErPwY8V 0vnrIc+mKBKfBMZTwVkycYtp17LVgfXguledoWzxM1y47IW5UasKh8jdzhbu8Hvt zztdQrigUdb+9XnLGCZIY0JQOyrhJ5zQpZ40FzbvxdYrQZXOoYT8L7iFu/z0Wi7K p3a+G+B4WowtLYW78me4Uut5RrHq2XOehSypfujanQlpgXPGjS3TdHT3an2T8XPQ NyGsZsn/eLm3btNbhGUd8vqpQy5EmWhqmwvYk9tFAoSFLiLcvCC624b/TCYPL+gk 3ZiI7mXBMjHnUZ0J/RF6kZWTAZDvr/tE7UZt1f8r1eEr8VDzCNp5Pst+HCVIguYD g9eWF8oH6wYoj39UKf1k+vW2GjXGFsnfivObaxhyz03sAPXK2wQlzAe/4jZ24XNr TRtOXh97c3CbLAwdUHehlzzdR3U7h0n2KsmrTC5AGmLABmR79s7BJ0+pexuZituO LwU8+gpf7AugHTrLg1eNXAmBHW44I1ticXYiWcT4iSPn99kNIhlW+Jb1iTGoiu7n nXyS3b5SCw== =xwKl -----END PGP SIGNATURE----- Merge tag 'block-5.6-2020-03-07' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "Here are a few fixes that should go into this release. This contains: - Revert of a bad bcache patch from this merge window - Removed unused function (Daniel) - Fixup for the blktrace fix from Jan from this release (Cengiz) - Fix of deeper level bfqq overwrite in BFQ (Carlo)" * tag 'block-5.6-2020-03-07' of git://git.kernel.dk/linux-block: block, bfq: fix overwrite of bfq_group pointer in bfq_find_set_group() blktrace: fix dereference after null check Revert "bcache: ignore pending signals when creating gc and allocator thread" block: Remove used kblockd_schedule_work_on()	2020-03-07 14:14:38 -06:00
Linus Torvalds	fa883d6afb	for-linus-2020-03-07 -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXmNvpgAKCRCRxhvAZXjc ouFvAQDCzfOx1vcEP/nNhYBP2MPuafKclJcoJggC9rSmIvcLiQD/TI+LyHzplD+m MWSu9NZJ6h6qyjKJivja3/bs8DVEewU= =4gyS -----END PGP SIGNATURE----- Merge tag 'for-linus-2020-03-07' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux Pull thread fixes from Christian Brauner: "Here are a few hopefully uncontroversial fixes: - Use RCU_INIT_POINTER() when initializing rcu protected members in task_struct to fix sparse warnings. - Add pidfd_fdinfo_test binary to .gitignore file" * tag 'for-linus-2020-03-07' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux: selftests: pidfd: Add pidfd_fdinfo_test in .gitignore exit: Fix Sparse errors and warnings fork: Use RCU_INIT_POINTER() instead of rcu_access_pointer()	2020-03-07 08:01:43 -06:00
Dmitry Safonov	eaee41727e	sysctl/sysrq: Remove __sysrq_enabled copy Many embedded boards have a disconnected TTL level serial which can generate some garbage that can lead to spurious false sysrq detects. Currently, sysrq can be either completely disabled for serial console or always disabled (with CONFIG_MAGIC_SYSRQ_SERIAL), since commit `732dbf3a61` ("serial: do not accept sysrq characters via serial port") At Arista, we have such boards that can generate BREAK and random garbage. While disabling sysrq for serial console would solve the problem with spurious false sysrq triggers, it's also desirable to have a way to enable sysrq back. Having the way to enable sysrq was beneficial to debug lockups with a manual investigation in field and on the other side preventing false sysrq detections. As a preparation to add sysrq_toggle_support() call into uart, remove a private copy of sysrq_enabled from sysctl - it should reflect the actual status of sysrq. Furthermore, the private copy isn't correct already in case sysrq_always_enabled is true. So, remove __sysrq_enabled and use a getter-helper sysrq_mask() to check sysrq_key_op enabled status. Cc: Iurii Zaikin <yzaikin@google.com> Cc: Jiri Slaby <jslaby@suse.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Dmitry Safonov <dima@arista.com> Link: https://lore.kernel.org/r/20200302175135.269397-2-dima@arista.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-03-07 09:52:02 +01:00
Ingo Molnar	14533a16c4	thermal/cpu-cooling, sched/core: Move the arch_set_thermal_pressure() API to generic scheduler code drivers/base/arch_topology.c is only built if CONFIG_GENERIC_ARCH_TOPOLOGY=y, resulting in such build failures: cpufreq_cooling.c:(.text+0x1e7): undefined reference to `arch_set_thermal_pressure' Move it to sched/core.c instead, and keep it enabled on x86 despite us not having a arch_scale_thermal_pressure() facility there, to build-test this thing. Cc: Thara Gopinath <thara.gopinath@linaro.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>	2020-03-06 14:26:31 +01:00
Peter Xu	fd3eafda8f	sched/core: Remove rq.hrtick_csd_pending Now smp_call_function_single_async() provides the protection that we'll return with -EBUSY if the csd object is still pending, then we don't need the rq.hrtick_csd_pending any more. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20191216213125.9536-4-peterx@redhat.com	2020-03-06 13:42:28 +01:00
Peter Xu	5a18ceca63	smp: Allow smp_call_function_single_async() to insert locked csd Previously we will raise an warning if we want to insert a csd object which is with the LOCK flag set, and if it happens we'll also wait for the lock to be released. However, this operation does not match perfectly with how the function is named - the name with "_async" suffix hints that this function should not block, while we will. This patch changed this behavior by simply return -EBUSY instead of waiting, at the meantime we allow this operation to happen without warning the user to change this into a feature when the caller wants to "insert a csd object, if it's there, just wait for that one". This is pretty safe because in flush_smp_call_function_queue() for async csd objects (where csd->flags&SYNC is zero) we'll first do the unlock then we call the csd->func(). So if we see the csd->flags&LOCK is true in smp_call_function_single_async(), then it's guaranteed that csd->func() will be called after this smp_call_function_single_async() returns -EBUSY. Update the comment of the function too to refect this. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20191216213125.9536-2-peterx@redhat.com	2020-03-06 13:42:28 +01:00
Qais Yousef	d94a9df490	sched/rt: Remove unnecessary push for unfit tasks In task_woken_rt() and switched_to_rto() we try trigger push-pull if the task is unfit. But the logic is found lacking because if the task was the only one running on the CPU, then rt_rq is not in overloaded state and won't trigger a push. The necessity of this logic was under a debate as well, a summary of the discussion can be found in the following thread: https://lore.kernel.org/lkml/20200226160247.iqvdakiqbakk2llz@e107158-lin.cambridge.arm.com/ Remove the logic for now until a better approach is agreed upon. Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Fixes: `804d402fb6` ("sched/rt: Make RT capacity-aware") Link: https://lkml.kernel.org/r/20200302132721.8353-6-qais.yousef@arm.com	2020-03-06 12:57:29 +01:00
Qais Yousef	98ca645f82	sched/rt: Allow pulling unfitting task When implemented RT Capacity Awareness; the logic was done such that if a task was running on a fitting CPU, then it was sticky and we would try our best to keep it there. But as Steve suggested, to adhere to the strict priority rules of RT class; allow pulling an RT task to unfitting CPU to ensure it gets a chance to run ASAP. LINK: https://lore.kernel.org/lkml/20200203111451.0d1da58f@oasis.local.home/ Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Fixes: `804d402fb6` ("sched/rt: Make RT capacity-aware") Link: https://lkml.kernel.org/r/20200302132721.8353-5-qais.yousef@arm.com	2020-03-06 12:57:28 +01:00
Qais Yousef	a1bd02e1f2	sched/rt: Optimize cpupri_find() on non-heterogenous systems By introducing a new cpupri_find_fitness() function that takes the fitness_fn as an argument and only called when asym_system static key is enabled. cpupri_find() is now a wrapper function that calls cpupri_find_fitness() passing NULL as a fitness_fn, hence disabling the logic that handles fitness by default. LINK: https://lore.kernel.org/lkml/c0772fca-0a4b-c88d-fdf2-5715fcf8447b@arm.com/ Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Fixes: `804d402fb6` ("sched/rt: Make RT capacity-aware") Link: https://lkml.kernel.org/r/20200302132721.8353-4-qais.yousef@arm.com	2020-03-06 12:57:27 +01:00
Qais Yousef	b28bc1e002	sched/rt: Re-instate old behavior in select_task_rq_rt() When RT Capacity Aware support was added, the logic in select_task_rq_rt was modified to force a search for a fitting CPU if the task currently doesn't run on one. But if the search failed, and the search was only triggered to fulfill the fitness request; we could end up selecting a new CPU unnecessarily. Fix this and re-instate the original behavior by ensuring we bail out in that case. This behavior change only affected asymmetric systems that are using util_clamp to implement capacity aware. None asymmetric systems weren't affected. LINK: https://lore.kernel.org/lkml/20200218041620.GD28029@codeaurora.org/ Reported-by: Pavan Kondeti <pkondeti@codeaurora.org> Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Fixes: `804d402fb6` ("sched/rt: Make RT capacity-aware") Link: https://lkml.kernel.org/r/20200302132721.8353-3-qais.yousef@arm.com	2020-03-06 12:57:27 +01:00
Qais Yousef	d9cb236b94	sched/rt: cpupri_find: Implement fallback mechanism for !fit case When searching for the best lowest_mask with a fitness_fn passed, make sure we record the lowest_level that returns a valid lowest_mask so that we can use that as a fallback in case we fail to find a fitting CPU at all levels. The intention in the original patch was not to allow a down migration to unfitting CPU. But this missed the case where we are already running on unfitting one. With this change now RT tasks can still move between unfitting CPUs when they're already running on such CPU. And as Steve suggested; to adhere to the strict priority rules of RT, if a task is already running on a fitting CPU but due to priority it can't run on it, allow it to downmigrate to unfitting CPU so it can run. Reported-by: Pavan Kondeti <pkondeti@codeaurora.org> Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Fixes: `804d402fb6` ("sched/rt: Make RT capacity-aware") Link: https://lkml.kernel.org/r/20200302132721.8353-2-qais.yousef@arm.com Link: https://lore.kernel.org/lkml/20200203142712.a7yvlyo2y3le5cpn@e107158-lin/	2020-03-06 12:57:26 +01:00
Vincent Guittot	5ab297bab9	sched/fair: Fix reordering of enqueue/dequeue_task_fair() Even when a cgroup is throttled, the group se of a child cgroup can still be enqueued and its gse->on_rq stays true. When a task is enqueued on such child, we still have to update the load_avg and increase h_nr_running of the throttled cfs. Nevertheless, the 1st for_each_sched_entity() loop is skipped because of gse->on_rq == true and the 2nd loop because the cfs is throttled whereas we have to update both load_avg with the old h_nr_running and increase h_nr_running in such case. The same sequence can happen during dequeue when se moves to parent before breaking in the 1st loop. Note that the update of load_avg will effectively happen only once in order to sync up to the throttled time. Next call for updating load_avg will stop early because the clock stays unchanged. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Fixes: `6d4d22468d` ("sched/fair: Reorder enqueue/dequeue_task_fair path") Link: https://lkml.kernel.org/r/20200306084208.12583-1-vincent.guittot@linaro.org	2020-03-06 12:57:25 +01:00
Vincent Guittot	6212437f0f	sched/fair: Fix runnable_avg for throttled cfs When a cfs_rq is throttled, its group entity is dequeued and its running tasks are removed. We must update runnable_avg with the old h_nr_running and update group_se->runnable_weight with the new h_nr_running at each level of the hierarchy. Reviewed-by: Ben Segall <bsegall@google.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Fixes: `9f68395333` ("sched/pelt: Add a new runnable average signal") Link: https://lkml.kernel.org/r/20200227154115.8332-1-vincent.guittot@linaro.org	2020-03-06 12:57:25 +01:00
Yu Chen	ba4f7bc1de	sched/deadline: Make two functions static Since commit `06a76fe08d` ("sched/deadline: Move DL related code from sched/core.c to sched/deadline.c"), DL related code moved to deadline.c. Make the following two functions static since they're only used in deadline.c: dl_change_utilization() init_dl_rq_bw_ratio() Signed-off-by: Yu Chen <chen.yu@easystack.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20200228100329.16927-1-chen.yu@easystack.cn	2020-03-06 12:57:24 +01:00
Valentin Schneider	38502ab4bf	sched/topology: Don't enable EAS on SMT systems EAS already requires asymmetric CPU capacities to be enabled, and mixing this with SMT is an aberration, but better be safe than sorry. Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Quentin Perret <qperret@google.com> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20200227191433.31994-2-valentin.schneider@arm.com	2020-03-06 12:57:23 +01:00
Mel Gorman	0621df3154	sched/numa: Acquire RCU lock for checking idle cores during NUMA balancing Qian Cai reported the following bug: The linux-next commit `ff7db0bf24` ("sched/numa: Prefer using an idle CPU as a migration target instead of comparing tasks") introduced a boot warning, [ 86.520534][ T1] WARNING: suspicious RCU usage [ 86.520540][ T1] 5.6.0-rc3-next-20200227 #7 Not tainted [ 86.520545][ T1] ----------------------------- [ 86.520551][ T1] kernel/sched/fair.c:5914 suspicious rcu_dereference_check() usage! [ 86.520555][ T1] [ 86.520555][ T1] other info that might help us debug this: [ 86.520555][ T1] [ 86.520561][ T1] [ 86.520561][ T1] rcu_scheduler_active = 2, debug_locks = 1 [ 86.520567][ T1] 1 lock held by systemd/1: [ 86.520571][ T1] #0: ffff8887f4b14848 (&mm->mmap_sem#2){++++}, at: do_page_fault+0x1d2/0x998 [ 86.520594][ T1] [ 86.520594][ T1] stack backtrace: [ 86.520602][ T1] CPU: 1 PID: 1 Comm: systemd Not tainted 5.6.0-rc3-next-20200227 #7 task_numa_migrate() checks for idle cores when updating NUMA-related statistics. This relies on reading a RCU-protected structure in test_idle_cores() via this call chain task_numa_migrate -> update_numa_stats -> numa_idle_core -> test_idle_cores While the locking could be fine-grained, it is more appropriate to acquire the RCU lock for the entire scan of the domain. This patch removes the warning triggered at boot time. Reported-by: Qian Cai <cai@lca.pw> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Fixes: `ff7db0bf24` ("sched/numa: Prefer using an idle CPU as a migration target instead of comparing tasks") Link: https://lkml.kernel.org/r/20200227191804.GJ3818@techsingularity.net	2020-03-06 12:57:22 +01:00
Valentin Schneider	76c389ab2b	sched/fair: Fix kernel build warning in test_idle_cores() for !SMT NUMA Building against the tip/sched/core as `ff7db0bf24` ("sched/numa: Prefer using an idle CPU as a migration target instead of comparing tasks") with the arm64 defconfig (which doesn't have CONFIG_SCHED_SMT set) leads to: kernel/sched/fair.c:1525:20: warning: 'test_idle_cores' declared 'static' but never defined [-Wunused-function] static inline bool test_idle_cores(int cpu, bool def); ^~~~~~~~~~~~~~~ Rather than define it in its own CONFIG_SCHED_SMT #define island, bunch it up with test_idle_cores(). Reported-by: Anshuman Khandual <anshuman.khandual@arm.com> Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> [mgorman@techsingularity.net: Edit changelog, minor style change] Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Fixes: `ff7db0bf24` ("sched/numa: Prefer using an idle CPU as a migration target instead of comparing tasks") Link: https://lkml.kernel.org/r/20200303110258.1092-3-mgorman@techsingularity.net	2020-03-06 12:57:22 +01:00
Thara Gopinath	05289b90c2	sched/fair: Enable tuning of decay period Thermal pressure follows pelt signals which means the decay period for thermal pressure is the default pelt decay period. Depending on SoC characteristics and thermal activity, it might be beneficial to decay thermal pressure slower, but still in-tune with the pelt signals. One way to achieve this is to provide a command line parameter to set a decay shift parameter to an integer between 0 and 10. Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20200222005213.3873-10-thara.gopinath@linaro.org	2020-03-06 12:57:21 +01:00
Thara Gopinath	467b7d01c4	sched/fair: Update cpu_capacity to reflect thermal pressure cpu_capacity initially reflects the maximum possible capacity of a CPU. Thermal pressure on a CPU means this maximum possible capacity is unavailable due to thermal events. This patch subtracts the average thermal pressure for a CPU from its maximum possible capacity so that cpu_capacity reflects the remaining maximum capacity. Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20200222005213.3873-8-thara.gopinath@linaro.org	2020-03-06 12:57:20 +01:00
Thara Gopinath	b4eccf5f8e	sched/fair: Enable periodic update of average thermal pressure Introduce support in scheduler periodic tick and other CFS bookkeeping APIs to trigger the process of computing average thermal pressure for a CPU. Also consider avg_thermal.load_avg in others_have_blocked which allows for decay of pelt signals. Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20200222005213.3873-7-thara.gopinath@linaro.org	2020-03-06 12:57:20 +01:00
Thara Gopinath	765047932f	sched/pelt: Add support to track thermal pressure Extrapolating on the existing framework to track rt/dl utilization using pelt signals, add a similar mechanism to track thermal pressure. The difference here from rt/dl utilization tracking is that, instead of tracking time spent by a CPU running a RT/DL task through util_avg, the average thermal pressure is tracked through load_avg. This is because thermal pressure signal is weighted time "delta" capacity unlike util_avg which is binary. "delta capacity" here means delta between the actual capacity of a CPU and the decreased capacity a CPU due to a thermal event. In order to track average thermal pressure, a new sched_avg variable avg_thermal is introduced. Function update_thermal_load_avg can be called to do the periodic bookkeeping (accumulate, decay and average) of the thermal pressure. Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20200222005213.3873-2-thara.gopinath@linaro.org	2020-03-06 12:57:17 +01:00
Chris Wilson	f1dfdab694	sched/vtime: Prevent unstable evaluation of WARN(vtime->state) As the vtime is sampled under loose seqcount protection by kcpustat, the vtime fields may change as the code flows. Where logic dictates a field has a static value, use a READ_ONCE. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Fixes: `74722bb223` ("sched/vtime: Bring up complete kcpustat accessor") Link: https://lkml.kernel.org/r/20200123180849.28486-1-frederic@kernel.org	2020-03-06 12:57:16 +01:00
Ingo Molnar	1b10d388d0	Merge branch 'linus' into sched/core, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>	2020-03-06 12:49:56 +01:00
Ian Rogers	95ed6c707f	perf/cgroup: Order events in RB tree by cgroup id If one is monitoring 6 events on 20 cgroups the per-CPU RB tree will hold 120 events. The scheduling in of the events currently iterates over all events looking to see which events match the task's cgroup or its cgroup hierarchy. If a task is in 1 cgroup with 6 events, then 114 events are considered unnecessarily. This change orders events in the RB tree by cgroup id if it is present. This means scheduling in may go directly to events associated with the task's cgroup if one is present. The per-CPU iterator storage in visit_groups_merge is sized sufficent for an iterator per cgroup depth, where different iterators are needed for the task's cgroup and parent cgroups. By considering the set of iterators when visiting, the lowest group_index event may be selected and the insertion order group_index property is maintained. This also allows event rotation to function correctly, as although events are grouped into a cgroup, rotation always selects the lowest group_index event to rotate (delete/insert into the tree) and the min heap of iterators make it so that the group_index order is maintained. Signed-off-by: Ian Rogers <irogers@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20190724223746.153620-3-irogers@google.com	2020-03-06 11:57:01 +01:00
Ian Rogers	c2283c9368	perf/cgroup: Grow per perf_cpu_context heap storage Allow the per-CPU min heap storage to have sufficient space for per-cgroup iterators. Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ian Rogers <irogers@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20200214075133.181299-6-irogers@google.com	2020-03-06 11:57:00 +01:00
Ian Rogers	836196beb3	perf/core: Add per perf_cpu_context min_heap storage The storage required for visit_groups_merge's min heap needs to vary in order to support more iterators, such as when multiple nested cgroups' events are being visited. This change allows for 2 iterators and doesn't support growth. Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ian Rogers <irogers@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20200214075133.181299-5-irogers@google.com	2020-03-06 11:57:00 +01:00
Ian Rogers	6eef8a7116	perf/core: Use min_heap in visit_groups_merge() visit_groups_merge will pick the next event based on when it was inserted in to the context (perf_event group_index). Events may be per CPU or for any CPU, but in the future we'd also like to have per cgroup events to avoid searching all events for the events to schedule for a cgroup. Introduce a min heap for the events that maintains a property that the earliest inserted event is always at the 0th element. Initialize the heap with per-CPU and any-CPU events for the context. Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ian Rogers <irogers@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20200214075133.181299-4-irogers@google.com	2020-03-06 11:56:59 +01:00
Peter Zijlstra	98add2af89	perf/cgroup: Reorder perf_cgroup_connect() Move perf_cgroup_connect() after perf_event_alloc(), such that we can find/use the PMU's cpu context. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ian Rogers <irogers@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20200214075133.181299-2-irogers@google.com	2020-03-06 11:56:58 +01:00
Peter Zijlstra	2c2366c754	perf/core: Remove 'struct sched_in_data' We can deduce the ctx and cpuctx from the event, no need to pass them along. Remove the structure and pass in can_add_hw directly. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>	2020-03-06 11:56:58 +01:00
Peter Zijlstra	ab6f824cfd	perf/core: Unify {pinned,flexible}_sched_in() Less is more; unify the two very nearly identical function. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>	2020-03-06 11:56:55 +01:00
Ingo Molnar	1941011a8b	Merge branch 'perf/urgent' into perf/core, to pick up the latest fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>	2020-03-06 11:56:40 +01:00
Peter Zijlstra	4b39f99c22	futex: Remove {get,drop}_futex_key_refs() Now that {get,drop}_futex_key_refs() have become a glorified NOP, remove them entirely. The only thing get_futex_key_refs() is still doing is an smp_mb(), and now that we don't need to (ab)use existing atomic ops to obtain them, we can place it explicitly where we need it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>	2020-03-06 11:06:19 +01:00
Peter Zijlstra	222993395e	futex: Remove pointless mmgrap() + mmdrop() We always set 'key->private.mm' to 'current->mm', getting an extra reference on 'current->mm' is quite pointless, because as long as the task is blocked it isn't going to go away. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>	2020-03-06 11:06:18 +01:00
Peter Zijlstra	3867913c45	Merge branch 'locking/urgent'	2020-03-06 11:06:17 +01:00
Peter Zijlstra	8019ad13ef	futex: Fix inode life-time issue As reported by Jann, ihold() does not in fact guarantee inode persistence. And instead of making it so, replace the usage of inode pointers with a per boot, machine wide, unique inode identifier. This sequence number is global, but shared (file backed) futexes are rare enough that this should not become a performance issue. Reported-by: Jann Horn <jannh@google.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>	2020-03-06 11:06:15 +01:00
Eric Biggers	07b24c7c08	crypto: pcrypt - simplify error handling in pcrypt_create_aead() Simplify the error handling in pcrypt_create_aead() by taking advantage of crypto_grab_aead() now handling an ERR_PTR() name and by taking advantage of crypto_drop_aead() now accepting (as a no-op) a spawn that hasn't been grabbed yet. This required also making padata_free_shell() accept a NULL argument. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2020-03-06 12:28:24 +11:00
KP Singh	3e7c67d90e	bpf: Fix bpf_prog_test_run_tracing for !CONFIG_NET test_run.o is not built when CONFIG_NET is not set and bpf_prog_test_run_tracing being referenced in bpf_trace.o causes the linker error: ld: kernel/trace/bpf_trace.o:(.rodata+0x38): undefined reference to `bpf_prog_test_run_tracing' Add a __weak function in bpf_trace.c to handle this. Fixes: `da00d2f117` ("bpf: Add test ops for BPF_PROG_TYPE_TRACING") Signed-off-by: KP Singh <kpsingh@google.com> Reported-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200305220127.29109-1-kpsingh@chromium.org	2020-03-05 15:14:58 -08:00
KP Singh	69191754ff	bpf: Remove unnecessary CAP_MAC_ADMIN check While well intentioned, checking CAP_MAC_ADMIN for attaching BPF_MODIFY_RETURN tracing programs to "security_" functions is not necessary as tracing BPF programs already require CAP_SYS_ADMIN. Fixes: `6ba43b761c` ("bpf: Attachment verification for BPF_MODIFY_RETURN") Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200305204955.31123-1-kpsingh@chromium.org	2020-03-05 14:27:22 -08:00
Martin KaFai Lau	849b4d9458	bpf: Do not allow map_freeze in struct_ops map struct_ops map cannot support map_freeze. Otherwise, a struct_ops cannot be unregistered from the subsystem. Fixes: `85d33df357` ("bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200305013454.535397-1-kafai@fb.com	2020-03-05 14:15:49 -08:00
Martin KaFai Lau	8e5290e710	bpf: Return better error value in delete_elem for struct_ops map The current always succeed behavior in bpf_struct_ops_map_delete_elem() is not ideal for userspace tool. It can be improved to return proper error value. If it is in TOBEFREE, it means unregistration has been already done before but it is in progress and waiting for the subsystem to clear the refcnt to zero, so -EINPROGRESS. If it is INIT, it means the struct_ops has not been registered yet, so -ENOENT. Fixes: `85d33df357` ("bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200305013447.535326-1-kafai@fb.com	2020-03-05 14:15:48 -08:00
Yonghong Song	1bc7896e9e	bpf: Fix deadlock with rq_lock in bpf_send_signal() When experimenting with bpf_send_signal() helper in our production environment (5.2 based), we experienced a deadlock in NMI mode: #5 [ffffc9002219f770] queued_spin_lock_slowpath at ffffffff8110be24 #6 [ffffc9002219f770] _raw_spin_lock_irqsave at ffffffff81a43012 #7 [ffffc9002219f780] try_to_wake_up at ffffffff810e7ecd #8 [ffffc9002219f7e0] signal_wake_up_state at ffffffff810c7b55 #9 [ffffc9002219f7f0] __send_signal at ffffffff810c8602 #10 [ffffc9002219f830] do_send_sig_info at ffffffff810ca31a #11 [ffffc9002219f868] bpf_send_signal at ffffffff8119d227 #12 [ffffc9002219f988] bpf_overflow_handler at ffffffff811d4140 #13 [ffffc9002219f9e0] __perf_event_overflow at ffffffff811d68cf #14 [ffffc9002219fa10] perf_swevent_overflow at ffffffff811d6a09 #15 [ffffc9002219fa38] ___perf_sw_event at ffffffff811e0f47 #16 [ffffc9002219fc30] __schedule at ffffffff81a3e04d #17 [ffffc9002219fc90] schedule at ffffffff81a3e219 #18 [ffffc9002219fca0] futex_wait_queue_me at ffffffff8113d1b9 #19 [ffffc9002219fcd8] futex_wait at ffffffff8113e529 #20 [ffffc9002219fdf0] do_futex at ffffffff8113ffbc #21 [ffffc9002219fec0] __x64_sys_futex at ffffffff81140d1c #22 [ffffc9002219ff38] do_syscall_64 at ffffffff81002602 #23 [ffffc9002219ff50] entry_SYSCALL_64_after_hwframe at ffffffff81c00068 The above call stack is actually very similar to an issue reported by Commit `eac9153f2b` ("bpf/stackmap: Fix deadlock with rq_lock in bpf_get_stack()") by Song Liu. The only difference is bpf_send_signal() helper instead of bpf_get_stack() helper. The above deadlock is triggered with a perf_sw_event. Similar to Commit `eac9153f2b`, the below almost identical reproducer used tracepoint point sched/sched_switch so the issue can be easily caught. /* stress_test.c / #include <stdio.h> #include <stdlib.h> #include <sys/mman.h> #include <pthread.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #define THREAD_COUNT 1000 char filename; void worker(void p) { void ptr; int fd; char pptr; fd = open(filename, O_RDONLY); if (fd < 0) return NULL; while (1) { struct timespec ts = {0, 1000 + rand() % 2000}; ptr = mmap(NULL, 4096 * 64, PROT_READ, MAP_PRIVATE, fd, 0); usleep(1); if (ptr == MAP_FAILED) { printf("failed to mmap\n"); break; } munmap(ptr, 4096 * 64); usleep(1); pptr = malloc(1); usleep(1); pptr[0] = 1; usleep(1); free(pptr); usleep(1); nanosleep(&ts, NULL); } close(fd); return NULL; } int main(int argc, char argv[]) { void ptr; int i; pthread_t threads[THREAD_COUNT]; if (argc < 2) return 0; filename = argv[1]; for (i = 0; i < THREAD_COUNT; i++) { if (pthread_create(threads + i, NULL, worker, NULL)) { fprintf(stderr, "Error creating thread\n"); return 0; } } for (i = 0; i < THREAD_COUNT; i++) pthread_join(threads[i], NULL); return 0; } and the following command: 1. run `stress_test /bin/ls` in one windown 2. hack bcc trace.py with the following change: --- a/tools/trace.py +++ b/tools/trace.py @@ -513,6 +513,7 @@ BPF_PERF_OUTPUT(%s); __data.tgid = __tgid; __data.pid = __pid; bpf_get_current_comm(&__data.comm, sizeof(__data.comm)); + bpf_send_signal(10); %s %s %s.perf_submit(%s, &__data, sizeof(__data)); 3. in a different window run ./trace.py -p $(pidof stress_test) t:sched:sched_switch The deadlock can be reproduced in our production system. Similar to Song's fix, the fix is to delay sending signal if irqs is disabled to avoid deadlocks involving with rq_lock. With this change, my above stress-test in our production system won't cause deadlock any more. I also implemented a scale-down version of reproducer in the selftest (a subsequent commit). With latest bpf-next, it complains for the following potential deadlock. [ 32.832450] -> #1 (&p->pi_lock){-.-.}: [ 32.833100] _raw_spin_lock_irqsave+0x44/0x80 [ 32.833696] task_rq_lock+0x2c/0xa0 [ 32.834182] task_sched_runtime+0x59/0xd0 [ 32.834721] thread_group_cputime+0x250/0x270 [ 32.835304] thread_group_cputime_adjusted+0x2e/0x70 [ 32.835959] do_task_stat+0x8a7/0xb80 [ 32.836461] proc_single_show+0x51/0xb0 ... [ 32.839512] -> #0 (&(&sighand->siglock)->rlock){....}: [ 32.840275] __lock_acquire+0x1358/0x1a20 [ 32.840826] lock_acquire+0xc7/0x1d0 [ 32.841309] _raw_spin_lock_irqsave+0x44/0x80 [ 32.841916] __lock_task_sighand+0x79/0x160 [ 32.842465] do_send_sig_info+0x35/0x90 [ 32.842977] bpf_send_signal+0xa/0x10 [ 32.843464] bpf_prog_bc13ed9e4d3163e3_send_signal_tp_sched+0x465/0x1000 [ 32.844301] trace_call_bpf+0x115/0x270 [ 32.844809] perf_trace_run_bpf_submit+0x4a/0xc0 [ 32.845411] perf_trace_sched_switch+0x10f/0x180 [ 32.846014] __schedule+0x45d/0x880 [ 32.846483] schedule+0x5f/0xd0 ... [ 32.853148] Chain exists of: [ 32.853148] &(&sighand->siglock)->rlock --> &p->pi_lock --> &rq->lock [ 32.853148] [ 32.854451] Possible unsafe locking scenario: [ 32.854451] [ 32.855173] CPU0 CPU1 [ 32.855745] ---- ---- [ 32.856278] lock(&rq->lock); [ 32.856671] lock(&p->pi_lock); [ 32.857332] lock(&rq->lock); [ 32.857999] lock(&(&sighand->siglock)->rlock); Deadlock happens on CPU0 when it tries to acquire &sighand->siglock but it has been held by CPU1 and CPU1 tries to grab &rq->lock and cannot get it. This is not exactly the callstack in our production environment, but sympotom is similar and both locks are using spin_lock_irqsave() to acquire the lock, and both involves rq_lock. The fix to delay sending signal when irq is disabled also fixed this issue. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Cc: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200304191104.2796501-1-yhs@fb.com	2020-03-05 14:02:22 -08:00
Cengiz Can	153031a301	blktrace: fix dereference after null check There was a recent change in blktrace.c that added a RCU protection to `q->blk_trace` in order to fix a use-after-free issue during access. However the change missed an edge case that can lead to dereferencing of `bt` pointer even when it's NULL: Coverity static analyzer marked this as a FORWARD_NULL issue with CID 1460458. ``` /kernel/trace/blktrace.c: 1904 in sysfs_blk_trace_attr_store() 1898 ret = 0; 1899 if (bt == NULL) 1900 ret = blk_trace_setup_queue(q, bdev); 1901 1902 if (ret == 0) { 1903 if (attr == &dev_attr_act_mask) >>> CID 1460458: Null pointer dereferences (FORWARD_NULL) >>> Dereferencing null pointer "bt". 1904 bt->act_mask = value; 1905 else if (attr == &dev_attr_pid) 1906 bt->pid = value; 1907 else if (attr == &dev_attr_start_lba) 1908 bt->start_lba = value; 1909 else if (attr == &dev_attr_end_lba) ``` Added a reassignment with RCU annotation to fix the issue. Fixes: `c780e86dd4` ("blktrace: Protect q->blk_trace with RCU") Cc: stable@vger.kernel.org Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Cengiz Can <cengiz@kernel.wtf> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-05 13:42:40 -07:00
Tycho Andersen	51891498f2	seccomp: allow TSYNC and USER_NOTIF together The restriction introduced in `7a0df7fbc1` ("seccomp: Make NEW_LISTENER and TSYNC flags exclusive") is mostly artificial: there is enough information in a seccomp user notification to tell which thread triggered a notification. The reason it was introduced is because TSYNC makes the syscall return a thread-id on failure, and NEW_LISTENER returns an fd, and there's no way to distinguish between these two cases (well, I suppose the caller could check all fds it has, then do the syscall, and if the return value was an fd that already existed, then it must be a thread id, but bleh). Matthew would like to use these two flags together in the Chrome sandbox which wants to use TSYNC for video drivers and NEW_LISTENER to proxy syscalls. So, let's fix this ugliness by adding another flag, TSYNC_ESRCH, which tells the kernel to just return -ESRCH on a TSYNC error. This way, NEW_LISTENER (and any subsequent seccomp() commands that want to return positive values) don't conflict with each other. Suggested-by: Matthew Denton <mpdenton@google.com> Signed-off-by: Tycho Andersen <tycho@tycho.ws> Link: https://lore.kernel.org/r/20200304180517.23867-1-tycho@tycho.ws Signed-off-by: Kees Cook <keescook@chromium.org>	2020-03-04 14:48:54 -08:00
KP Singh	da00d2f117	bpf: Add test ops for BPF_PROG_TYPE_TRACING The current fexit and fentry tests rely on a different program to exercise the functions they attach to. Instead of doing this, implement the test operations for tracing which will also be used for BPF_MODIFY_RETURN in a subsequent patch. Also, clean up the fexit test to use the generated skeleton. Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200304191853.1529-7-kpsingh@chromium.org	2020-03-04 13:41:06 -08:00
KP Singh	6ba43b761c	bpf: Attachment verification for BPF_MODIFY_RETURN - Allow BPF_MODIFY_RETURN attachment only to functions that are: * Whitelisted for error injection by checking within_error_injection_list. Similar discussions happened for the bpf_override_return helper. * security hooks, this is expected to be cleaned up with the LSM changes after the KRSI patches introduce the LSM_HOOK macro: https://lore.kernel.org/bpf/20200220175250.10795-1-kpsingh@chromium.org/ - The attachment is currently limited to functions that return an int. This can be extended later other types (e.g. PTR). Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200304191853.1529-5-kpsingh@chromium.org	2020-03-04 13:41:05 -08:00
KP Singh	ae24082331	bpf: Introduce BPF_MODIFY_RETURN When multiple programs are attached, each program receives the return value from the previous program on the stack and the last program provides the return value to the attached function. The fmod_ret bpf programs are run after the fentry programs and before the fexit programs. The original function is only called if all the fmod_ret programs return 0 to avoid any unintended side-effects. The success value, i.e. 0 is not currently configurable but can be made so where user-space can specify it at load time. For example: int func_to_be_attached(int a, int b) { <--- do_fentry do_fmod_ret: <update ret by calling fmod_ret> if (ret != 0) goto do_fexit; original_function: <side_effects_happen_here> } <--- do_fexit The fmod_ret program attached to this function can be defined as: SEC("fmod_ret/func_to_be_attached") int BPF_PROG(func_name, int a, int b, int ret) { // This will skip the original function logic. return 1; } The first fmod_ret program is passed 0 in its return argument. Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200304191853.1529-4-kpsingh@chromium.org	2020-03-04 13:41:05 -08:00
KP Singh	88fd9e5352	bpf: Refactor trampoline update code As we need to introduce a third type of attachment for trampolines, the flattened signature of arch_prepare_bpf_trampoline gets even more complicated. Refactor the prog and count argument to arch_prepare_bpf_trampoline to use bpf_tramp_progs to simplify the addition and accounting for new attachment types. Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200304191853.1529-2-kpsingh@chromium.org	2020-03-04 13:41:05 -08:00
Tycho Andersen	2e5383d790	cgroup1: don't call release_agent when it is "" Older (and maybe current) versions of systemd set release_agent to "" when shutting down, but do not set notify_on_release to 0. Since `64e90a8acb` ("Introduce STATIC_USERMODEHELPER to mediate call_usermodehelper()"), we filter out such calls when the user mode helper path is "". However, when used in conjunction with an actual (i.e. non "") STATIC_USERMODEHELPER, the path is never "", so the real usermode helper will be called with argv[0] == "". Let's avoid this by not invoking the release_agent when it is "". Signed-off-by: Tycho Andersen <tycho@tycho.ws> Signed-off-by: Tejun Heo <tj@kernel.org>	2020-03-04 11:53:33 -05:00
Yu Chen	2333e82995	workqueue: Make workqueue_init*() return void The return values of workqueue_init() and workqueue_early_int() are always 0, and there is no usage of their return value. So just make them return void. Signed-off-by: Yu Chen <chen.yu@easystack.cn> Signed-off-by: Tejun Heo <tj@kernel.org>	2020-03-04 11:21:49 -05:00
Qian Cai	190ecb190a	cgroup: fix psi_show() crash on 32bit ino archs Similar to the commit `d749534322` ("cgroup: fix incorrect WARN_ON_ONCE() in cgroup_setup_root()"), cgroup_id(root_cgrp) does not equal to 1 on 32bit ino archs which triggers all sorts of issues with psi_show() on s390x. For example, BUG: KASAN: slab-out-of-bounds in collect_percpu_times+0x2d0/ Read of size 4 at addr 000000001e0ce000 by task read_all/3667 collect_percpu_times+0x2d0/0x798 psi_show+0x7c/0x2a8 seq_read+0x2ac/0x830 vfs_read+0x92/0x150 ksys_read+0xe2/0x188 system_call+0xd8/0x2b4 Fix it by using cgroup_ino(). Fixes: `743210386c` ("cgroup: use cgrp->kn->id as the cgroup ID") Signed-off-by: Qian Cai <cai@lca.pw> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org # v5.5	2020-03-04 11:15:58 -05:00
Waiman Long	d441dceb5d	tick/common: Make tick_periodic() check for missing ticks The tick_periodic() function is used at the beginning part of the bootup process for time keeping while the other clock sources are being initialized. The current code assumes that all the timer interrupts are handled in a timely manner with no missing ticks. That is not actually true. Some ticks are missed and there are some discrepancies between the tick time (jiffies) and the timestamp reported in the kernel log. Some systems, however, are more prone to missing ticks than the others. In the extreme case, the discrepancy can actually cause a soft lockup message to be printed by the watchdog kthread. For example, on a Cavium ThunderX2 Sabre arm64 system: [ 25.496379] watchdog: BUG: soft lockup - CPU#14 stuck for 22s! On that system, the missing ticks are especially prevalent during the smp_init() phase of the boot process. With an instrumented kernel, it was found that it took about 24s as reported by the timestamp for the tick to accumulate 4s of time. Investigation and bisection done by others seemed to point to the commit `73f3816609` ("arm64: Advertise mitigation of Spectre-v2, or lack thereof") as the culprit. It could also be a firmware issue as new firmware was promised that would fix the issue. To properly address this problem, stop assuming that there will be no missing tick in tick_periodic(). Modify it to follow the example of tick_do_update_jiffies64() by using another reference clock to check for missing ticks. Since the watchdog timer uses running_clock(), it is used here as the reference. With this applied, the soft lockup problem in the affected arm64 system is gone and tick time tracks much more closely to the timestamp time. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200207193929.27308-1-longman@redhat.com	2020-03-04 10:18:11 +01:00
Wen Yang	38f7b0b131	hrtimer: Cast explicitely to u32t in __ktime_divns() do_div() does a 64-by-32 division at least on 32bit platforms, while the divisor 'div' is explicitly casted to unsigned long, thus 64-bit on 64-bit platforms. The code already ensures that the divisor is less than 2^32. Hence the proper cast type is u32. Signed-off-by: Wen Yang <wenyang@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200130130851.29204-1-wenyang@linux.alibaba.com	2020-03-04 10:17:51 +01:00
Wen Yang	4cbbc3a0ee	timekeeping: Prevent 32bit truncation in scale64_check_overflow() While unlikely the divisor in scale64_check_overflow() could be >= 32bit in scale64_check_overflow(). do_div() truncates the divisor to 32bit at least on 32bit platforms. Use div64_u64() instead to avoid the truncation to 32-bit. [ tglx: Massaged changelog ] Signed-off-by: Wen Yang <wenyang@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200120100523.45656-1-wenyang@linux.alibaba.com	2020-03-04 10:17:51 +01:00
Eric W. Biederman	b95e31c07c	posix-cpu-timers: Stop disabling timers on mt-exec The reasons why the extra posix_cpu_timers_exit_group() invocation has been added are not entirely clear from the commit message. Today all that posix_cpu_timers_exit_group() does is stop timers that are tracking the task from firing. Every other operation on those timers is still allowed. The practical implication of this is posix_cpu_timer_del() which could not get the siglock after the thread group leader has exited (because sighand == NULL) would be able to run successfully because the timer was already dequeued. With that locking issue fixed there is no point in disabling all of the timers. So remove this ``tempoary'' hack. Fixes: `e0a7021710` ("posix-cpu-timers: workaround to suppress the problems with mt exec") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/87o8tityzs.fsf@x220.int.ebiederm.org	2020-03-04 09:54:55 +01:00
Eric W. Biederman	55e8c8eb2c	posix-cpu-timers: Store a reference to a pid not a task posix cpu timers do not handle the death of a process well. This is most clearly seen when a multi-threaded process calls exec from a thread that is not the leader of the thread group. The posix cpu timer code continues to pin the old thread group leader and is unable to find the siglock from there. This results in posix_cpu_timer_del being unable to delete a timer, posix_cpu_timer_set being unable to set a timer. Further to compensate for the problems in posix_cpu_timer_del on a multi-threaded exec all timers that point at the multi-threaded task are stopped. The code for the timers fundamentally needs to check if the target process/thread is alive. This needs an extra level of indirection. This level of indirection is already available in struct pid. So replace cpu.task with cpu.pid to get the needed extra layer of indirection. In addition to handling things more cleanly this reduces the amount of memory a timer can pin when a process exits and then is reaped from a task_struct to the vastly smaller struct pid. Fixes: `e0a7021710` ("posix-cpu-timers: workaround to suppress the problems with mt exec") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/87wo86tz6d.fsf@x220.int.ebiederm.org	2020-03-04 09:54:55 +01:00
Qian Cai	a534e924c5	PM: QoS: annotate data races in pm_qos_*_value() The target_value field in struct pm_qos_constraints is used for lockless access to the effective constraint value of a given QoS list, so the readers of it cannot expect it to always reflect the most recent effective constraint value. However, they can and do expect it to be equal to a valid effective constraint value computed at a certain time in the past (event though it may not be the most recent one), so add READ\|WRITE_ONCE() annotations around the target_value accesses to prevent the compiler from possibly causing that expectation to be unmet by generating code in an exceptionally convoluted way. Signed-off-by: Qian Cai <cai@lca.pw> [ rjw: Changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2020-03-03 23:34:51 +01:00
Steven Rostedt (VMware)	5412e0b763	tracing: Remove unused TRACE_BUFFER bits Commit `567cd4da54` ("ring-buffer: User context bit recursion checking") added the TRACE_BUFFER bits to be used in the current task's trace_recursion field. But the final submission of the logic removed the use of those bits, but never removed the bits themselves (they were never used in upstream Linux). These can be safely removed. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-03-03 17:34:02 -05:00
Steven Rostedt (VMware)	b396bfdebf	tracing: Have hwlat ts be first instance and record count of instances The hwlat tracer runs a loop of width time during a given window. It then reports the max latency over a given threshold and records a timestamp. But this timestamp is the time after the width has finished, and not the time it actually triggered. Record the actual time when the latency was greater than the threshold as well as the number of times it was greater in a given width per window. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-03-03 17:33:43 -05:00
Cyril Hrubis	ecc421e05b	sys/sysinfo: Respect boottime inside time namespace The sysinfo() syscall includes uptime in seconds but has no correction for time namespaces which makes it inconsistent with the /proc/uptime inside of a time namespace. Add the missing time namespace adjustment call. Signed-off-by: Cyril Hrubis <chrubis@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Dmitry Safonov <dima@arista.com> Link: https://lkml.kernel.org/r/20200303150638.7329-1-chrubis@suse.cz	2020-03-03 19:34:32 +01:00
Andrii Nakryiko	70ed506c3b	bpf: Introduce pinnable bpf_link abstraction Introduce bpf_link abstraction, representing an attachment of BPF program to a BPF hook point (e.g., tracepoint, perf event, etc). bpf_link encapsulates ownership of attached BPF program, reference counting of a link itself, when reference from multiple anonymous inodes, as well as ensures that release callback will be called from a process context, so that users can safely take mutex locks and sleep. Additionally, with a new abstraction it's now possible to generalize pinning of a link object in BPF FS, allowing to explicitly prevent BPF program detachment on process exit by pinning it in a BPF FS and let it open from independent other process to keep working with it. Convert two existing bpf_link-like objects (raw tracepoint and tracing BPF program attachments) into utilizing bpf_link framework, making them pinnable in BPF FS. More FD-based bpf_links will be added in follow up patches. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200303043159.323675-2-andriin@fb.com	2020-03-02 22:06:27 -08:00
Oleg Nesterov	6fb614920b	task_work_run: don't take ->pi_lock unconditionally As Peter pointed out, task_work() can avoid ->pi_lock and cmpxchg() if task->task_works == NULL && !PF_EXITING. And in fact the only reason why task_work_run() needs ->pi_lock is the possible race with task_work_cancel(), we can optimize this code and make the locking more clear. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-02 14:06:33 -07:00
Linus Torvalds	c105df5d86	Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Fix a scheduler statistics bug" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix statistics for find_idlest_group()	2020-03-02 06:51:43 -06:00
Eric W. Biederman	beb41d9cbe	posix-cpu-timers: Pass the task into arm_timer() The task has been already computed to take siglock before calling arm_timer. So pass the benefit of that labor into arm_timer(). Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/8736auvdt1.fsf@x220.int.ebiederm.org	2020-03-01 11:21:44 +01:00
Eric W. Biederman	60f2ceaa81	posix-cpu-timers: Remove unnecessary locking around cpu_clock_sample_group As of `e78c349679` ("time, signal: Protect resource use statistics with seqlock") cpu_clock_sample_group no longers needs siglock protection. Unfortunately no one realized it at the time. Remove the extra locking that is for cpu_clock_sample_group and not for cpu_clock_sample. This significantly simplifies the code. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/878skmvdts.fsf@x220.int.ebiederm.org	2020-03-01 11:21:44 +01:00
Eric W. Biederman	a2efdbf4fc	posix-cpu-timers: cpu_clock_sample_group() no longer needs siglock As of `e78c349679` ("time, signal: Protect resource use statistics with seqlock") cpu_clock_sample_group() no longer needs siglock protection so remove the stale comment. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/87eeuevduq.fsf@x220.int.ebiederm.org	2020-03-01 11:21:43 +01:00
David S. Miller	9f0ca0c1a5	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Alexei Starovoitov says: ==================== pull-request: bpf-next 2020-02-28 The following pull-request contains BPF updates for your net-next tree. We've added 41 non-merge commits during the last 7 day(s) which contain a total of 49 files changed, 1383 insertions(+), 499 deletions(-). The main changes are: 1) BPF and Real-Time nicely co-exist. 2) bpftool feature improvements. 3) retrieve bpf_sk_storage via INET_DIAG. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-29 15:53:35 -08:00
Eric W. Biederman	af9fe6d607	pid: Improve the comment about waiting in zap_pid_ns_processes Oleg wrote a very informative comment, but with the removal of proc_cleanup_work it is no longer accurate. Rewrite the comment so that it only talks about the details that are still relevant, and hopefully is a little clearer. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2020-02-28 16:29:12 -06:00
Linus Torvalds	2edc78b9a4	block-5.6-2020-02-28 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl5ZXl0QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpmltEACSA4yxdvWsVMYRCijjm/FzBEq7C8PSsWNK H8KPmjQiNpbSiZSi1uMVsHMlhBmBM8ZQ6Zc+gbZSs6xMqa4yP/iRtmzxnGonC7TB f5Ne2QuC0+TKMFJJTG8cCTzrgEOrWYkFKkmabzDml7HtloJtuzgArrmPzRj2sUfY J+d0osdp1b4U4sqhhAnxSm/zYJkGrQb+9UgNdVjhZCUzaX6oCcuK8xUwu2reLGlM qPkSKOywnl3WHCSCJXsCrNLKX0QWtIfMzlWDr40GYgHauPBbWfa8+1yHR1/lWP4R zyxGk63I9f6/+iQSUC72wP77bAVWKW674c53jgd7r1pNL9TiuK+a3E4lgf7eU+rl ymA/rM6Iy3SjTgiLT57PPOecsILJns3cwZ6mhvSRs0+zpao7LOQZXWdu9V0+Fyqo jur+7Ll/Qfdv/CLlM94DeBJtwhaTWiHTfDoaDHlG9p1/vvcWWXTUTIVPwAD+YGbj geio/bIWECnQxDtZL5Jikf5zsC76aQ46vvxK4F6RJlXj6jaugIbN3mWLsg17sUVf Y4h+IEVtQr0zA0LkPrfVdAS9IqVlTrMRDCkrrlhsDt7FI0orCOag7JOcmN2/nPn/ 2H22nl6i02b0gdGrScU5pyBswSPaImddH5tqE9uL2rK4hrFe6oKxL5EicTFDZmTh tHnukoc+Yg== =1bzv -----END PGP SIGNATURE----- Merge tag 'block-5.6-2020-02-28' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: - Passthrough insertion fix (Ming) - Kill off some unused arguments (John) - blktrace RCU fix (Jan) - Dead fields removal for null_blk (Dongli) - NVMe polled IO fix (Bijan) * tag 'block-5.6-2020-02-28' of git://git.kernel.dk/linux-block: nvme-pci: Hold cq_poll_lock while completing CQEs blk-mq: Remove some unused function arguments null_blk: remove unused fields in 'nullb_cmd' blktrace: Protect q->blk_trace with RCU blk-mq: insert passthrough request into hctx->dispatch directly	2020-02-28 11:43:30 -08:00
Eric W. Biederman	69879c01a0	proc: Remove the now unnecessary internal mount of proc There remains no more code in the kernel using pids_ns->proc_mnt, therefore remove it from the kernel. The big benefit of this change is that one of the most error prone and tricky parts of the pid namespace implementation, maintaining kernel mounts of proc is removed. In addition removing the unnecessary complexity of the kernel mount fixes a regression that caused the proc mount options to be ignored. Now that the initial mount of proc comes from userspace, those mount options are again honored. This fixes Android's usage of the proc hidepid option. Reported-by: Alistair Strachan <astrachan@google.com> Fixes: `e94591d0d9` ("proc: Convert proc_mount to use mount_ns.") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2020-02-28 12:06:14 -06:00
Linus Torvalds	3642859812	Power management fixes for 5.6-rc4 Fix a recent cpufreq initialization regression (Rafael Wysocki), revert a devfreq commit that made incompatible changes and broke user land on some systems (Orson Zhai), drop a stale reference to a document that has gone away recently (Jonathan Neuschäfer) and fix a typo in a hibernation code comment (Alexandre Belloni). -----BEGIN PGP SIGNATURE----- iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl5Y6MYSHHJqd0Byand5 c29ja2kubmV0AAoJEILEb/54YlRxqCAP/3zmwDsBMG07c+g9kOEJzuGLkpUrGIFp t1vR2+EmV7bCq7onCLSBz090wvogLCmNPpXmq45Ddt5hx3ltZSx4stwYdbsS/xmU t+9pFiCSSFufs1vaUzZqCuw53jpizRD4KKoUbCT2kpCgH4wjF7ftRIC5y2DYI57Y hOIfrI9XzwR2UZYRWZqCYtwpPkMicao8lXLfhs1mrZFk+AxwwwUFSp8/Gw94LBPD +ESctvsrGfmqEDEvcZUaYd2i0i5PsqnbnVy6wxWb3rhmhSXJTfiCzCGECItBJuSA HcZj3m2dohOXDxXK/Yevwiv+fBkqeeeP+KVlvtjzmYzTDW1bryLYQpxAsdpeYGEE bANjAl6MARqBDKUjvw+N6yDdAPPNbsT6zEK3f0w4eQNRpfmjhHP7TxrOsjC0DRJu EZpN6hWGQs1Vmndr+xaK0ITeRXB+7IQeXG6t6XDYZg2gS65hkhF5z1E5vjmn1Cen wpQsqvek6K7/DqqLqCH27yATNNKff7PNNCfpIazPE6CjDKBAiM9jsMpWkQf4E9/7 VKJFzmhy5P9VsHnUNonRN1K8Vl5eo+b7YOfmR8kvfsmNpfvKqaXM813owVMkOEq6 44kc2D38t3TnlH+GNgGfyPBnHchqlw/P0y0mypQ73Wyg+PQB+fx+W9vAXWDe6fBW /tIhZYILu90f =ySea -----END PGP SIGNATURE----- Merge tag 'pm-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "Fix a recent cpufreq initialization regression (Rafael Wysocki), revert a devfreq commit that made incompatible changes and broke user land on some systems (Orson Zhai), drop a stale reference to a document that has gone away recently (Jonathan Neuschäfer), and fix a typo in a hibernation code comment (Alexandre Belloni)" * tag 'pm-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpufreq: Fix policy initialization for internal governor drivers Revert "PM / devfreq: Modify the device name as devfreq(X) for sysfs" PM / hibernate: fix typo "reserverd_size" -> "reserved_size" Documentation: power: Drop reference to interface.rst	2020-02-28 08:49:52 -08:00
Madhuparna Bhowmik	22a34c6fe0	exit: Fix Sparse errors and warnings This patch fixes the following sparse error: kernel/exit.c:627:25: error: incompatible types in comparison expression And the following warning: kernel/exit.c:626:40: warning: incorrect type in assignment Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> [christian.brauner@ubuntu.com: edit commit message] Link: https://lore.kernel.org/r/20200130062028.4870-1-madhuparnabhowmik10@gmail.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2020-02-28 13:34:39 +01:00
Madhuparna Bhowmik	0c282b068e	fork: Use RCU_INIT_POINTER() instead of rcu_access_pointer() Use RCU_INIT_POINTER() instead of rcu_access_pointer() in copy_sighand(). Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> [christian.brauner@ubuntu.com: edit commit message] Link: https://lore.kernel.org/r/20200127175821.10833-1-madhuparnabhowmik10@gmail.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2020-02-28 13:34:36 +01:00
Rafael J. Wysocki	189c6967fe	Merge branches 'pm-sleep' and 'pm-devfreq' * pm-sleep: PM / hibernate: fix typo "reserverd_size" -> "reserved_size" Documentation: power: Drop reference to interface.rst * pm-devfreq: Revert "PM / devfreq: Modify the device name as devfreq(X) for sysfs"	2020-02-28 11:00:50 +01:00
Martin KaFai Lau	1ed4d92458	bpf: INET_DIAG support in bpf_sk_storage This patch adds INET_DIAG support to bpf_sk_storage. 1. Although this series adds bpf_sk_storage diag capability to inet sk, bpf_sk_storage is in general applicable to all fullsock. Hence, the bpf_sk_storage logic will operate on SK_DIAG_* nlattr. The caller will pass in its specific nesting nlattr (e.g. INET_DIAG_) as the argument. 2. The request will be like: INET_DIAG_REQ_SK_BPF_STORAGES (nla_nest) (defined in latter patch) SK_DIAG_BPF_STORAGE_REQ_MAP_FD (nla_put_u32) SK_DIAG_BPF_STORAGE_REQ_MAP_FD (nla_put_u32) ...... Considering there could have multiple bpf_sk_storages in a sk, instead of reusing INET_DIAG_INFO ("ss -i"), the user can select some specific bpf_sk_storage to dump by specifying an array of SK_DIAG_BPF_STORAGE_REQ_MAP_FD. If no SK_DIAG_BPF_STORAGE_REQ_MAP_FD is specified (i.e. an empty INET_DIAG_REQ_SK_BPF_STORAGES), it will dump all bpf_sk_storages of a sk. 3. The reply will be like: INET_DIAG_BPF_SK_STORAGES (nla_nest) (defined in latter patch) SK_DIAG_BPF_STORAGE (nla_nest) SK_DIAG_BPF_STORAGE_MAP_ID (nla_put_u32) SK_DIAG_BPF_STORAGE_MAP_VALUE (nla_reserve_64bit) SK_DIAG_BPF_STORAGE (nla_nest) SK_DIAG_BPF_STORAGE_MAP_ID (nla_put_u32) SK_DIAG_BPF_STORAGE_MAP_VALUE (nla_reserve_64bit) ...... 4. Unlike other INET_DIAG info of a sk which is pretty static, the size required to dump the bpf_sk_storage(s) of a sk is dynamic as the system adding more bpf_sk_storage_map. It is hard to set a static min_dump_alloc size. Hence, this series learns it at the runtime and adjust the cb->min_dump_alloc as it iterates all sk(s) of a system. The "unsigned int res_diag_size" in bpf_sk_storage_diag_put() is for this purpose. The next patch will update the cb->min_dump_alloc as it iterates the sk(s). Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200225230421.1975729-1-kafai@fb.com	2020-02-27 18:50:19 -08:00
David S. Miller	9f6e055907	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net The mptcp conflict was overlapping additions. The SMC conflict was an additional and removal happening at the same time. Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-27 18:31:39 -08:00
Gustavo A. R. Silva	d7f10df862	bpf: Replace zero-length array with flexible-array member The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit `7649773293` ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200227001744.GA3317@embeddedor	2020-02-28 01:21:02 +01:00
Linus Torvalds	ed5fa55918	audit/stable-5.6 PR 20200226 -----BEGIN PGP SIGNATURE----- iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAl5XHiYUHHBhdWxAcGF1 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXOn4w/9EWM7O3ZNr1BJb/p6Z1DxAHi1e11s UC5G6njvfihvddS1dhT00eSUX/Bk9WrCLNX5mzXbi5SD8iPP1bDk2L/1xFuR/Dji DHOruYsCqBI6t/rmEhwJiAaMWWQnGzHVfL+Fh7cOBhk2GNdfLUSs+BPXrRmchtH5 OlOogvpMCigK4pvYFN1AXNpvHnBYTvWrmC32KIjj9WcyGuY9Uz0TI+AOnKRJAp/Q YKV7vyw6ZpEt2/nT9lQHTWrMrsXLIybJLRF+BX9pVsz3bzi3u5FqROF1FntIOv1R OF5cI7ly9ixY2VMyxh/0k2X5aB/6BqyB2kmk58x0Wbqxcwi+rOqnMtm5chwpjQ+u bWN+mcB8XGe2Rj+/jjLjoiZReX6CnMuXZe+JNNYMJ+gV6x/Y9aCfg/G5K2ZvXdE2 Cc1mkhGACivLDlYxgpoELR5UC7KEQ2rlYenP6nYiNEsqyMCYPw6g0Rac4dvjppYv YYOp0xyZuK9U+aCGg6Tk2UoOgHRSsG7HUK3yn3Zy8ol5A5NtpOS+8wyyT1e3B/iX xjQfaGx6P66r9qhs6ikVMbOcx9ZZxJXERiA6pNjlGcX4HyZdOokuP946NSYJZ4TK EUA9f4KcKIyBrDMFne2r1O8fNLybkU+2/HkFdak8eMTVtJx0l9oPEAZV/6TOReBv Smf0PXFLTv/oRzc= =DyVs -----END PGP SIGNATURE----- Merge tag 'audit-pr-20200226' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit fixes from Paul Moore: "Two fixes for problems found by syzbot: - Moving audit filter structure fields into a union caused some problems in the code which populates that filter structure. We keep the union (that idea is a good one), but we are fixing the code so that it doesn't needlessly set fields in the union and mess up the error handling. - The audit_receive_msg() function wasn't validating user input as well as it should in all cases, we add the necessary checks" * tag 'audit-pr-20200226' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: always check the netlink payload length in audit_receive_msg() audit: fix error handling in audit_data_to_entry()	2020-02-27 11:01:22 -08:00
Vincent Guittot	289de35984	sched/fair: Fix statistics for find_idlest_group() sgs->group_weight is not set while gathering statistics in update_sg_wakeup_stats(). This means that a group can be classified as fully busy with 0 running tasks if utilization is high enough. This path is mainly used for fork and exec. Fixes: `57abff067a` ("sched/fair: Rework find_idlest_group()") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lore.kernel.org/r/20200218144534.4564-1-vincent.guittot@linaro.org	2020-02-27 10:08:27 +01:00
Linus Torvalds	91ad64a84e	Tracing updates: Change in API of bootconfig (before it comes live in a release) - Have a magic value "BOOTCONFIG" in initrd to know a bootconfig exists - Set CONFIG_BOOT_CONFIG to 'n' by default - Show error if "bootconfig" on cmdline but not compiled in - Prevent redefining the same value - Have a way to append values - Added a SELECT BLK_DEV_INITRD to fix a build failure Synthetic event fixes: - Switch to raw_smp_processor_id() for recording CPU value in preempt section. (No care for what the value actually is) - Fix samples always recording u64 values - Fix endianess - Check number of values matches number of fields - Fix a printing bug Fix of trace_printk() breaking postponed start up tests Make a function static that is only used in a single file. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXlW4vxQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qtioAP0WLEm3dWO0z3321h/a0DSshC+Bslu3 HDPTsGVGrXmvggEA/lr1ikRHd8PsO7zW8BfaZMxoXaTqXiuSrzEWxnMlFw0= =O8PM -----END PGP SIGNATURE----- Merge tag 'trace-v5.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing and bootconfig updates: "Fixes and changes to bootconfig before it goes live in a release. Change in API of bootconfig (before it comes live in a release): - Have a magic value "BOOTCONFIG" in initrd to know a bootconfig exists - Set CONFIG_BOOT_CONFIG to 'n' by default - Show error if "bootconfig" on cmdline but not compiled in - Prevent redefining the same value - Have a way to append values - Added a SELECT BLK_DEV_INITRD to fix a build failure Synthetic event fixes: - Switch to raw_smp_processor_id() for recording CPU value in preempt section. (No care for what the value actually is) - Fix samples always recording u64 values - Fix endianess - Check number of values matches number of fields - Fix a printing bug Fix of trace_printk() breaking postponed start up tests Make a function static that is only used in a single file" * tag 'trace-v5.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: bootconfig: Fix CONFIG_BOOTTIME_TRACING dependency issue bootconfig: Add append value operator support bootconfig: Prohibit re-defining value on same key bootconfig: Print array as multiple commands for legacy command line bootconfig: Reject subkey and value on same parent key tools/bootconfig: Remove unneeded error message silencer bootconfig: Add bootconfig magic word for indicating bootconfig explicitly bootconfig: Set CONFIG_BOOT_CONFIG=n by default tracing: Clear trace_state when starting trace bootconfig: Mark boot_config_checksum() static tracing: Disable trace_printk() on post poned tests tracing: Have synthetic event test use raw_smp_processor_id() tracing: Fix number printing bug in print_synth_event() tracing: Check that number of vals matches number of synth event fields tracing: Make synth_event trace functions endian-correct tracing: Make sure synth_event_trace() example always uses u64	2020-02-26 10:34:42 -08:00
Linus Torvalds	fda31c5029	signal: avoid double atomic counter increments for user accounting When queueing a signal, we increment both the users count of pending signals (for RLIMIT_SIGPENDING tracking) and we increment the refcount of the user struct itself (because we keep a reference to the user in the signal structure in order to correctly account for it when freeing). That turns out to be fairly expensive, because both of them are atomic updates, and particularly under extreme signal handling pressure on big machines, you can get a lot of cache contention on the user struct. That can then cause horrid cacheline ping-pong when you do these multiple accesses. So change the reference counting to only pin the user for the _first_ pending signal, and to unpin it when the last pending signal is dequeued. That means that when a user sees a lot of concurrent signal queuing - which is the only situation when this matters - the only atomic access needed is generally the 'sigpending' count update. This was noticed because of a particularly odd timing artifact on a dual-socket 96C/192T Cascade Lake platform: when you get into bad contention, on that machine for some reason seems to be much worse when the contention happens in the upper 32-byte half of the cacheline. As a result, the kernel test robot will-it-scale 'signal1' benchmark had an odd performance regression simply due to random alignment of the 'struct user_struct' (and pointed to a completely unrelated and apparently nonsensical commit for the regression). Avoiding the double increments (and decrements on the dequeueing side, of course) makes for much less contention and hugely improved performance on that will-it-scale microbenchmark. Quoting Feng Tang: "It makes a big difference, that the performance score is tripled! bump from original 17000 to 54000. Also the gap between 5.0-rc6 and 5.0-rc6+Jiri's patch is reduced to around 2%" [ The "2% gap" is the odd cacheline placement difference on that platform: under the extreme contention case, the effect of which half of the cacheline was hot was 5%, so with the reduced contention the odd timing artifact is reduced too ] It does help in the non-contended case too, but is not nearly as noticeable. Reported-and-tested-by: Feng Tang <feng.tang@intel.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Huang, Ying <ying.huang@intel.com> Cc: Philip Li <philip.li@intel.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-02-26 09:54:03 -08:00
Masami Hiramatsu	2910b5aa6f	bootconfig: Fix CONFIG_BOOTTIME_TRACING dependency issue Since commit `d8a953ddde` ("bootconfig: Set CONFIG_BOOT_CONFIG=n by default") also changed the CONFIG_BOOTTIME_TRACING to select CONFIG_BOOT_CONFIG to show the boot-time tracing on the menu, it introduced wrong dependencies with BLK_DEV_INITRD as below. WARNING: unmet direct dependencies detected for BOOT_CONFIG Depends on [n]: BLK_DEV_INITRD [=n] Selected by [y]: - BOOTTIME_TRACING [=y] && TRACING_SUPPORT [=y] && FTRACE [=y] && TRACING [=y] This makes the CONFIG_BOOT_CONFIG selects CONFIG_BLK_DEV_INITRD to fix this error and make CONFIG_BOOTTIME_TRACING=n by default, so that both boot-time tracing and boot configuration off but those appear on the menu list. Link: http://lkml.kernel.org/r/158264140162.23842.11237423518607465535.stgit@devnote2 Fixes: `d8a953ddde` ("bootconfig: Set CONFIG_BOOT_CONFIG=n by default") Reported-by: Randy Dunlap <rdunlap@infradead.org> Compiled-tested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-02-25 19:07:58 -05:00
Jan Kara	c780e86dd4	blktrace: Protect q->blk_trace with RCU KASAN is reporting that __blk_add_trace() has a use-after-free issue when accessing q->blk_trace. Indeed the switching of block tracing (and thus eventual freeing of q->blk_trace) is completely unsynchronized with the currently running tracing and thus it can happen that the blk_trace structure is being freed just while __blk_add_trace() works on it. Protect accesses to q->blk_trace by RCU during tracing and make sure we wait for the end of RCU grace period when shutting down tracing. Luckily that is rare enough event that we can afford that. Note that postponing the freeing of blk_trace to an RCU callback should better be avoided as it could have unexpected user visible side-effects as debugfs files would be still existing for a short while block tracing has been shut down. Link: https://bugzilla.kernel.org/show_bug.cgi?id=205711 CC: stable@vger.kernel.org Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Tested-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reported-by: Tristan Madani <tristmd@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-25 08:40:07 -07:00
David Miller	099bfaa731	bpf/stackmap: Dont trylock mmap_sem with PREEMPT_RT and interrupts disabled In a RT kernel down_read_trylock() cannot be used from NMI context and up_read_non_owner() is another problematic issue. So in such a configuration, simply elide the annotated stackmap and just report the raw IPs. In the longer term, it might be possible to provide a atomic friendly versions of the page cache traversal which will at least provide the info if the pages are resident and don't need to be paged in. [ tglx: Use IS_ENABLED() to avoid the #ifdeffery, fixup the irq work callback and add a comment ] Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200224145644.708960317@linutronix.de	2020-02-24 16:20:10 -08:00
Thomas Gleixner	66150d0dde	bpf, lpm: Make locking RT friendly The LPM trie map cannot be used in contexts like perf, kprobes and tracing as this map type dynamically allocates memory. The memory allocation happens with a raw spinlock held which is a truly spinning lock on a PREEMPT RT enabled kernel which disables preemption and interrupts. As RT does not allow memory allocation from such a section for various reasons, convert the raw spinlock to a regular spinlock. On a RT enabled kernel these locks are substituted by 'sleeping' spinlocks which provide the proper protection but keep the code preemptible. On a non-RT kernel regular spinlocks map to raw spinlocks, i.e. this does not cause any functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200224145644.602129531@linutronix.de	2020-02-24 16:20:10 -08:00
Thomas Gleixner	7f805d17f1	bpf: Prepare hashtab locking for PREEMPT_RT PREEMPT_RT forbids certain operations like memory allocations (even with GFP_ATOMIC) from atomic contexts. This is required because even with GFP_ATOMIC the memory allocator calls into code pathes which acquire locks with long held lock sections. To ensure the deterministic behaviour these locks are regular spinlocks, which are converted to 'sleepable' spinlocks on RT. The only true atomic contexts on an RT kernel are the low level hardware handling, scheduling, low level interrupt handling, NMIs etc. None of these contexts should ever do memory allocations. As regular device interrupt handlers and soft interrupts are forced into thread context, the existing code which does spin_lock(); alloc(GPF_ATOMIC); spin_unlock(); just works. In theory the BPF locks could be converted to regular spinlocks as well, but the bucket locks and percpu_freelist locks can be taken from arbitrary contexts (perf, kprobes, tracepoints) which are required to be atomic contexts even on RT. These mechanisms require preallocated maps, so there is no need to invoke memory allocations within the lock held sections. BPF maps which need dynamic allocation are only used from (forced) thread context on RT and can therefore use regular spinlocks which in turn allows to invoke memory allocations from the lock held section. To achieve this make the hash bucket lock a union of a raw and a regular spinlock and initialize and lock/unlock either the raw spinlock for preallocated maps or the regular variant for maps which require memory allocations. On a non RT kernel this distinction is neither possible nor required. spinlock maps to raw_spinlock and the extra code and conditional is optimized out by the compiler. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200224145644.509685912@linutronix.de	2020-02-24 16:20:10 -08:00
Thomas Gleixner	d01f9b198c	bpf: Factor out hashtab bucket lock operations As a preparation for making the BPF locking RT friendly, factor out the hash bucket lock operations into inline functions. This allows to do the necessary RT modification in one place instead of sprinkling it all over the place. No functional change. The now unused htab argument of the lock/unlock functions will be used in the next step which adds PREEMPT_RT support. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200224145644.420416916@linutronix.de	2020-02-24 16:20:10 -08:00
Thomas Gleixner	b6e5dae15a	bpf: Replace open coded recursion prevention in sys_bpf() The required protection is that the caller cannot be migrated to a different CPU as these functions end up in places which take either a hash bucket lock or might trigger a kprobe inside the memory allocator. Both scenarios can lead to deadlocks. The deadlock prevention is per CPU by incrementing a per CPU variable which temporarily blocks the invocation of BPF programs from perf and kprobes. Replace the open coded preempt_[dis\|en]able and __this_cpu_[inc\|dec] pairs with the new helper functions. These functions are already prepared to make BPF work on PREEMPT_RT enabled kernels. No functional change for !RT kernels. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200224145644.317843926@linutronix.de	2020-02-24 16:20:10 -08:00
Thomas Gleixner	085fee1a72	bpf: Use recursion prevention helpers in hashtab code The required protection is that the caller cannot be migrated to a different CPU as these places take either a hash bucket lock or might trigger a kprobe inside the memory allocator. Both scenarios can lead to deadlocks. The deadlock prevention is per CPU by incrementing a per CPU variable which temporarily blocks the invocation of BPF programs from perf and kprobes. Replace the open coded preempt_disable/enable() and this_cpu_inc/dec() pairs with the new recursion prevention helpers to prepare BPF to work on PREEMPT_RT enabled kernels. On a non-RT kernel the migrate disable/enable in the helpers map to preempt_disable/enable(), i.e. no functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200224145644.211208533@linutronix.de	2020-02-24 16:20:10 -08:00
David Miller	02ad059654	bpf: Use migrate_disable/enabe() in trampoline code. Instead of preemption disable/enable to reflect the purpose. This allows PREEMPT_RT to substitute it with an actual migration disable implementation. On non RT kernels this is still mapped to preempt_disable/enable(). Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200224145643.891428873@linutronix.de	2020-02-24 16:20:09 -08:00
David Miller	3d9f773cf2	bpf: Use bpf_prog_run_pin_on_cpu() at simple call sites. All of these cases are strictly of the form: preempt_disable(); BPF_PROG_RUN(...); preempt_enable(); Replace this with bpf_prog_run_pin_on_cpu() which wraps BPF_PROG_RUN() with: migrate_disable(); BPF_PROG_RUN(...); migrate_enable(); On non RT enabled kernels this maps to preempt_disable/enable() and on RT enabled kernels this solely prevents migration, which is sufficient as there is no requirement to prevent reentrancy to any BPF program from a preempting task. The only requirement is that the program stays on the same CPU. Therefore, this is a trivially correct transformation. The seccomp loop does not need protection over the loop. It only needs protection per BPF filter program [ tglx: Converted to bpf_prog_run_pin_on_cpu() ] Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200224145643.691493094@linutronix.de	2020-02-24 16:20:09 -08:00
Thomas Gleixner	569de905eb	bpf: Dont iterate over possible CPUs with interrupts disabled pcpu_freelist_populate() is disabling interrupts and then iterates over the possible CPUs. The reason why this disables interrupts is to silence lockdep because the invoked ___pcpu_freelist_push() takes spin locks. Neither the interrupt disabling nor the locking are required in this function because it's called during initialization and the resulting map is not yet visible to anything. Split out the actual push assignement into an inline, call it from the loop and remove the interrupt disable. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200224145643.365930116@linutronix.de	2020-02-24 16:18:20 -08:00
Thomas Gleixner	8a37963c7a	bpf: Remove recursion prevention from rcu free callback If an element is freed via RCU then recursion into BPF instrumentation functions is not a concern. The element is already detached from the map and the RCU callback does not hold any locks on which a kprobe, perf event or tracepoint attached BPF program could deadlock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200224145643.259118710@linutronix.de	2020-02-24 16:18:20 -08:00
Thomas Gleixner	1d7bf6b7d3	perf/bpf: Remove preempt disable around BPF invocation The BPF invocation from the perf event overflow handler does not require to disable preemption because this is called from NMI or at least hard interrupt context which is already non-preemptible. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200224145643.151953573@linutronix.de	2020-02-24 16:18:20 -08:00
Thomas Gleixner	b0a81b94cc	bpf/trace: Remove redundant preempt_disable from trace_call_bpf() Similar to __bpf_trace_run this is redundant because __bpf_trace_run() is invoked from a trace point via __DO_TRACE() which already disables preemption _before_ invoking any of the functions which are attached to a trace point. Remove it and add a cant_sleep() check. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200224145643.059995527@linutronix.de	2020-02-24 16:18:20 -08:00
Alexei Starovoitov	70ed0706a4	bpf: disable preemption for bpf progs attached to uprobe trace_call_bpf() no longer disables preemption on its own. All callers of this function has to do it explicitly. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de>	2020-02-24 16:17:14 -08:00
Thomas Gleixner	1b7a51a63b	bpf/trace: Remove EXPORT from trace_call_bpf() All callers are built in. No point to export this. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-02-24 16:16:38 -08:00
Thomas Gleixner	f03efe49bd	bpf/tracing: Remove redundant preempt_disable() in __bpf_trace_run() __bpf_trace_run() disables preemption around the BPF_PROG_RUN() invocation. This is redundant because __bpf_trace_run() is invoked from a trace point via __DO_TRACE() which already disables preemption _before_ invoking any of the functions which are attached to a trace point. Remove it and add a cant_sleep() check. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200224145642.847220186@linutronix.de	2020-02-24 16:12:20 -08:00
Thomas Gleixner	dbca151cad	bpf: Update locking comment in hashtab code The comment where the bucket lock is acquired says: /* bpf_map_update_elem() can be called in_irq() */ which is not really helpful and aside of that it does not explain the subtle details of the hash bucket locks expecially in the context of BPF and perf, kprobes and tracing. Add a comment at the top of the file which explains the protection scopes and the details how potential deadlocks are prevented. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200224145642.755793061@linutronix.de	2020-02-24 16:12:20 -08:00
Thomas Gleixner	2ed905c521	bpf: Enforce preallocation for instrumentation programs on RT Aside of the general unsafety of run-time map allocation for instrumentation type programs RT enabled kernels have another constraint: The instrumentation programs are invoked with preemption disabled, but the memory allocator spinlocks cannot be acquired in atomic context because they are converted to 'sleeping' spinlocks on RT. Therefore enforce map preallocation for these programs types when RT is enabled. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200224145642.648784007@linutronix.de	2020-02-24 16:12:19 -08:00
Thomas Gleixner	94dacdbd5d	bpf: Tighten the requirements for preallocated hash maps The assumption that only programs attached to perf NMI events can deadlock on memory allocators is wrong. Assume the following simplified callchain: kmalloc() from regular non BPF context cache empty freelist empty lock(zone->lock); tracepoint or kprobe BPF() update_elem() lock(bucket) kmalloc() cache empty freelist empty lock(zone->lock); <- DEADLOCK There are other ways which do not involve locking to create wreckage: kmalloc() from regular non BPF context local_irq_save(); ... obj = slab_first(); kprobe() BPF() update_elem() lock(bucket) kmalloc() local_irq_save(); ... obj = slab_first(); <- Same object as above ... So preallocation _must_ be enforced for all variants of intrusive instrumentation. Unfortunately immediate enforcement would break backwards compatibility, so for now such programs still are allowed to run, but a one time warning is emitted in dmesg and the verifier emits a warning in the verifier log as well so developers are made aware about this and can fix their programs before the enforcement becomes mandatory. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200224145642.540542802@linutronix.de	2020-02-24 16:12:19 -08:00
Paul Moore	7561252892	audit: always check the netlink payload length in audit_receive_msg() This patch ensures that we always check the netlink payload length in audit_receive_msg() before we take any action on the payload itself. Cc: stable@vger.kernel.org Reported-by: syzbot+399c44bf1f43b8747403@syzkaller.appspotmail.com Reported-by: syzbot+e4b12d8d202701f08b6d@syzkaller.appspotmail.com Signed-off-by: Paul Moore <paul@paul-moore.com>	2020-02-24 16:38:57 -05:00
Eric W. Biederman	7bc3e6e55a	proc: Use a list of inodes to flush from proc Rework the flushing of proc to use a list of directory inodes that need to be flushed. The list is kept on struct pid not on struct task_struct, as there is a fixed connection between proc inodes and pids but at least for the case of de_thread the pid of a task_struct changes. This removes the dependency on proc_mnt which allows for different mounts of proc having different mount options even in the same pid namespace and this allows for the removal of proc_mnt which will trivially the first mount of proc to honor it's mount options. This flushing remains an optimization. The functions pid_delete_dentry and pid_revalidate ensure that ordinary dcache management will not attempt to use dentries past the point their respective task has died. When unused the shrinker will eventually be able to remove these dentries. There is a case in de_thread where proc_flush_pid can be called early for a given pid. Which winds up being safe (if suboptimal) as this is just an optiimization. Only pid directories are put on the list as the other per pid files are children of those directories and d_invalidate on the directory will get them as well. So that the pid can be used during flushing it's reference count is taken in release_task and dropped in proc_flush_pid. Further the call of proc_flush_pid is moved after the tasklist_lock is released in release_task so that it is certain that the pid has already been unhashed when flushing it taking place. This removes a small race where a dentry could recreated. As struct pid is supposed to be small and I need a per pid lock I reuse the only lock that currently exists in struct pid the the wait_pidfd.lock. The net result is that this adds all of this functionality with just a little extra list management overhead and a single extra pointer in struct pid. v2: Initialize pid->inodes. I somehow failed to get that initialization into the initial version of the patch. A boot failure was reported by "kernel test robot <lkp@intel.com>", and failure to initialize that pid->inodes matches all of the reported symptoms. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-02-24 10:14:44 -06:00
Mel Gorman	a0f03b617c	sched/numa: Stop an exhastive search if a reasonable swap candidate or idle CPU is found When domains are imbalanced or overloaded a search of all CPUs on the target domain is searched and compared with task_numa_compare. In some circumstances, a candidate is found that is an obvious win. o A task can move to an idle CPU and an idle CPU is found o A swap candidate is found that would move to its preferred domain This patch terminates the search when either condition is met. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Phil Auld <pauld@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Link: https://lore.kernel.org/r/20200224095223.13361-14-mgorman@techsingularity.net	2020-02-24 11:36:40 +01:00
Mel Gorman	88cca72c96	sched/numa: Bias swapping tasks based on their preferred node When swapping tasks for NUMA balancing, it is preferred that tasks move to or remain on their preferred node. When considering an imbalance, encourage tasks to move to their preferred node and discourage tasks from moving away from their preferred node. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Phil Auld <pauld@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Link: https://lore.kernel.org/r/20200224095223.13361-13-mgorman@techsingularity.net	2020-02-24 11:36:39 +01:00
Mel Gorman	5fb52dd93a	sched/numa: Find an alternative idle CPU if the CPU is part of an active NUMA balance Multiple tasks can attempt to select and idle CPU but fail because numa_migrate_on is already set and the migration fails. Instead of failing, scan for an alternative idle CPU. select_idle_sibling is not used because it requires IRQs to be disabled and it ignores numa_migrate_on allowing multiple tasks to stack. This scan may still fail if there are idle candidate CPUs due to races but if this occurs, it's best that a task stay on an available CPU that move to a contended one. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Phil Auld <pauld@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Link: https://lore.kernel.org/r/20200224095223.13361-12-mgorman@techsingularity.net	2020-02-24 11:36:39 +01:00
Mel Gorman	ff7db0bf24	sched/numa: Prefer using an idle CPU as a migration target instead of comparing tasks task_numa_find_cpu() can scan a node multiple times. Minimally it scans to gather statistics and later to find a suitable target. In some cases, the second scan will simply pick an idle CPU if the load is not imbalanced. This patch caches information on an idle core while gathering statistics and uses it immediately if load is not imbalanced to avoid a second scan of the node runqueues. Preference is given to an idle core rather than an idle SMT sibling to avoid packing HT siblings due to linearly scanning the node cpumask. As a side-effect, even when the second scan is necessary, the importance of using select_idle_sibling is much reduced because information on idle CPUs is cached and can be reused. Note that this patch actually makes is harder to move to an idle CPU as multiple tasks can race for the same idle CPU due to a race checking numa_migrate_on. This is addressed in the next patch. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Phil Auld <pauld@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Link: https://lore.kernel.org/r/20200224095223.13361-11-mgorman@techsingularity.net	2020-02-24 11:36:38 +01:00
Vincent Guittot	070f5e860e	sched/fair: Take into account runnable_avg to classify group Take into account the new runnable_avg signal to classify a group and to mitigate the volatility of util_avg in face of intensive migration or new task with random utilization. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: "Dietmar Eggemann <dietmar.eggemann@arm.com>" Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Phil Auld <pauld@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Link: https://lore.kernel.org/r/20200224095223.13361-10-mgorman@techsingularity.net	2020-02-24 11:36:37 +01:00
Vincent Guittot	9f68395333	sched/pelt: Add a new runnable average signal Now that runnable_load_avg has been removed, we can replace it by a new signal that will highlight the runnable pressure on a cfs_rq. This signal track the waiting time of tasks on rq and can help to better define the state of rqs. At now, only util_avg is used to define the state of a rq: A rq with more that around 80% of utilization and more than 1 tasks is considered as overloaded. But the util_avg signal of a rq can become temporaly low after that a task migrated onto another rq which can bias the classification of the rq. When tasks compete for the same rq, their runnable average signal will be higher than util_avg as it will include the waiting time and we can use this signal to better classify cfs_rqs. The new runnable_avg will track the runnable time of a task which simply adds the waiting time to the running time. The runnable _avg of cfs_rq will be the /Sum of se's runnable_avg and the runnable_avg of group entity will follow the one of the rq similarly to util_avg. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: "Dietmar Eggemann <dietmar.eggemann@arm.com>" Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Phil Auld <pauld@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Link: https://lore.kernel.org/r/20200224095223.13361-9-mgorman@techsingularity.net	2020-02-24 11:36:36 +01:00
Vincent Guittot	0dacee1bfa	sched/pelt: Remove unused runnable load average Now that runnable_load_avg is no more used, we can remove it to make space for a new signal. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: "Dietmar Eggemann <dietmar.eggemann@arm.com>" Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Phil Auld <pauld@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Link: https://lore.kernel.org/r/20200224095223.13361-8-mgorman@techsingularity.net	2020-02-24 11:36:36 +01:00
Mel Gorman	fb86f5b211	sched/numa: Use similar logic to the load balancer for moving between domains with spare capacity The standard load balancer generally tries to keep the number of running tasks or idle CPUs balanced between NUMA domains. The NUMA balancer allows tasks to move if there is spare capacity but this causes a conflict and utilisation between NUMA nodes gets badly skewed. This patch uses similar logic between the NUMA balancer and load balancer when deciding if a task migrating to its preferred node can use an idle CPU. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Phil Auld <pauld@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Link: https://lore.kernel.org/r/20200224095223.13361-7-mgorman@techsingularity.net	2020-02-24 11:36:35 +01:00
Vincent Guittot	6499b1b2dd	sched/numa: Replace runnable_load_avg by load_avg Similarly to what has been done for the normal load balancer, we can replace runnable_load_avg by load_avg in numa load balancing and track the other statistics like the utilization and the number of running tasks to get to better view of the current state of a node. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: "Dietmar Eggemann <dietmar.eggemann@arm.com>" Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Phil Auld <pauld@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Link: https://lore.kernel.org/r/20200224095223.13361-6-mgorman@techsingularity.net	2020-02-24 11:36:34 +01:00
Vincent Guittot	6d4d22468d	sched/fair: Reorder enqueue/dequeue_task_fair path The walk through the cgroup hierarchy during the enqueue/dequeue of a task is split in 2 distinct parts for throttled cfs_rq without any added value but making code less readable. Change the code ordering such that everything related to a cfs_rq (throttled or not) will be done in the same loop. In addition, the same steps ordering is used when updating a cfs_rq: - update_load_avg - update_cfs_group - update *h_nr_running This reordering enables the use of h_nr_running in PELT algorithm. No functional and performance changes are expected and have been noticed during tests. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: "Dietmar Eggemann <dietmar.eggemann@arm.com>" Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Phil Auld <pauld@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Link: https://lore.kernel.org/r/20200224095223.13361-5-mgorman@techsingularity.net	2020-02-24 11:36:34 +01:00
Mel Gorman	b2b2042b20	sched/numa: Distinguish between the different task_numa_migrate() failure cases sched:sched_stick_numa is meant to fire when a task is unable to migrate to the preferred node but from the trace, it's possibile to tell the difference between "no CPU found", "migration to idle CPU failed" and "tasks could not be swapped". Extend the tracepoint accordingly. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> [ Minor edits. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Phil Auld <pauld@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Link: https://lore.kernel.org/r/20200224095223.13361-4-mgorman@techsingularity.net	2020-02-24 11:36:33 +01:00
Mel Gorman	f22aef4afb	sched/numa: Trace when no candidate CPU was found on the preferred node sched:sched_stick_numa is meant to fire when a task is unable to migrate to the preferred node. The case where no candidate CPU could be found is not traced which is an important gap. The tracepoint is not fired when the task is not allowed to run on any CPU on the preferred node or the task is already running on the target CPU but neither are interesting corner cases. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Phil Auld <pauld@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Link: https://lore.kernel.org/r/20200224095223.13361-3-mgorman@techsingularity.net	2020-02-24 11:36:32 +01:00
Ingo Molnar	546121b65f	Linux 5.6-rc3 -----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl5TFjYeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGikYIAIhI4C8R87wyj/0m b2NWk6TZ5AFmiZLYSbsPYxdSC9OLdUmlGFKgL2SyLTwZCiHChm+cNBrngp3hJ6gz x1YH99HdjzkiaLa0hCc2+a/aOt8azGU2RiWEP8rbo0gFSk28wE6FjtzSxR95jyPz FRKo/sM+dHBMFXrthJbr+xHZ1De28MITzS2ddstr/10ojoRgm43I3qo1JKhjoDN5 9GGb6v0Md5eo+XZjjB50CvgF5GhpiqW7+HBB7npMsgTk37GdsR5RlosJ/TScLVC9 dNeanuqk8bqMGM0u2DFYdDqjcqAlYbt8aobuWWCB5xgPBXr5G2nox+IgF/f9G6UH EShA/xs= =OFPc -----END PGP SIGNATURE----- Merge tag 'v5.6-rc3' into sched/core, to pick up fixes and dependent patches Signed-off-by: Ingo Molnar <mingo@kernel.org>	2020-02-24 11:36:09 +01:00
Paul Moore	2ad3e17ebf	audit: fix error handling in audit_data_to_entry() Commit `219ca39427` ("audit: use union for audit_field values since they are mutually exclusive") combined a number of separate fields in the audit_field struct into a single union. Generally this worked just fine because they are generally mutually exclusive. Unfortunately in audit_data_to_entry() the overlap can be a problem when a specific error case is triggered that causes the error path code to attempt to cleanup an audit_field struct and the cleanup involves attempting to free a stored LSM string (the lsm_str field). Currently the code always has a non-NULL value in the audit_field.lsm_str field as the top of the for-loop transfers a value into audit_field.val (both .lsm_str and .val are part of the same union); if audit_data_to_entry() fails and the audit_field struct is specified to contain a LSM string, but the audit_field.lsm_str has not yet been properly set, the error handling code will attempt to free the bogus audit_field.lsm_str value that was set with audit_field.val at the top of the for-loop. This patch corrects this by ensuring that the audit_field.val is only set when needed (it is cleared when the audit_field struct is allocated with kcalloc()). It also corrects a few other issues to ensure that in case of error the proper error code is returned. Cc: stable@vger.kernel.org Fixes: `219ca39427` ("audit: use union for audit_field values since they are mutually exclusive") Reported-by: syzbot+1f4d90ead370d72e450b@syzkaller.appspotmail.com Signed-off-by: Paul Moore <paul@paul-moore.com>	2020-02-22 20:36:47 -05:00
Linus Torvalds	f3cc24942e	Two fixes for the irq core code which are follow ups to the recent MSI fixes: - The WARN_ON which was put into the MSI setaffinity callback for paranoia reasons actually triggered via a callchain which escaped when all the possible ways to reach that code were analyzed. The proc/irq/$N/affinity interfaces have a quirk which came in when ALPHA moved to the generic interface: In case that the written affinity mask does not contain any online CPU it calls into ALPHAs magic auto affinity setting code. A few years later this mechanism was also made available to x86 for no good reasons and in a way which circumvents all sanity checks for interrupts which cannot have their affinity set from process context on X86 due to the way the X86 interrupt delivery works. It would be possible to make this work properly, but there is no point in doing so. If the interrupt is not yet started then the affinity setting has no effect and if it is started already then it is already assigned to an online CPU so there is no point to randomly move it to some other CPU. Just return EINVAL as the code has done before that change forever. - The new MSI quirk bit in the irq domain flags turned out to be already occupied, which escaped the author and the reviewers because the already in use bits were 0,6,2,3,4,5 listed in that order. That bit 6 was simply overlooked because the ordering was straight forward linear otherwise. So the new bit ended up being a duplicate. Fix it up by switching the oddball 6 to the obvious 1. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl5Rh+sTHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoQ8gD/0aO/tFJrRLmcSlS5r8r70dtHMHIhcq L6bUkg0GKsud6oLVFEhU43K7aWXZkTEqm1bIjX9x0FnMHIYsASmkHyloP/8OBxoA VOG6Q1THxklge/YLBLz7I2NbZNKs/w+WkRKl63FNV9LmdtxtQblYc67CkaQVXiwC tHuijaKOwT/t4V0+HNRVX15FApiDWOSguwcDNigUwk03Uo+hmJsPXuGtJfvyVAf6 Oa00sbvy/XV3qNr2Zm01Pb4osL4FOwCcGuyoeXMZqSvlRbxgT2qz0TzYObqKzZb5 pvZj7coaxKSimNZsVFQ6mluCp9IvFpZOnp1FXcFrzbbFPxEr0G/IMWgVmgf2s3w/ 8XjFCuF33J4mVz36/HYzR2ieH5FsVcGJQR96ZOHArONmszmbun7D7fC3j8RJlkKC 9QpYDcHK17xbLdhR8YanEIDl5QQT/C/qEOdLWrgON7uWEZLs5Y+u39x479MfknyV fYQqi7CC6qUTxQhrt29puaWy8WQvrG4GzWV+vS9d+8v0FIK2KcbA66p/c7UioV3r F9PhxfneSYjswvLhAPmJCM4PSYWvrYZcoGkT+/OzmPXcm+r+Tg4Stgac0EVmhvq5 rBpenELnSuK1MGiFXzL0DKlyNJhpj+UdeAuCoUzConfBMpMwQdvPXpmnT73CQtRb pOgBI4qB3YgP4Q== =jvVI -----END PGP SIGNATURE----- Merge tag 'irq-urgent-2020-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Thomas Gleixner: "Two fixes for the irq core code which are follow ups to the recent MSI fixes: - The WARN_ON which was put into the MSI setaffinity callback for paranoia reasons actually triggered via a callchain which escaped when all the possible ways to reach that code were analyzed. The proc/irq/$N/affinity interfaces have a quirk which came in when ALPHA moved to the generic interface: In case that the written affinity mask does not contain any online CPU it calls into ALPHAs magic auto affinity setting code. A few years later this mechanism was also made available to x86 for no good reasons and in a way which circumvents all sanity checks for interrupts which cannot have their affinity set from process context on X86 due to the way the X86 interrupt delivery works. It would be possible to make this work properly, but there is no point in doing so. If the interrupt is not yet started then the affinity setting has no effect and if it is started already then it is already assigned to an online CPU so there is no point to randomly move it to some other CPU. Just return EINVAL as the code has done before that change forever. - The new MSI quirk bit in the irq domain flags turned out to be already occupied, which escaped the author and the reviewers because the already in use bits were 0,6,2,3,4,5 listed in that order. That bit 6 was simply overlooked because the ordering was straight forward linear otherwise. So the new bit ended up being a duplicate. Fix it up by switching the oddball 6 to the obvious 1" * tag 'irq-urgent-2020-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq/irqdomain: Make sure all irq domain flags are distinct genirq/proc: Reject invalid affinity masks (again)	2020-02-22 17:25:46 -08:00
Linus Torvalds	591dd4c101	s390 updates for 5.6-rc3 - Remove ieee_emulation_warnings sysctl which is a dead code. - Avoid triggering rebuild of the kernel during make install. - Enable protected virtualization guest support in default configs. - Fix cio_ignore seq_file .next function to increase position index. And use kobj_to_dev instead of container_of in cio code. - Fix storage block address lists to contain absolute addresses in qdio code. - Few clang warnings and spelling fixes. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAl5RI0sACgkQjYWKoQLX FBig0Af/bJjFspbGFN+YEO+XM3AbURu0lHxq25r0EoSEc9acfS5nMrUk7OhNktND w+m/PoyOg8elB3LDkalz9IP6CZANSSoGV5aKyg3Wp06Wu//DdwAMJWqFOyosJ89z zVOSGgX3zHFarodS4suDPTEdDEcSMVVvqnz1A81dQD9QNWyVCutqYABj3OEittOD EBYchT7Qs/ObwbP+RaUa0swbrq11P8hZSAblbjkbuFpw45CdQ6rFoyJO1Tccuyh7 puNlPLpkDI15RNxC+vh+4NItcxPcIL+Qa0kltbQJLq7aq/UqLNF02vLjWluH4hG6 i6+ZzsIra5gdtdf2cPFiKo7EagpQ2w== =CRBt -----END PGP SIGNATURE----- Merge tag 's390-5.6-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Vasily Gorbik: - Remove ieee_emulation_warnings sysctl which is a dead code. - Avoid triggering rebuild of the kernel during make install. - Enable protected virtualization guest support in default configs. - Fix cio_ignore seq_file .next function to increase position index. And use kobj_to_dev instead of container_of in cio code. - Fix storage block address lists to contain absolute addresses in qdio code. - Few clang warnings and spelling fixes. * tag 's390-5.6-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/qdio: fill SBALEs with absolute addresses s390/qdio: fill SL with absolute addresses s390: remove obsolete ieee_emulation_warnings s390: make 'install' not depend on vmlinux s390/kaslr: Fix casts in get_random s390/mm: Explicitly compare PAGE_DEFAULT_KEY against zero in storage_key_init_range s390/pkey/zcrypt: spelling s/crytp/crypt/ s390/cio: use kobj_to_dev() API s390/defconfig: enable CONFIG_PROTECTED_VIRTUALIZATION_GUEST s390/cio: cio_ignore_proc_seq_next should increase position index	2020-02-22 10:43:41 -08:00
Daniel Jordan	41ccdbfd54	padata: fix uninitialized return value in padata_replace() According to Geert's report[0], kernel/padata.c: warning: 'err' may be used uninitialized in this function [-Wuninitialized]: => 539:2 Warning is seen only with older compilers on certain archs. The runtime effect is potentially returning garbage down the stack when padata's cpumasks are modified before any pcrypt requests have run. Simplest fix is to initialize err to the success value. [0] http://lkml.kernel.org/r/20200210135506.11536-1-geert@linux-m68k.org Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Fixes: `bbefa1dd6a` ("crypto: pcrypt - Avoid deadlock by using per-instance padata queues") Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: linux-crypto@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2020-02-22 09:25:42 +08:00
David S. Miller	b105e8e281	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2020-02-21 The following pull-request contains BPF updates for your net-next tree. We've added 25 non-merge commits during the last 4 day(s) which contain a total of 33 files changed, 2433 insertions(+), 161 deletions(-). The main changes are: 1) Allow for adding TCP listen sockets into sock_map/hash so they can be used with reuseport BPF programs, from Jakub Sitnicki. 2) Add a new bpf_program__set_attach_target() helper for adding libbpf support to specify the tracepoint/function dynamically, from Eelco Chaudron. 3) Add bpf_read_branch_records() BPF helper which helps use cases like profile guided optimizations, from Daniel Xu. 4) Enable bpf_perf_event_read_value() in all tracing programs, from Song Liu. 5) Relax BTF mandatory check if only used for libbpf itself e.g. to process BTF defined maps, from Andrii Nakryiko. 6) Move BPF selftests -mcpu compilation attribute from 'probe' to 'v3' as it has been observed that former fails in envs with low memlock, from Yonghong Song. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-21 15:22:45 -08:00
Jakub Sitnicki	035ff358f2	net: Generate reuseport group ID on group creation Commit `736b46027e` ("net: Add ID (if needed) to sock_reuseport and expose reuseport_lock") has introduced lazy generation of reuseport group IDs that survive group resize. By comparing the identifier we check if BPF reuseport program is not trying to select a socket from a BPF map that belongs to a different reuseport group than the one the packet is for. Because SOCKARRAY used to be the only BPF map type that can be used with reuseport BPF, it was possible to delay the generation of reuseport group ID until a socket from the group was inserted into BPF map for the first time. Now that SOCK{MAP,HASH} can be used with reuseport BPF we have two options, either generate the reuseport ID on map update, like SOCKARRAY does, or allocate an ID from the start when reuseport group gets created. This patch takes the latter approach to keep sockmap free of calls into reuseport code. This streamlines the reuseport_id access as its lifetime now matches the longevity of reuseport object. The cost of this simplification, however, is that we allocate reuseport IDs for all SO_REUSEPORT users. Even those that don't use SOCKARRAY in their setups. With the way identifiers are currently generated, we can have at most S32_MAX reuseport groups, which hopefully is sufficient. If we ever get close to the limit, we can switch an u64 counter like sk_cookie. Another change is that we now always call into SOCKARRAY logic to unlink the socket from the map when unhashing or closing the socket. Previously we did it only when at least one socket from the group was in a BPF map. It is worth noting that this doesn't conflict with sockmap tear-down in case a socket is in a SOCK{MAP,HASH} and belongs to a reuseport group. sockmap tear-down happens first: prot->unhash `- tcp_bpf_unhash \|- tcp_bpf_remove \| `- while (sk_psock_link_pop(psock)) \| `- sk_psock_unlink \| `- sock_map_delete_from_link \| `- __sock_map_delete \| `- sock_map_unref \| `- sk_psock_put \| `- sk_psock_drop \| `- rcu_assign_sk_user_data(sk, NULL) `- inet_unhash `- reuseport_detach_sock `- bpf_sk_reuseport_detach `- WRITE_ONCE(sk->sk_user_data, NULL) Suggested-by: Martin Lau <kafai@fb.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200218171023.844439-10-jakub@cloudflare.com	2020-02-21 22:29:45 +01:00
Jakub Sitnicki	9fed9000c5	bpf: Allow selecting reuseport socket from a SOCKMAP/SOCKHASH SOCKMAP & SOCKHASH now support storing references to listening sockets. Nothing keeps us from using these map types a collection of sockets to select from in BPF reuseport programs. Whitelist the map types with the bpf_sk_select_reuseport helper. The restriction that the socket has to be a member of a reuseport group still applies. Sockets in SOCKMAP/SOCKHASH that don't have sk_reuseport_cb set are not a valid target and we signal it with -EINVAL. The main benefit from this change is that, in contrast to REUSEPORT_SOCKARRAY, SOCK{MAP,HASH} don't impose a restriction that a listening socket can be just one BPF map at the same time. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200218171023.844439-9-jakub@cloudflare.com	2020-02-21 22:29:45 +01:00
Linus Torvalds	3dc55dba67	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from David Miller: 1) Limit xt_hashlimit hash table size to avoid OOM or hung tasks, from Cong Wang. 2) Fix deadlock in xsk by publishing global consumer pointers when NAPI is finished, from Magnus Karlsson. 3) Set table field properly to RT_TABLE_COMPAT when necessary, from Jethro Beekman. 4) NLA_STRING attributes are not necessary NULL terminated, deal wiht that in IFLA_ALT_IFNAME. From Eric Dumazet. 5) Fix checksum handling in atlantic driver, from Dmitry Bezrukov. 6) Handle mtu==0 devices properly in wireguard, from Jason A. Donenfeld. 7) Fix several lockdep warnings in bonding, from Taehee Yoo. 8) Fix cls_flower port blocking, from Jason Baron. 9) Sanitize internal map names in libbpf, from Toke Høiland-Jørgensen. 10) Fix RDMA race in qede driver, from Michal Kalderon. 11) Fix several false lockdep warnings by adding conditions to list_for_each_entry_rcu(), from Madhuparna Bhowmik. 12) Fix sleep in atomic in mlx5 driver, from Huy Nguyen. 13) Fix potential deadlock in bpf_map_do_batch(), from Yonghong Song. 14) Hey, variables declared in switch statement before any case statements are not initialized. I learn something every day. Get rids of this stuff in several parts of the networking, from Kees Cook. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (99 commits) bnxt_en: Issue PCIe FLR in kdump kernel to cleanup pending DMAs. bnxt_en: Improve device shutdown method. net: netlink: cap max groups which will be considered in netlink_bind() net: thunderx: workaround BGX TX Underflow issue ionic: fix fw_status read net: disable BRIDGE_NETFILTER by default net: macb: Properly handle phylink on at91rm9200 s390/qeth: fix off-by-one in RX copybreak check s390/qeth: don't warn for napi with 0 budget s390/qeth: vnicc Fix EOPNOTSUPP precedence openvswitch: Distribute switch variables for initialization net: ip6_gre: Distribute switch variables for initialization net: core: Distribute switch variables for initialization udp: rehash on disconnect net/tls: Fix to avoid gettig invalid tls record bpf: Fix a potential deadlock with bpf_map_do_batch bpf: Do not grab the bucket spinlock by default on htab batch ops ice: Wait for VF to be reset/ready before configuration ice: Don't tell the OS that link is going down ice: Don't reject odd values of usecs set by user ...	2020-02-21 11:59:51 -08:00
Arnd Bergmann	412c53a680	y2038: remove unused time32 interfaces No users remain, so kill these off before we grow new ones. Link: http://lkml.kernel.org/r/20200110154232.4104492-3-arnd@arndb.de Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-02-21 11:22:15 -08:00
Paul E. McKenney	9470a18fab	rcutorture: Manually clean up after rcu_barrier() failure Currently, if rcu_barrier() returns too soon, the test waits 100ms and then does another instance of the test. However, if rcu_barrier() were to have waited for more than 100ms too short a time, this could cause the test's rcu_head structures to be reused while they were still on RCU's callback lists. This can result in knock-on errors that obscure the original rcu_barrier() test failure. This commit therefore adds code that attempts to wait until all of the test's callbacks have been invoked. Of course, if RCU completely lost track of the corresponding rcu_head structures, this wait could be forever. This commit therefore also complains if this attempted recovery takes more than one second, and it also gives up when the test ends. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 16:03:31 -08:00
Paul E. McKenney	50d4b62970	rcutorture: Make rcu_torture_barrier_cbs() post from corresponding CPU Currently, rcu_torture_barrier_cbs() posts callbacks from whatever CPU it is running on, which means that all these kthreads might well be posting from the same CPU, which would drastically reduce the effectiveness of this test. This commit therefore uses IPIs to make the callbacks be posted from the corresponding CPU (given by local variable myid). If the IPI fails (which can happen if the target CPU is offline or does not exist at all), the callback is posted on whatever CPU is currently running. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 16:03:31 -08:00
Joel Fernandes (Google)	12af660321	rcuperf: Measure memory footprint during kfree_rcu() test During changes to kfree_rcu() code, we often check the amount of free memory. As an alternative to checking this manually, this commit adds a measurement in the test itself. It measures four times during the test for available memory, digitally filters these measurements to produce a running average with a weight of 0.5, and compares this digitally filtered value with the amount of available memory at the beginning of the test. Something like the following is printed at the end of the run: Total time taken by all kfree'ers: 6369738407 ns, loops: 10000, batches: 764, memory footprint: 216MB Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 16:03:31 -08:00
Paul E. McKenney	5396d31d3a	rcutorture: Annotation lockless accesses to rcu_torture_current The rcutorture global variable rcu_torture_current is accessed locklessly, so it must use the RCU pointer load/store primitives. This commit therefore adds several that were missed. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely and due to this being used only by rcutorture. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 16:03:31 -08:00
Paul E. McKenney	f042a436c8	rcutorture: Add READ_ONCE() to rcu_torture_count and rcu_torture_batch The rcutorture rcu_torture_count and rcu_torture_batch per-CPU variables are read locklessly, so this commit adds the READ_ONCE() to a load in order to avoid various types of compiler vandalism^Woptimization. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely and due to this being rcutorture. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 16:03:31 -08:00
Paul E. McKenney	102c14d2f8	rcutorture: Fix stray access to rcu_fwd_cb_nodelay The rcu_fwd_cb_nodelay variable suppresses excessively long read-side delays while carrying out an rcutorture forward-progress test. As such, it is accessed both by readers and updaters, and most of the accesses therefore use *_ONCE(). Except for one in rcu_read_delay(), which this commit fixes. This data race was reported by KCSAN. Not appropriate for backporting due to this being rcutorture. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 16:03:31 -08:00
Paul E. McKenney	202489101f	rcutorture: Fix rcu_torture_one_read()/rcu_torture_writer() data race The ->rtort_pipe_count field in the rcu_torture structure checks for too-short grace periods, and is therefore read by rcutorture's readers while being updated by rcutorture's writers. This commit therefore adds the needed READ_ONCE() and WRITE_ONCE() invocations. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely and due to this being rcutorture. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 16:03:31 -08:00
Paul E. McKenney	8171d3e0da	torture: Allow disabling of boottime CPU-hotplug torture operations In theory, RCU-hotplug operations are supposed to work as soon as there is more than one CPU online. However, in practice, in normal production there is no way to make them happen until userspace is up and running. Besides which, on smaller systems, rcutorture doesn't start doing hotplug operations until 30 seconds after the start of boot, which on most systems also means the better part of 30 seconds after the end of boot. This commit therefore provides a new torture.disable_onoff_at_boot kernel boot parameter that suppresses CPU-hotplug torture operations until about the time that init is spawned. Of course, if you know of a need for boottime CPU-hotplug operations, then you should avoid passing this argument to any of the torture tests. You might also want to look at the splats linked to below. Link: https://lore.kernel.org/lkml/20191206185208.GA25636@paulmck-ThinkPad-P72/ Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 16:03:30 -08:00
Paul E. McKenney	4ab00bdd99	rcutorture: Suppress boottime bad-sequence warnings In normal production, an excessively long wait on a grace period (synchronize_rcu(), for example) at boottime is often just as bad as at any other time. In fact, given the desire for fast boot, any sort of long wait at boot is a bad idea. However, heavy rcutorture testing on large hyperthreaded systems can generate such long waits during boot as a matter of course. This commit therefore causes the rcupdate.rcu_cpu_stall_suppress_at_boot kernel boot parameter to suppress reporting of bootime bad-sequence warning due to excessively long grace-period waits. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 16:03:30 -08:00
Paul E. McKenney	58c53360b3	rcutorture: Allow boottime stall warnings to be suppressed In normal production, an RCU CPU stall warning at boottime is often just as bad as at any other time. In fact, given the desire for fast boot, any sort of long-term stall at boot is a bad idea. However, heavy rcutorture testing on large hyperthreaded systems can generate boottime RCU CPU stalls as a matter of course. This commit therefore provides a kernel boot parameter that suppresses reporting of boottime RCU CPU stall warnings and similarly of rcutorture writer stalls. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 16:03:30 -08:00
Paul E. McKenney	a59ee765a6	torture: Forgive -EBUSY from boottime CPU-hotplug operations During boot, CPU hotplug is often disabled, for example by PCI probing. On large systems that take substantial time to boot, this can result in spurious RCU_HOTPLUG errors. This commit therefore forgives any boottime -EBUSY CPU-hotplug failures by adjusting counters to pretend that the corresponding attempt never happened. A non-splat record of the failed attempt is emitted to the console with the added string "(-EBUSY forgiven during boot)". Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 16:03:30 -08:00
Paul E. McKenney	435508095a	rcutorture: Refrain from callback flooding during boot Additional rcutorture aggression can result in, believe it or not, boot times in excess of three minutes on large hyperthreaded systems. This is long enough for rcutorture to decide to do some callback flooding, which seems a bit excessive given that userspace cannot have started until long after boot, and it is userspace that does the real-world callback flooding. Worse yet, because Tiny RCU lacks forward-progress functionality, the looping-in-the-kernel tests can also be problematic during early boot. This commit therefore causes rcutorture to hold off on callback flooding until about the time that init is spawned, and the same for looping-in-the-kernel tests for Tiny RCU. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 16:03:30 -08:00
Paul E. McKenney	59ee0326cc	rcutorture: Suppress forward-progress complaints during early boot Some larger systems can take in excess of 50 seconds to complete their early boot initcalls prior to spawing init. This does not in any way help the forward-progress judgments of built-in rcutorture (when rcutorture is built as a module, the insmod or modprobe command normally cannot happen until some time after boot completes). This commit therefore suppresses such complaints until about the time that init is spawned. This also includes a fix to a stupid error located by kbuild test robot. [ paulmck: Apply kbuild test robot feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> [ paulmck: Fix to nohz_full slow-expediting recovery logic, per bpetkov. ] [ paulmck: Restrict splat to CONFIG_PREEMPT_RT=y kernels and simplify. ] Tested-by: Borislav Petkov <bp@alien8.de>	2020-02-20 16:03:30 -08:00
Paul E. McKenney	710426068d	srcu: Hold srcu_struct ->lock when updating ->srcu_gp_seq A read of the srcu_struct structure's ->srcu_gp_seq field should not need READ_ONCE() when that structure's ->lock is held. Except that this lock is not always held when updating this field. This commit therefore acquires the lock around updates and removes a now-unneeded READ_ONCE(). This data race was reported by KCSAN. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> [ paulmck: Switch from READ_ONCE() to lock per Peter Zilstra question. ] Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>	2020-02-20 16:01:11 -08:00
Paul E. McKenney	39f91504a0	srcu: Fix process_srcu()/srcu_batches_completed() datarace The srcu_struct structure's ->srcu_idx field is accessed locklessly, so reads must use READ_ONCE(). This commit therefore adds the needed READ_ONCE() invocation where it was missed. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 16:01:11 -08:00
Paul E. McKenney	8c9e0cb323	srcu: Fix __call_srcu()/srcu_get_delay() datarace The srcu_struct structure's ->srcu_gp_seq_needed_exp field is accessed locklessly, so updates must use WRITE_ONCE(). This commit therefore adds the needed WRITE_ONCE() invocations. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 16:01:11 -08:00
Paul E. McKenney	7ff8b4502b	srcu: Fix __call_srcu()/process_srcu() datarace The srcu_node structure's ->srcu_gp_seq_needed_exp field is accessed locklessly, so updates must use WRITE_ONCE(). This commit therefore adds the needed WRITE_ONCE() invocations. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 16:01:11 -08:00
Jules Irenge	90ba11ba99	rcu: Add missing annotation for exit_tasks_rcu_finish() Sparse reports a warning at exit_tasks_rcu_finish(void) \|warning: context imbalance in exit_tasks_rcu_finish() \|- wrong count at exit To fix this, this commit adds a __releases(&tasks_rcu_exit_srcu). Given that exit_tasks_rcu_finish() does actually call __srcu_read_lock(), this not only fixes the warning but also improves on the readability of the code. Signed-off-by: Jules Irenge <jbi.octave@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>	2020-02-20 16:00:45 -08:00
Jules Irenge	e1e9bdc00a	rcu: Add missing annotation for exit_tasks_rcu_start() Sparse reports a warning at exit_tasks_rcu_start(void) \|warning: context imbalance in exit_tasks_rcu_start() - wrong count at exit To fix this, this commit adds an __acquires(&tasks_rcu_exit_srcu). Given that exit_tasks_rcu_start() does actually call __srcu_read_lock(), this not only fixes the warning but also improves on the readability of the code. Signed-off-by: Jules Irenge <jbi.octave@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>	2020-02-20 16:00:45 -08:00
Paul E. McKenney	fcb7381265	rcu-tasks: *_ONCE() for rcu_tasks_cbs_head The RCU tasks list of callbacks, rcu_tasks_cbs_head, is sampled locklessly by rcu_tasks_kthread() when waiting for work to do. This commit therefore applies READ_ONCE() to that lockless sampling and WRITE_ONCE() to the single potential store outside of rcu_tasks_kthread. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 16:00:45 -08:00
Paul E. McKenney	b692dc4adf	rcu: Update __call_rcu() comments The __call_rcu() function's header comment refers to a cpu argument that no longer exists, and the comment of the return path from rcu_nocb_try_bypass() ignores the non-no-CBs CPU case. This commit therefore update both comments. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 16:00:20 -08:00
Colin Ian King	aa96a93ba2	rcu: Fix spelling mistake "leval" -> "level" This commit fixes a spelling mistake in a pr_info() message. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 16:00:20 -08:00
Paul E. McKenney	8c14263d35	rcu: React to callback overload by boosting RCU readers RCU priority boosting currently is not applied until the grace period is at least 250 milliseconds old (or the number of milliseconds specified by the CONFIG_RCU_BOOST_DELAY Kconfig option). Although this has worked well, it can result in OOM under conditions of RCU callback flooding. One can argue that the real-time systems using RCU priority boosting should carefully avoid RCU callback flooding, but one can just as well argue that an OOM is a rather obnoxious error message. This commit therefore disables the RCU priority boosting delay when there are excessive numbers of callbacks queued. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 16:00:20 -08:00
Paul E. McKenney	b2b00ddf19	rcu: React to callback overload by aggressively seeking quiescent states In default configutions, RCU currently waits at least 100 milliseconds before asking cond_resched() and/or resched_rcu() for help seeking quiescent states to end a grace period. But 100 milliseconds can be one good long time during an RCU callback flood, for example, as can happen when user processes repeatedly open and close files in a tight loop. These 100-millisecond gaps in successive grace periods during a callback flood can result in excessive numbers of callbacks piling up, unnecessarily increasing memory footprint. This commit therefore asks cond_resched() and/or resched_rcu() for help as early as the first FQS scan when at least one of the CPUs has more than 20,000 callbacks queued, a number that can be changed using the new rcutree.qovld kernel boot parameter. An auxiliary qovld_calc variable is used to avoid acquisition of locks that have not yet been initialized. Early tests indicate that this reduces the RCU-callback memory footprint during rcutorture floods by from 50% to 4x, depending on configuration. Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reported-by: Tejun Heo <tj@kernel.org> [ paulmck: Fix bug located by Qian Cai. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Dexuan Cui <decui@microsoft.com> Tested-by: Qian Cai <cai@lca.pw>	2020-02-20 16:00:20 -08:00
Paul E. McKenney	b5ea03709d	rcu: Clear ->core_needs_qs at GP end or self-reported QS The rcu_data structure's ->core_needs_qs field does not necessarily get cleared in a timely fashion after the corresponding CPUs' quiescent state has been reported. From a functional viewpoint, no harm done, but this can result in excessive invocation of RCU core processing, as witnessed by the kernel test robot, which saw greatly increased softirq overhead. This commit therefore restores the rcu_report_qs_rdp() function's clearing of this field, but only when running on the corresponding CPU. Cases where some other CPU reports the quiescent state (for example, on behalf of an idle CPU) are handled by setting this field appropriately within the __note_gp_changes() function's end-of-grace-period checks. This handling is carried out regardless of whether the end of a grace period actually happened, thus handling the case where a CPU goes non-idle after a quiescent state is reported on its behalf, but before the grace period ends. This fix also avoids cross-CPU updates to ->core_needs_qs, While in the area, this commit changes the __note_gp_changes() need_gp variable's name to need_qs because it is a quiescent state that is needed from the CPU in question. Fixes: `ed93dfc6bc` ("rcu: Confine ->core_needs_qs accesses to the corresponding CPU") Reported-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 16:00:20 -08:00
Paul E. McKenney	28e09a2e48	locktorture: Forgive apparent unfairness if CPU hotplug If CPU hotplug testing is enabled, a lock might appear to be maximally unfair just because one of the CPUs was offline almost all the time. This commit therefore forgives unfairness if CPU hotplug testing was enabled. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 15:59:59 -08:00
Paul E. McKenney	c0e1472d80	locktorture: Use private random-number generators Both lock_torture_writer() and lock_torture_reader() use the "static" keyword on their DEFINE_TORTURE_RANDOM(rand) declarations, which means that a single instance of a random-number generator are shared among all the writers and another is shared among all the readers. Unfortunately, this random-number generator was not designed for concurrent access. This commit therefore removes both "static" keywords so that each reader and each writer gets its own random-number generator. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 15:59:59 -08:00
Paul E. McKenney	80c503e0e6	locktorture: Print ratio of acquisitions, not failures The __torture_print_stats() function in locktorture.c carefully initializes local variable "min" to statp[0].n_lock_acquired, but then compares it to statp[i].n_lock_fail. Given that the .n_lock_fail field should normally be zero, and given the initialization, it seems reasonable to display the maximum and minimum number acquisitions instead of miscomputing the maximum and minimum number of failures. This commit therefore switches from failures to acquisitions. And this turns out to be not only a day-zero bug, but entirely my own fault. I hate it when that happens! Fixes: `0af3fe1efa` ("locktorture: Add a lock-torture kernel module") Reported-by: Will Deacon <will@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Will Deacon <will@kernel.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Peter Zijlstra <peterz@infradead.org>	2020-02-20 15:59:59 -08:00
Uladzislau Rezki (Sony)	613707929b	rcu: Add a trace event for kfree_rcu() use of kfree_bulk() The event is given three parameters, first one is the name of RCU flavour, second one is the number of elements in array for free and last one is an address of the array holding pointers to be freed by the kfree_bulk() function. To enable the trace event your kernel has to be build with CONFIG_RCU_TRACE=y, after that it is possible to track the events using ftrace subsystem. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 15:58:51 -08:00
Uladzislau Rezki (Sony)	34c8817455	rcu: Support kfree_bulk() interface in kfree_rcu() The kfree_rcu() logic can be improved further by using kfree_bulk() interface along with "basic batching support" introduced earlier. The are at least two advantages of using "bulk" interface: - in case of large number of kfree_rcu() requests kfree_bulk() reduces the per-object overhead caused by calling kfree() per-object. - reduces the number of cache-misses due to "pointer chasing" between objects which can be far spread between each other. This approach defines a new kfree_rcu_bulk_data structure that stores pointers in an array with a specific size. Number of entries in that array depends on PAGE_SIZE making kfree_rcu_bulk_data structure to be exactly one page. Since it deals with "block-chain" technique there is an extra need in dynamic allocation when a new block is required. Memory is allocated with GFP_NOWAIT \| __GFP_NOWARN flags, i.e. that allows to skip direct reclaim under low memory condition to prevent stalling and fails silently under high memory pressure. The "emergency path" gets maintained when a system is run out of memory. In that case objects are linked into regular list. The "rcuperf" was run to analyze this change in terms of memory consumption and kfree_bulk() throughput. 1) Testing on the Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz, 12xCPUs with following parameters: kfree_loops=200000 kfree_alloc_num=1000 kfree_rcu_test=1 kfree_vary_obj_size=1 dev.2020.01.10a branch Default / CONFIG_SLAB 53607352517 ns, loops: 200000, batches: 1885, memory footprint: 1248MB 53529637912 ns, loops: 200000, batches: 1921, memory footprint: 1193MB 53570175705 ns, loops: 200000, batches: 1929, memory footprint: 1250MB Patch / CONFIG_SLAB 23981587315 ns, loops: 200000, batches: 810, memory footprint: 1219MB 23879375281 ns, loops: 200000, batches: 822, memory footprint: 1190MB 24086841707 ns, loops: 200000, batches: 794, memory footprint: 1380MB Default / CONFIG_SLUB 51291025022 ns, loops: 200000, batches: 1713, memory footprint: 741MB 51278911477 ns, loops: 200000, batches: 1671, memory footprint: 719MB 51256183045 ns, loops: 200000, batches: 1719, memory footprint: 647MB Patch / CONFIG_SLUB 50709919132 ns, loops: 200000, batches: 1618, memory footprint: 456MB 50736297452 ns, loops: 200000, batches: 1633, memory footprint: 507MB 50660403893 ns, loops: 200000, batches: 1628, memory footprint: 429MB in case of CONFIG_SLAB there is double increase in performance and slightly higher memory usage. As for CONFIG_SLUB, the performance figures are better together with lower memory usage. 2) Testing on the HiKey-960, arm64, 8xCPUs with below parameters: CONFIG_SLAB=y kfree_loops=200000 kfree_alloc_num=1000 kfree_rcu_test=1 102898760401 ns, loops: 200000, batches: 5822, memory footprint: 158MB 89947009882 ns, loops: 200000, batches: 6715, memory footprint: 115MB rcuperf shows approximately ~12% better throughput in case of using "bulk" interface. The "drain logic" or its RCU callback does the work faster that leads to better throughput. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 15:58:51 -08:00
Paul E. McKenney	3d05031ae6	rcu: Make nocb_gp_wait() double-check unexpected-callback warning Currently, nocb_gp_wait() unconditionally complains if there is a callback not already associated with a grace period. This assumes that either there was no such callback initially on the one hand, or that the rcu_advance_cbs() function assigned all such callbacks to a grace period on the other. However, in theory there are some situations that would prevent rcu_advance_cbs() from assigning all of the callbacks. This commit therefore checks for unassociated callbacks immediately after rcu_advance_cbs() returns, while the corresponding rcu_node structure's ->lock is still held. If there are unassociated callbacks at that point, the subsequent WARN_ON_ONCE() is disabled. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 15:58:23 -08:00
Paul E. McKenney	13817dd589	rcu: Tighten rcu_lockdep_assert_cblist_protected() check The ->nocb_lock lockdep assertion is currently guarded by cpu_online(), which is incorrect for no-CBs CPUs, whose callback lists must be protected by ->nocb_lock regardless of whether or not the corresponding CPU is online. This situation could result in failure to detect bugs resulting from failing to hold ->nocb_lock for offline CPUs. This commit therefore removes the cpu_online() guard. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 15:58:23 -08:00
Paul E. McKenney	faa059c397	rcu: Optimize and protect atomic_cmpxchg() loop This commit reworks the atomic_cmpxchg() loop in rcu_eqs_special_set() to do only the initial read from the current CPU's rcu_data structure's ->dynticks field explicitly. On subsequent passes, this value is instead retained from the failing atomic_cmpxchg() operation. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 15:58:23 -08:00
Jules Irenge	92c0b889f2	rcu/nocb: Add missing annotation for rcu_nocb_bypass_unlock() Sparse reports warning at rcu_nocb_bypass_unlock() warning: context imbalance in rcu_nocb_bypass_unlock() - unexpected unlock The root cause is a missing annotation of rcu_nocb_bypass_unlock() which causes the warning. This commit therefore adds the missing __releases(&rdp->nocb_bypass_lock) annotation. Signed-off-by: Jules Irenge <jbi.octave@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Boqun Feng <boqun.feng@gmail.com>	2020-02-20 15:58:23 -08:00
Jules Irenge	9ced454807	rcu: Add missing annotation for rcu_nocb_bypass_lock() Sparse reports warning at rcu_nocb_bypass_lock() \|warning: context imbalance in rcu_nocb_bypass_lock() - wrong count at exit To fix this, this commit adds an __acquires(&rdp->nocb_bypass_lock). Given that rcu_nocb_bypass_lock() does actually call raw_spin_lock() when raw_spin_trylock() fails, this not only fixes the warning but also improves on the readability of the code. Signed-off-by: Jules Irenge <jbi.octave@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 15:58:23 -08:00
Paul E. McKenney	5648d65912	rcu: Don't flag non-starting GPs before GP kthread is running Currently rcu_check_gp_start_stall() complains if a grace period takes too long to start, where "too long" is roughly one RCU CPU stall-warning interval. This has worked well, but there are some debugging Kconfig options (such as CONFIG_EFI_PGT_DUMP=y) that can make booting take a very long time, so much so that the stall-warning interval has expired before RCU's grace-period kthread has even been spawned. This commit therefore resets the rcu_state.gp_req_activity and rcu_state.gp_activity timestamps just before the grace-period kthread is spawned, and modifies the checks and adds ordering to ensure that if rcu_check_gp_start_stall() sees that the grace-period kthread has been spawned, that it will also see the resets applied to the rcu_state.gp_req_activity and rcu_state.gp_activity timestamps. Reported-by: Qian Cai <cai@lca.pw> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> [ paulmck: Fix whitespace issues reported by Qian Cai. ] Tested-by: Qian Cai <cai@lca.pw> [ paulmck: Simplify grace-period wakeup check per Steve Rostedt feedback. ]	2020-02-20 15:58:23 -08:00
Paul E. McKenney	aa24f93753	rcu: Fix rcu_barrier_callback() race condition The rcu_barrier_callback() function does an atomic_dec_and_test(), and if it is the last CPU to check in, does the required wakeup. Either way, it does an event trace. Unfortunately, this is susceptible to the following sequence of events: o CPU 0 invokes rcu_barrier_callback(), but atomic_dec_and_test() says that it is not last. But at this point, CPU 0 is delayed, perhaps due to an NMI, SMI, or vCPU preemption. o CPU 1 invokes rcu_barrier_callback(), and atomic_dec_and_test() says that it is last. So CPU 1 traces completion and does the needed wakeup. o The awakened rcu_barrier() function does cleanup and releases rcu_state.barrier_mutex. o Another CPU now acquires rcu_state.barrier_mutex and starts another round of rcu_barrier() processing, including updating rcu_state.barrier_sequence. o CPU 0 gets its act back together and does its tracing. Except that rcu_state.barrier_sequence has already been updated, so its tracing is incorrect and probably quite confusing. (Wait! Why did this CPU check in twice for one rcu_barrier() invocation???) This commit therefore causes rcu_barrier_callback() to take a snapshot of the value of rcu_state.barrier_sequence before invoking atomic_dec_and_test(), thus guaranteeing that the event-trace output is sensible, even if the timing of the event-trace output might still be confusing. (Wait! Why did the old rcu_barrier() complete before all of its CPUs checked in???) But being that this is RCU, only so much confusion can reasonably be eliminated. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely and due to the mild consequences of the failure, namely a confusing event trace. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 15:58:23 -08:00
Paul E. McKenney	59881bcd85	rcu: Add WRITE_ONCE() to rcu_state ->gp_start The rcu_state structure's ->gp_start field is read locklessly, so this commit adds the WRITE_ONCE() to an update in order to provide proper documentation and READ_ONCE()/WRITE_ONCE() pairing. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 15:58:23 -08:00
Paul E. McKenney	57721fd15a	rcu: Remove dead code from rcu_segcblist_insert_pend_cbs() The rcu_segcblist_insert_pend_cbs() function currently (partially) initializes the rcu_cblist that it pulls callbacks from. However, all the resulting stores are dead because all callers pass in the address of an on-stack cblist that is not used afterwards. This commit therefore removes this pointless initialization. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 15:58:23 -08:00
Eric Dumazet	90c018942c	timer: Use hlist_unhashed_lockless() in timer_pending() The timer_pending() function is mostly used in lockless contexts, so Without proper annotations, KCSAN might detect a data-race [1]. Using hlist_unhashed_lockless() instead of hand-coding it seems appropriate (as suggested by Paul E. McKenney). [1] BUG: KCSAN: data-race in del_timer / detach_if_pending write to 0xffff88808697d870 of 8 bytes by task 10 on cpu 0: __hlist_del include/linux/list.h:764 [inline] detach_timer kernel/time/timer.c:815 [inline] detach_if_pending+0xcd/0x2d0 kernel/time/timer.c:832 try_to_del_timer_sync+0x60/0xb0 kernel/time/timer.c:1226 del_timer_sync+0x6b/0xa0 kernel/time/timer.c:1365 schedule_timeout+0x2d2/0x6e0 kernel/time/timer.c:1896 rcu_gp_fqs_loop+0x37c/0x580 kernel/rcu/tree.c:1639 rcu_gp_kthread+0x143/0x230 kernel/rcu/tree.c:1799 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352 read to 0xffff88808697d870 of 8 bytes by task 12060 on cpu 1: del_timer+0x3b/0xb0 kernel/time/timer.c:1198 sk_stop_timer+0x25/0x60 net/core/sock.c:2845 inet_csk_clear_xmit_timers+0x69/0xa0 net/ipv4/inet_connection_sock.c:523 tcp_clear_xmit_timers include/net/tcp.h:606 [inline] tcp_v4_destroy_sock+0xa3/0x3f0 net/ipv4/tcp_ipv4.c:2096 inet_csk_destroy_sock+0xf4/0x250 net/ipv4/inet_connection_sock.c:836 tcp_close+0x6f3/0x970 net/ipv4/tcp.c:2497 inet_release+0x86/0x100 net/ipv4/af_inet.c:427 __sock_release+0x85/0x160 net/socket.c:590 sock_close+0x24/0x30 net/socket.c:1268 __fput+0x1e1/0x520 fs/file_table.c:280 ____fput+0x1f/0x30 fs/file_table.c:313 task_work_run+0xf6/0x130 kernel/task_work.c:113 tracehook_notify_resume include/linux/tracehook.h:188 [inline] exit_to_usermode_loop+0x2b4/0x2c0 arch/x86/entry/common.c:163 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 12060 Comm: syz-executor.5 Not tainted 5.4.0-rc3+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> [ paulmck: Pulled in Eric's later amendments. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 15:58:22 -08:00
Paul E. McKenney	3ca3b0e2cb	rcu: Add *_ONCE() to rcu_node ->boost_kthread_status The rcu_node structure's ->boost_kthread_status field is accessed locklessly, so this commit causes all updates to use WRITE_ONCE() and all reads to use READ_ONCE(). This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 15:58:22 -08:00
Paul E. McKenney	2a2ae872ef	rcu: Add *_ONCE() to rcu_data ->rcu_forced_tick The rcu_data structure's ->rcu_forced_tick field is read locklessly, so this commit adds WRITE_ONCE() to all updates and READ_ONCE() to all lockless reads. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 15:58:22 -08:00
Paul E. McKenney	a5b8950180	rcu: Add READ_ONCE() to rcu_data ->gpwrap The rcu_data structure's ->gpwrap field is read locklessly, and so this commit adds the required READ_ONCE() to a pair of laods in order to avoid destructive compiler optimizations. This data race was reported by KCSAN. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 15:58:22 -08:00
SeongJae Park	65bb0dc437	rcu: Fix typos in file-header comments Convert to plural and add a note that this is for Tree RCU. Signed-off-by: SeongJae Park <sjpark@amazon.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 15:58:22 -08:00
Paul E. McKenney	8ff37290d6	rcu: Add *_ONCE() for grace-period progress indicators The various RCU structures' ->gp_seq, ->gp_seq_needed, ->gp_req_activity, and ->gp_activity fields are read locklessly, so they must be updated with WRITE_ONCE() and, when read locklessly, with READ_ONCE(). This commit makes these changes. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 15:58:22 -08:00
Paul E. McKenney	bfeebe2421	rcu: Add READ_ONCE() to rcu_segcblist ->tails[] The rcu_segcblist structure's ->tails[] array entries are read locklessly, so this commit adds the READ_ONCE() to a load in order to avoid destructive compiler optimizations. This data race was reported by KCSAN. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 15:58:22 -08:00
Paul E. McKenney	0050c7b2d2	locking/rtmutex: rcu: Add WRITE_ONCE() to rt_mutex ->owner The rt_mutex structure's ->owner field is read locklessly, so this commit adds the WRITE_ONCE() to an update in order to provide proper documentation and READ_ONCE()/WRITE_ONCE() pairing. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Will Deacon <will@kernel.org>	2020-02-20 15:58:22 -08:00
Paul E. McKenney	105abf82b0	rcu: Add WRITE_ONCE() to rcu_node ->qsmaskinitnext The rcu_state structure's ->qsmaskinitnext field is read locklessly, so this commit adds the WRITE_ONCE() to an update in order to provide proper documentation and READ_ONCE()/WRITE_ONCE() pairing. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely for systems not doing incessant CPU-hotplug operations. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 15:58:22 -08:00
Paul E. McKenney	2906d2154c	rcu: Add WRITE_ONCE() to rcu_state ->gp_req_activity The rcu_state structure's ->gp_req_activity field is read locklessly, so this commit adds the WRITE_ONCE() to an update in order to provide proper documentation and READ_ONCE()/WRITE_ONCE() pairing. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 15:58:22 -08:00
Paul E. McKenney	0937d04573	rcu: Add READ_ONCE() to rcu_node ->gp_seq The rcu_node structure's ->gp_seq field is read locklessly, so this commit adds the READ_ONCE() to several loads in order to avoid destructive compiler optimizations. This data race was reported by KCSAN. Not appropriate for backporting because this affects only tracing and warnings. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 15:58:22 -08:00
Paul E. McKenney	b0c18c8773	rcu: Add WRITE_ONCE to rcu_node ->exp_seq_rq store The rcu_node structure's ->exp_seq_rq field is read locklessly, so this commit adds the WRITE_ONCE() to a load in order to provide proper documentation and READ_ONCE()/WRITE_ONCE() pairing. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 15:58:22 -08:00
Paul E. McKenney	7672d647dd	rcu: Add WRITE_ONCE() to rcu_node ->qsmask update The rcu_node structure's ->qsmask field is read locklessly, so this commit adds the WRITE_ONCE() to an update in order to provide proper documentation and READ_ONCE()/WRITE_ONCE() pairing. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 15:58:22 -08:00
Paul E. McKenney	8a7e8f5171	rcu: Provide debug symbols and line numbers in KCSAN runs This commit adds "-g -fno-omit-frame-pointer" to ease interpretation of KCSAN output, but only for CONFIG_KCSAN=y kerrnels. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 15:58:21 -08:00
Paul E. McKenney	24bb9eccf7	rcu: Fix exp_funnel_lock()/rcu_exp_wait_wake() datarace The rcu_node structure's ->exp_seq_rq field is accessed locklessly, so updates must use WRITE_ONCE(). This commit therefore adds the needed WRITE_ONCE() invocation where it was missed. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 15:58:21 -08:00
Paul E. McKenney	82dd8419e2	rcu: Warn on for_each_leaf_node_cpu_mask() from non-leaf The for_each_leaf_node_cpu_mask() and for_each_leaf_node_possible_cpu() macros must be invoked only on leaf rcu_node structures. Failing to abide by this restriction can result in infinite loops on systems with more than 64 CPUs (or for more than 32 CPUs on 32-bit systems). This commit therefore adds WARN_ON_ONCE() calls to make misuse of these two macros easier to debug. Reported-by: Qian Cai <cai@lca.pw> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 15:58:21 -08:00
Masami Hiramatsu	d8a953ddde	bootconfig: Set CONFIG_BOOT_CONFIG=n by default Set CONFIG_BOOT_CONFIG=n by default. This also warns user if CONFIG_BOOT_CONFIG=n but "bootconfig" is given in the kernel command line. Link: http://lkml.kernel.org/r/158220111291.26565.9036889083940367969.stgit@devnote2 Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-02-20 17:52:12 -05:00
Masami Hiramatsu	7ab215f22d	tracing: Clear trace_state when starting trace Clear trace_state data structure when starting trace in __synth_event_trace_start() internal function. Currently trace_state is initialized only in the synth_event_trace_start() API, but the trace_state in synth_event_trace() and synth_event_trace_array() are on the stack without initialization. This means those APIs will see wrong parameters and wil skip closing process in __synth_event_trace_end() because trace_state->disabled may be !0. Link: http://lkml.kernel.org/r/158193315899.8868.1781259176894639952.stgit@devnote2 Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-02-20 17:48:59 -05:00
Steven Rostedt (VMware)	78041c0c9e	tracing: Disable trace_printk() on post poned tests The tracing seftests checks various aspects of the tracing infrastructure, and one is filtering. If trace_printk() is active during a self test, it can cause the filtering to fail, which will disable that part of the trace. To keep the selftests from failing because of trace_printk() calls, trace_printk() checks the variable tracing_selftest_running, and if set, it does not write to the tracing buffer. As some tracers were registered earlier in boot, the selftest they triggered would fail because not all the infrastructure was set up for the full selftest. Thus, some of the tests were post poned to when their infrastructure was ready (namely file system code). The postpone code did not set the tracing_seftest_running variable, and could fail if a trace_printk() was added and executed during their run. Cc: stable@vger.kernel.org Fixes: `9afecfbb95` ("tracing: Postpone tracer start-up tests till the system is more robust") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-02-20 17:43:58 -05:00
Steven Rostedt (VMware)	3c18a9be7c	tracing: Have synthetic event test use raw_smp_processor_id() The test code that tests synthetic event creation pushes in as one of its test fields the current CPU using "smp_processor_id()". As this is just something to see if the value is correctly passed in, and the actual CPU used does not matter, use raw_smp_processor_id(), otherwise with debug preemption enabled, a warning happens as the smp_processor_id() is called without preemption enabled. Link: http://lkml.kernel.org/r/20200220162950.35162579@gandalf.local.home Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-02-20 17:43:41 -05:00
Tom Zanussi	784bd0847e	tracing: Fix number printing bug in print_synth_event() Fix a varargs-related bug in print_synth_event() which resulted in strange output and oopses on 32-bit x86 systems. The problem is that trace_seq_printf() expects the varargs to match the format string, but print_synth_event() was always passing u64 values regardless. This results in unspecified behavior when unpacking with va_arg() in trace_seq_printf(). Add a function that takes the size into account when calling trace_seq_printf(). Before: modprobe-1731 [003] .... 919.039758: gen_synth_test: next_pid_field=777(null)next_comm_field=hula hoops ts_ns=1000000 ts_ms=1000 cpu=3(null)my_string_field=thneed my_int_field=598(null) After: insmod-1136 [001] .... 36.634590: gen_synth_test: next_pid_field=777 next_comm_field=hula hoops ts_ns=1000000 ts_ms=1000 cpu=1 my_string_field=thneed my_int_field=598 Link: http://lkml.kernel.org/r/a9b59eb515dbbd7d4abe53b347dccf7a8e285657.1581720155.git.zanussi@kernel.org Reported-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-02-20 16:24:17 -05:00
Tom Zanussi	3843083772	tracing: Check that number of vals matches number of synth event fields Commit 7276531d4036('tracing: Consolidate trace() functions') inadvertently dropped the synth_event_trace() and synth_event_trace_array() checks that verify the number of values passed in matches the number of fields in the synthetic event being traced, so add them back. Link: http://lkml.kernel.org/r/32819cac708714693669e0dfe10fe9d935e94a16.1581720155.git.zanussi@kernel.org Signed-off-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-02-20 16:24:17 -05:00
Tom Zanussi	1d9d4c9019	tracing: Make synth_event trace functions endian-correct synth_event_trace(), synth_event_trace_array() and __synth_event_add_val() write directly into the trace buffer and need to take endianness into account, like trace_event_raw_event_synth() does. Link: http://lkml.kernel.org/r/2011354355e405af9c9d28abba430d1f5ff7771a.1581720155.git.zanussi@kernel.org Signed-off-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-02-20 16:24:17 -05:00
Tom Zanussi	279eef0531	tracing: Make sure synth_event_trace() example always uses u64 synth_event_trace() is the varargs version of synth_event_trace_array(), which takes an array of u64, as do synth_event_add_val() et al. To not only be consistent with those, but also to address the fact that synth_event_trace() expects every arg to be of the same type since it doesn't also pass in e.g. a format string, the caller needs to make sure all args are of the same type, u64. u64 is used because it needs to accomodate the largest type available in synthetic events, which is u64. This fixes the bug reported by the kernel test robot/Rong Chen. Link: https://lore.kernel.org/lkml/20200212113444.GS12867@shao2-debian/ Link: http://lkml.kernel.org/r/894c4e955558b521210ee0642ba194a9e603354c.1581720155.git.zanussi@kernel.org Fixes: `9fe41efaca` ("tracing: Add synth event generation test module") Reported-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-02-20 16:24:17 -05:00
Morten Rasmussen	000619680c	sched/fair: Remove wake_cap() Capacity-awareness in the wake-up path previously involved disabling wake_affine in certain scenarios. We have just made select_idle_sibling() capacity-aware, so this isn't needed anymore. Remove wake_cap() entirely. Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> [Changelog tweaks] Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> [Changelog tweaks] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200206191957.12325-5-valentin.schneider@arm.com	2020-02-20 21:03:15 +01:00
Valentin Schneider	f8459197e7	sched/core: Remove for_each_lower_domain() The last remaining user of this macro has just been removed, get rid of it. Suggested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Quentin Perret <qperret@google.com> Link: https://lkml.kernel.org/r/20200206191957.12325-4-valentin.schneider@arm.com	2020-02-20 21:03:15 +01:00
Morten Rasmussen	a526d46679	sched/topology: Remove SD_BALANCE_WAKE on asymmetric capacity systems SD_BALANCE_WAKE was previously added to lower sched_domain levels on asymmetric CPU capacity systems by commit: `9ee1cda5ee` ("sched/core: Enable SD_BALANCE_WAKE for asymmetric capacity systems") to enable the use of find_idlest_cpu() and friends to find an appropriate CPU for tasks. That responsibility has now been shifted to select_idle_sibling() and friends, and hence the flag can be removed. Note that this causes asymmetric CPU capacity systems to no longer enter the slow wakeup path (find_idlest_cpu()) on wakeups - only on execs and forks (which is aligned with all other mainline topologies). Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> [Changelog tweaks] Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Quentin Perret <qperret@google.com> Link: https://lkml.kernel.org/r/20200206191957.12325-3-valentin.schneider@arm.com	2020-02-20 21:03:14 +01:00
Morten Rasmussen	b7a331615d	sched/fair: Add asymmetric CPU capacity wakeup scan Issue ===== On asymmetric CPU capacity topologies, we currently rely on wake_cap() to drive select_task_rq_fair() towards either: - its slow-path (find_idlest_cpu()) if either the previous or current (waking) CPU has too little capacity for the waking task - its fast-path (select_idle_sibling()) otherwise Commit: `3273163c67` ("sched/fair: Let asymmetric CPU configurations balance at wake-up") points out that this relies on the assumption that "[...]the CPU capacities within an SD_SHARE_PKG_RESOURCES domain (sd_llc) are homogeneous". This assumption no longer holds on newer generations of big.LITTLE systems (DynamIQ), which can accommodate CPUs of different compute capacity within a single LLC domain. To hopefully paint a better picture, a regular big.LITTLE topology would look like this: +---------+ +---------+ \| L2 \| \| L2 \| +----+----+ +----+----+ \|CPU0\|CPU1\| \|CPU2\|CPU3\| +----+----+ +----+----+ ^^^ ^^^ LITTLEs bigs which would result in the following scheduler topology: DIE [ ] <- sd_asym_cpucapacity MC [ ] [ ] <- sd_llc 0 1 2 3 Conversely, a DynamIQ topology could look like: +-------------------+ \| L3 \| +----+----+----+----+ \| L2 \| L2 \| L2 \| L2 \| +----+----+----+----+ \|CPU0\|CPU1\|CPU2\|CPU3\| +----+----+----+----+ ^^^^^ ^^^^^ LITTLEs bigs which would result in the following scheduler topology: MC [ ] <- sd_llc, sd_asym_cpucapacity 0 1 2 3 What this means is that, on DynamIQ systems, we could pass the wake_cap() test (IOW presume the waking task fits on the CPU capacities of some LLC domain), thus go through select_idle_sibling(). This function operates on an LLC domain, which here spans both bigs and LITTLEs, so it could very well pick a CPU of too small capacity for the task, despite there being fitting idle CPUs - it very much depends on the CPU iteration order, on which we have absolutely no guarantees capacity-wise. Implementation ============== Introduce yet another select_idle_sibling() helper function that takes CPU capacity into account. The policy is to pick the first idle CPU which is big enough for the task (task_util * margin < cpu_capacity). If no idle CPU is big enough, we pick the idle one with the highest capacity. Unlike other select_idle_sibling() helpers, this one operates on the sd_asym_cpucapacity sched_domain pointer, which is guaranteed to span all known CPU capacities in the system. As such, this will work for both "legacy" big.LITTLE (LITTLEs & bigs split at MC, joined at DIE) and for newer DynamIQ systems (e.g. LITTLEs and bigs in the same MC domain). Note that this limits the scope of select_idle_sibling() to select_idle_capacity() for asymmetric CPU capacity systems - the LLC domain will not be scanned, and no further heuristic will be applied. Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Quentin Perret <qperret@google.com> Link: https://lkml.kernel.org/r/20200206191957.12325-2-valentin.schneider@arm.com	2020-02-20 21:03:14 +01:00
Scott Wood	82e0516ce3	sched/core: Remove duplicate assignment in sched_tick_remote() A redundant "curr = rq->curr" was added; remove it. Fixes: `ebc0f83c78` ("timers/nohz: Update NOHZ load in remote tick") Signed-off-by: Scott Wood <swood@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/1580776558-12882-1-git-send-email-swood@redhat.com	2020-02-20 21:03:13 +01:00
Alexandre Belloni	b0c609ab20	PM / hibernate: fix typo "reserverd_size" -> "reserved_size" Fix a mistake in a variable name in a comment. Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2020-02-20 11:58:01 +01:00
David S. Miller	41f57cfde1	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Alexei Starovoitov says: ==================== pull-request: bpf 2020-02-19 The following pull-request contains BPF updates for your net tree. We've added 10 non-merge commits during the last 10 day(s) which contain a total of 10 files changed, 93 insertions(+), 31 deletions(-). The main changes are: 1) batched bpf hashtab fixes from Brian and Yonghong. 2) various selftests and libbpf fixes. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-19 16:42:35 -08:00
Yonghong Song	b9aff38de2	bpf: Fix a potential deadlock with bpf_map_do_batch Commit `057996380a` ("bpf: Add batch ops to all htab bpf map") added lookup_and_delete batch operation for hash table. The current implementation has bpf_lru_push_free() inside the bucket lock, which may cause a deadlock. syzbot reports: -> #2 (&htab->buckets[i].lock#2){....}: __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline] _raw_spin_lock_irqsave+0x95/0xcd kernel/locking/spinlock.c:159 htab_lru_map_delete_node+0xce/0x2f0 kernel/bpf/hashtab.c:593 __bpf_lru_list_shrink_inactive kernel/bpf/bpf_lru_list.c:220 [inline] __bpf_lru_list_shrink+0xf9/0x470 kernel/bpf/bpf_lru_list.c:266 bpf_lru_list_pop_free_to_local kernel/bpf/bpf_lru_list.c:340 [inline] bpf_common_lru_pop_free kernel/bpf/bpf_lru_list.c:447 [inline] bpf_lru_pop_free+0x87c/0x1670 kernel/bpf/bpf_lru_list.c:499 prealloc_lru_pop+0x2c/0xa0 kernel/bpf/hashtab.c:132 __htab_lru_percpu_map_update_elem+0x67e/0xa90 kernel/bpf/hashtab.c:1069 bpf_percpu_hash_update+0x16e/0x210 kernel/bpf/hashtab.c:1585 bpf_map_update_value.isra.0+0x2d7/0x8e0 kernel/bpf/syscall.c:181 generic_map_update_batch+0x41f/0x610 kernel/bpf/syscall.c:1319 bpf_map_do_batch+0x3f5/0x510 kernel/bpf/syscall.c:3348 __do_sys_bpf+0x9b7/0x41e0 kernel/bpf/syscall.c:3460 __se_sys_bpf kernel/bpf/syscall.c:3355 [inline] __x64_sys_bpf+0x73/0xb0 kernel/bpf/syscall.c:3355 do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x49/0xbe -> #0 (&loc_l->lock){....}: check_prev_add kernel/locking/lockdep.c:2475 [inline] check_prevs_add kernel/locking/lockdep.c:2580 [inline] validate_chain kernel/locking/lockdep.c:2970 [inline] __lock_acquire+0x2596/0x4a00 kernel/locking/lockdep.c:3954 lock_acquire+0x190/0x410 kernel/locking/lockdep.c:4484 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline] _raw_spin_lock_irqsave+0x95/0xcd kernel/locking/spinlock.c:159 bpf_common_lru_push_free kernel/bpf/bpf_lru_list.c:516 [inline] bpf_lru_push_free+0x250/0x5b0 kernel/bpf/bpf_lru_list.c:555 __htab_map_lookup_and_delete_batch+0x8d4/0x1540 kernel/bpf/hashtab.c:1374 htab_lru_map_lookup_and_delete_batch+0x34/0x40 kernel/bpf/hashtab.c:1491 bpf_map_do_batch+0x3f5/0x510 kernel/bpf/syscall.c:3348 __do_sys_bpf+0x1f7d/0x41e0 kernel/bpf/syscall.c:3456 __se_sys_bpf kernel/bpf/syscall.c:3355 [inline] __x64_sys_bpf+0x73/0xb0 kernel/bpf/syscall.c:3355 do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x49/0xbe Possible unsafe locking scenario: CPU0 CPU2 ---- ---- lock(&htab->buckets[i].lock#2); lock(&l->lock); lock(&htab->buckets[i].lock#2); lock(&loc_l->lock); * DEADLOCK * To fix the issue, for htab_lru_map_lookup_and_delete_batch() in CPU0, let us do bpf_lru_push_free() out of the htab bucket lock. This can avoid the above deadlock scenario. Fixes: `057996380a` ("bpf: Add batch ops to all htab bpf map") Reported-by: syzbot+a38ff3d9356388f2fb83@syzkaller.appspotmail.com Reported-by: syzbot+122b5421d14e68f29cd1@syzkaller.appspotmail.com Suggested-by: Hillf Danton <hdanton@sina.com> Suggested-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: Brian Vazquez <brianvv@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200219234757.3544014-1-yhs@fb.com	2020-02-19 16:01:25 -08:00

... 3 4 5 6 7 ...

32833 Commits