Pull livepatch updates from Jiri Kosina:
- a per-task consistency model is being added for architectures that
support reliable stack dumping (extending this, currently rather
trivial set, is currently in the works).
This extends the nature of the types of patches that can be applied
by live patching infrastructure. The code stems from the design
proposal made [1] back in November 2014. It's a hybrid of SUSE's
kGraft and RH's kpatch, combining advantages of both: it uses
kGraft's per-task consistency and syscall barrier switching combined
with kpatch's stack trace switching. There are also a number of
fallback options which make it quite flexible.
Most of the heavy lifting done by Josh Poimboeuf with help from
Miroslav Benes and Petr Mladek
[1] https://lkml.kernel.org/r/20141107140458.GA21774@suse.cz
- module load time patch optimization from Zhou Chengming
- a few assorted small fixes
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
livepatch: add missing printk newlines
livepatch: Cancel transition a safe way for immediate patches
livepatch: Reduce the time of finding module symbols
livepatch: make klp_mutex proper part of API
livepatch: allow removal of a disabled patch
livepatch: add /proc/<pid>/patch_state
livepatch: change to a per-task consistency model
livepatch: store function sizes
livepatch: use kstrtobool() in enabled_store()
livepatch: move patching functions into patch.c
livepatch: remove unnecessary object loaded check
livepatch: separate enabled and patched states
livepatch/s390: add TIF_PATCH_PENDING thread flag
livepatch/s390: reorganize TIF thread flag bits
livepatch/powerpc: add TIF_PATCH_PENDING thread flag
livepatch/x86: add TIF_PATCH_PENDING thread flag
livepatch: create temporary klp_update_patch_state() stub
x86/entry: define _TIF_ALLWORK_MASK flags explicitly
stacktrace/x86: add function for detecting reliable stack traces
Pull networking updates from David Millar:
"Here are some highlights from the 2065 networking commits that
happened this development cycle:
1) XDP support for IXGBE (John Fastabend) and thunderx (Sunil Kowuri)
2) Add a generic XDP driver, so that anyone can test XDP even if they
lack a networking device whose driver has explicit XDP support
(me).
3) Sparc64 now has an eBPF JIT too (me)
4) Add a BPF program testing framework via BPF_PROG_TEST_RUN (Alexei
Starovoitov)
5) Make netfitler network namespace teardown less expensive (Florian
Westphal)
6) Add symmetric hashing support to nft_hash (Laura Garcia Liebana)
7) Implement NAPI and GRO in netvsc driver (Stephen Hemminger)
8) Support TC flower offload statistics in mlxsw (Arkadi Sharshevsky)
9) Multiqueue support in stmmac driver (Joao Pinto)
10) Remove TCP timewait recycling, it never really could possibly work
well in the real world and timestamp randomization really zaps any
hint of usability this feature had (Soheil Hassas Yeganeh)
11) Support level3 vs level4 ECMP route hashing in ipv4 (Nikolay
Aleksandrov)
12) Add socket busy poll support to epoll (Sridhar Samudrala)
13) Netlink extended ACK support (Johannes Berg, Pablo Neira Ayuso,
and several others)
14) IPSEC hw offload infrastructure (Steffen Klassert)"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2065 commits)
tipc: refactor function tipc_sk_recv_stream()
tipc: refactor function tipc_sk_recvmsg()
net: thunderx: Optimize page recycling for XDP
net: thunderx: Support for XDP header adjustment
net: thunderx: Add support for XDP_TX
net: thunderx: Add support for XDP_DROP
net: thunderx: Add basic XDP support
net: thunderx: Cleanup receive buffer allocation
net: thunderx: Optimize CQE_TX handling
net: thunderx: Optimize RBDR descriptor handling
net: thunderx: Support for page recycling
ipx: call ipxitf_put() in ioctl error path
net: sched: add helpers to handle extended actions
qed*: Fix issues in the ptp filter config implementation.
qede: Fix concurrency issue in PTP Tx path processing.
stmmac: Add support for SIMATIC IOT2000 platform
net: hns: fix ethtool_get_strings overflow in hns driver
tcp: fix wraparound issue in tcp_lp
bpf, arm64: fix jit branch offset related to ldimm64
bpf, arm64: implement jiting of BPF_XADD
...
Pull crypto updates from Herbert Xu:
"Here is the crypto update for 4.12:
API:
- Add batch registration for acomp/scomp
- Change acomp testing to non-unique compressed result
- Extend algorithm name limit to 128 bytes
- Require setkey before accept(2) in algif_aead
Algorithms:
- Add support for deflate rfc1950 (zlib)
Drivers:
- Add accelerated crct10dif for powerpc
- Add crc32 in stm32
- Add sha384/sha512 in ccp
- Add 3des/gcm(aes) for v5 devices in ccp
- Add Queue Interface (QI) backend support in caam
- Add new Exynos RNG driver
- Add ThunderX ZIP driver
- Add driver for hardware random generator on MT7623 SoC"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (101 commits)
crypto: stm32 - Fix OF module alias information
crypto: algif_aead - Require setkey before accept(2)
crypto: scomp - add support for deflate rfc1950 (zlib)
crypto: scomp - allow registration of multiple scomps
crypto: ccp - Change ISR handler method for a v5 CCP
crypto: ccp - Change ISR handler method for a v3 CCP
crypto: crypto4xx - rename ce_ring_contol to ce_ring_control
crypto: testmgr - Allow ecb(cipher_null) in FIPS mode
Revert "crypto: arm64/sha - Add constant operand modifier to ASM_EXPORT"
crypto: ccp - Disable interrupts early on unload
crypto: ccp - Use only the relevant interrupt bits
hwrng: mtk - Add driver for hardware random generator on MT7623 SoC
dt-bindings: hwrng: Add Mediatek hardware random generator bindings
crypto: crct10dif-vpmsum - Fix missing preempt_disable()
crypto: testmgr - replace compression known answer test
crypto: acomp - allow registration of multiple acomps
hwrng: n2 - Use devm_kcalloc() in n2rng_probe()
crypto: chcr - Fix error handling related to 'chcr_alloc_shash'
padata: get_next is never NULL
crypto: exynos - Add new Exynos RNG driver
...
Pull splice updates from Al Viro:
"These actually missed the last cycle; the branch itself is from last
December"
* 'work.splice' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
make nr_pages calculation in default_file_splice_read() a bit less ugly
splice/tee/vmsplice: validate flags
splice_pipe_desc: kill ->flags
remove spd_release_page()
Pull x86 mm updates from Ingo Molnar:
"The main x86 MM changes in this cycle were:
- continued native kernel PCID support preparation patches to the TLB
flushing code (Andy Lutomirski)
- various fixes related to 32-bit compat syscall returning address
over 4Gb in applications, launched from 64-bit binaries - motivated
by C/R frameworks such as Virtuozzo. (Dmitry Safonov)
- continued Intel 5-level paging enablement: in particular the
conversion of x86 GUP to the generic GUP code. (Kirill A. Shutemov)
- x86/mpx ABI corner case fixes/enhancements (Joerg Roedel)
- ... plus misc updates, fixes and cleanups"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits)
mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash
x86/mm: Fix flush_tlb_page() on Xen
x86/mm: Make flush_tlb_mm_range() more predictable
x86/mm: Remove flush_tlb() and flush_tlb_current_task()
x86/vm86/32: Switch to flush_tlb_mm_range() in mark_screen_rdonly()
x86/mm/64: Fix crash in remove_pagetable()
Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation"
x86/boot/e820: Remove a redundant self assignment
x86/mm: Fix dump pagetables for 4 levels of page tables
x86/mpx, selftests: Only check bounds-vs-shadow when we keep shadow
x86/mpx: Correctly report do_mpx_bt_fault() failures to user-space
Revert "x86/mm/numa: Remove numa_nodemask_from_meminfo()"
x86/espfix: Add support for 5-level paging
x86/kasan: Extend KASAN to support 5-level paging
x86/mm: Add basic defines/helpers for CONFIG_X86_5LEVEL=y
x86/paravirt: Add 5-level support to the paravirt code
x86/mm: Define virtual memory map for 5-level paging
x86/asm: Remove __VIRTUAL_MASK_SHIFT==47 assert
x86/boot: Detect 5-level paging support
x86/mm/numa: Remove numa_nodemask_from_meminfo()
...
Pull x86 boot updates from Ingo Molnar:
"The biggest changes in this cycle were:
- reworking of the e820 code: separate in-kernel and boot-ABI data
structures and apply a whole range of cleanups to the kernel side.
No change in functionality.
- enable KASLR by default: it's used by all major distros and it's
out of the experimental stage as well.
- ... misc fixes and cleanups"
* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (63 commits)
x86/KASLR: Fix kexec kernel boot crash when KASLR randomization fails
x86/reboot: Turn off KVM when halting a CPU
x86/boot: Fix BSS corruption/overwrite bug in early x86 kernel startup
x86: Enable KASLR by default
boot/param: Move next_arg() function to lib/cmdline.c for later reuse
x86/boot: Fix Sparse warning by including required header file
x86/boot/64: Rename start_cpu()
x86/xen: Update e820 table handling to the new core x86 E820 code
x86/boot: Fix pr_debug() API braindamage
xen, x86/headers: Add <linux/device.h> dependency to <asm/xen/page.h>
x86/boot/e820: Simplify e820__update_table()
x86/boot/e820: Separate the E820 ABI structures from the in-kernel structures
x86/boot/e820: Fix and clean up e820_type switch() statements
x86/boot/e820: Rename the remaining E820 APIs to the e820__*() prefix
x86/boot/e820: Remove unnecessary #include's
x86/boot/e820: Rename e820_mark_nosave_regions() to e820__register_nosave_regions()
x86/boot/e820: Rename e820_reserve_resources*() to e820__reserve_resources*()
x86/boot/e820: Use bool in query APIs
x86/boot/e820: Document e820__reserve_setup_data()
x86/boot/e820: Clean up __e820__update_table() et al
...
Pull perf updates from Ingo Molnar:
"The main changes in this cycle were:
Kernel side changes:
- Kprobes and uprobes changes:
- Make their trampolines read-only while they are used
- Make UPROBES_EVENTS default-y which is the distro practice
- Apply misc fixes and robustization to probe point insertion.
- add support for AMD IOMMU events
- extend hw events on Intel Goldmont CPUs
- ... plus misc fixes and updates.
Tooling side changes:
- support s390 jump instructions in perf annotate (Christian
Borntraeger)
- vendor hardware events updates (Andi Kleen)
- add argument support for SDT events in powerpc (Ravi Bangoria)
- beautify the statx syscall arguments in 'perf trace' (Arnaldo
Carvalho de Melo)
- handle inline functions in callchains (Jin Yao)
- enable sorting by srcline as key (Milian Wolff)
- add 'brstackinsn' field in 'perf script' to reuse the x86
instruction decoder used in the Intel PT code to study hot paths to
samples (Andi Kleen)
- add PERF_RECORD_NAMESPACES so that the kernel can record
information required to associate samples to namespaces, helping in
container problem characterization. (Hari Bathini)
- allow sorting by symbol_size in 'perf report' and 'perf top'
(Charles Baylis)
- in perf stat, make system wide (-a) the default option if no target
was specified and one of following conditions is met:
- no workload specified (current behaviour)
- a workload is specified but all requested events are system wide
ones, like uncore ones. (Jiri Olsa)
- ... plus lots of other updates, enhancements, cleanups and fixes"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (235 commits)
perf tools: Fix the code to strip command name
tools arch x86: Sync cpufeatures.h
tools arch: Sync arch/x86/lib/memcpy_64.S with the kernel
tools: Update asm-generic/mman-common.h copy from the kernel
perf tools: Use just forward declarations for struct thread where possible
perf tools: Add the right header to obtain PERF_ALIGN()
perf tools: Remove poll.h and wait.h from util.h
perf tools: Remove string.h, unistd.h and sys/stat.h from util.h
perf tools: Remove stale prototypes from builtin.h
perf tools: Remove string.h from util.h
perf tools: Remove sys/ioctl.h from util.h
perf tools: Remove a few more needless includes from util.h
perf tools: Include sys/param.h where needed
perf callchain: Move callchain specific routines from util.[ch]
perf tools: Add compress.h for the *_decompress_to_file() headers
perf mem: Fix display of data source snoop indication
perf debug: Move dump_stack() and sighandler_dump_stack() to debug.h
perf kvm: Make function only used by 'perf kvm' static
perf tools: Move timestamp routines from util.h to time-utils.h
perf tools: Move units conversion/formatting routines to separate object
...
Pull locking updates from Ingo Molnar:
"The main changes in this cycle were:
- a big round of FUTEX_UNLOCK_PI improvements, fixes, cleanups and
general restructuring
- lockdep updates such as new checks for lock_downgrade()
- introduce the new atomic_try_cmpxchg() locking API and use it to
optimize refcount code generation
- ... plus misc fixes, updates and cleanups"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
MAINTAINERS: Add FUTEX SUBSYSTEM
futex: Clarify mark_wake_futex memory barrier usage
futex: Fix small (and harmless looking) inconsistencies
futex: Avoid freeing an active timer
rtmutex: Plug preempt count leak in rt_mutex_futex_unlock()
rtmutex: Fix more prio comparisons
rtmutex: Fix PI chain order integrity
sched,tracing: Update trace_sched_pi_setprio()
sched/rtmutex: Refactor rt_mutex_setprio()
rtmutex: Clean up
sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update
sched/rtmutex/deadline: Fix a PI crash for deadline tasks
rtmutex: Deboost before waking up the top waiter
locking/ww-mutex: Limit stress test to 2 seconds
locking/atomic: Fix atomic_try_cmpxchg() semantics
lockdep: Fix per-cpu static objects
futex: Drop hb->lock before enqueueing on the rtmutex
futex: Futex_unlock_pi() determinism
futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()
futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()
...
Pull timer updates from Thomas Gleixner:
"The timer departement delivers:
- more year 2038 rework
- a massive rework of the arm achitected timer
- preparatory patches to allow NTP correction of clock event devices
to avoid early expiry
- the usual pile of fixes and enhancements all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (91 commits)
timer/sysclt: Restrict timer migration sysctl values to 0 and 1
arm64/arch_timer: Mark errata handlers as __maybe_unused
Clocksource/mips-gic: Remove redundant non devicetree init
MIPS/Malta: Probe gic-timer via devicetree
clocksource: Use GENMASK_ULL in definition of CLOCKSOURCE_MASK
acpi/arm64: Add SBSA Generic Watchdog support in GTDT driver
clocksource: arm_arch_timer: add GTDT support for memory-mapped timer
acpi/arm64: Add memory-mapped timer support in GTDT driver
clocksource: arm_arch_timer: simplify ACPI support code.
acpi/arm64: Add GTDT table parse driver
clocksource: arm_arch_timer: split MMIO timer probing.
clocksource: arm_arch_timer: add structs to describe MMIO timer
clocksource: arm_arch_timer: move arch_timer_needs_of_probing into DT init call
clocksource: arm_arch_timer: refactor arch_timer_needs_probing
clocksource: arm_arch_timer: split dt-only rate handling
x86/uv/time: Set ->min_delta_ticks and ->max_delta_ticks
unicore32/time: Set ->min_delta_ticks and ->max_delta_ticks
um/time: Set ->min_delta_ticks and ->max_delta_ticks
tile/time: Set ->min_delta_ticks and ->max_delta_ticks
score/time: Set ->min_delta_ticks and ->max_delta_ticks
...
Pull irq updates from Thomas Gleixner:
"Nothing exciting from the irq side for this merge window:
- a new driver for a Mediatek SoC
- ACPI support for ARM GICV3
- support for shared nested interrupts
- the usual pile of fixes and updates all over te place"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits)
irqchip/mbigen: Fix return value check in mbigen_device_probe()
irqchip/mips-gic: Replace static map with dynamic
irqchip/mips-gic: Remove device IRQ domain
irqchip/mips-gic: Separate IPI reservation & usage tracking
genirq: Use irqd_get_trigger_type to compare the trigger type for shared IRQs
genirq: Use cpumask_available() for check of cpumask variable
cpumask: Add helper cpumask_available()
irqchip/irq-imx-gpcv2: Clear OF_POPULATED flag
irqchip/atmel-aic5: Handle suspend to RAM
irqchip: Add Mediatek mtk-cirq driver
dt-bindings: mtk-cirq: Add binding document
irqchip/gic-v3-its: Add IORT hook for platform MSI support
irqchip/mbigen: Add ACPI support
irqchip/mbigen: Introduce mbigen_of_create_domain()
irqchip/mbigen: Drop module owner
platform-msi: Make platform_msi_create_device_domain() ACPI aware
irqchip/gicv3-its: platform-msi: Scan MADT to create platform msi domain
irqchip/gicv3-its: platform-msi: Refactor its_pmsi_init() to prepare for ACPI
irqchip/gicv3-its: platform-msi: Refactor its_pmsi_prepare()
irqchip/gic-v3-its: Keep the include header files in alphabetic order
...
Pull uaccess unification updates from Al Viro:
"This is the uaccess unification pile. It's _not_ the end of uaccess
work, but the next batch of that will go into the next cycle. This one
mostly takes copy_from_user() and friends out of arch/* and gets the
zero-padding behaviour in sync for all architectures.
Dealing with the nocache/writethrough mess is for the next cycle;
fortunately, that's x86-only. Same for cleanups in iov_iter.c (I am
sold on access_ok() in there, BTW; just not in this pile), same for
reducing __copy_... callsites, strn*... stuff, etc. - there will be a
pile about as large as this one in the next merge window.
This one sat in -next for weeks. -3KLoC"
* 'work.uaccess' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (96 commits)
HAVE_ARCH_HARDENED_USERCOPY is unconditional now
CONFIG_ARCH_HAS_RAW_COPY_USER is unconditional now
m32r: switch to RAW_COPY_USER
hexagon: switch to RAW_COPY_USER
microblaze: switch to RAW_COPY_USER
get rid of padding, switch to RAW_COPY_USER
ia64: get rid of copy_in_user()
ia64: sanitize __access_ok()
ia64: get rid of 'segment' argument of __do_{get,put}_user()
ia64: get rid of 'segment' argument of __{get,put}_user_check()
ia64: add extable.h
powerpc: get rid of zeroing, switch to RAW_COPY_USER
esas2r: don't open-code memdup_user()
alpha: fix stack smashing in old_adjtimex(2)
don't open-code kernel_setsockopt()
mips: switch to RAW_COPY_USER
mips: get rid of tail-zeroing in primitives
mips: make copy_from_user() zero tail explicitly
mips: clean and reorder the forest of macros...
mips: consolidate __invoke_... wrappers
...
- Rework the intel_pstate driver's sysfs interface to make it
more straightforward and more intuitive (Rafael Wysocki).
- Make intel_pstate support all processors which advertise HWP
(hardware-managed P-states) to the kernel in all operation modes
and make it use the load-based P-state selection algorithm on a
wider range of systems in the active mode (Rafael Wysocki).
- Add cpufreq driver for Tegra186 (Mikko Perttunen).
- Add support for Gemini Lake SoCs to intel_pstate (David Box).
- Add support for MT8176 and MT817x to the Mediatek cpufreq driver
and clean up that driver a bit (Daniel Kurtz).
- Clean up intel_pstate and optimize it slightly (Rafael Wysocki).
- Update the schedutil cpufreq governor, mostly to fix a couple of
issues with it related to specific workloads, and rework its sysfs
tunable and initialization a bit (Rafael Wysocki, Viresh Kumar).
- Fix minor issues in the imx6q, dbx500 and qoriq cpufreq drivers
(Christophe Jaillet, Irina Tirdea, Leonard Crestez, Viresh Kumar,
YuanTian Tang).
- Add file patterns for cpufreq DT bindings to MAINTAINERS (Geert
Uytterhoeven).
- Add support for "always on" power domains to the genpd (generic
power domains) framework and clean up that code somewhat (Ulf
Hansson, Lina Iyer, Viresh Kumar).
- Fix minor issues in the powernv cpuidle driver and clean it up
(Anton Blanchard, Gautham Shenoy).
- Move the AnalyzeSuspend utility under tools/power/pm-graph/ and
add an analogous boot-profiling utility called AnalyzeBoot to it
(Todd Brandt).
- Add rk3328 support to the rockchip-io AVS (Adaptive Voltage
Scaling) driver (David Wu).
- Fix minor issues in the cpuidle core, the intel_pstate_tracer
utility, the devfreq framework and the PM core documentation
(Chanwoo Choi, Doug Smythies, Johan Hovold, Marcin Nowakowski).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJZB5gGAAoJEILEb/54YlRx+78QAJRu+xAA9KtW2+loNyV8iBOB
EFmQLrvz9jCDyYWsHE5huA1k6EVu5QE74HBfgDn4od9s1VqU1zWdEjqKYiaMwlCt
EHxYCZ4YKeF31O3P3CtearBz9IXrckRx/XZ3F1jRsGGWooWv7o3U6PPN9iREmCzi
9dB2j0UD4lCwrnpsDMrJ0GqLu4agn9pcIKDtu4VfszVwYtza0vOQvvlgg1fQS1jX
BnNfaxN0lmpSlxDjtWfM//hfLzEWK8NlHiKWJFPnWFxJIAX1QL2QKnznF/Tqi5N5
el9tQXCTRujlD7BLyl6FdsaowbiUUEcMqeh2k01Vz20WSmNDHIpfrV9MtzJ7biUK
/DopyShUrpgJclKNF7BARJAJc19+PMkv3HMnOnsnhOsBNBpJjsL6FZPA95MMjI0o
xmRQkixl31NWIMk60EIC6PaMLsxjhAWjiYi5D93bzkhnJTnQswvNb4ROPG+X2FCU
6YgBogsYKkqp93TTJ49OFXIvu3o7NwOyEQQW8mnNY8ffaFdWuGzOX4HkOoKHCMTD
rT0qT/2q+7LPK87YwTPIVtpGVltnCr/SVI/FtAGXPysyghu2Z+4GGP4eh2+pSAUj
7Dqxdw3q7zs8ou6LOThTyNJrR+N/w1JPloprBleJR3TNcJjOy/SBjfp7yL0uopIK
j5exr76ImVq7zzjyrYoa
=6M3e
-----END PGP SIGNATURE-----
Merge tag 'pm-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"This time the majority of changes go to the cpufreq subsystem (and to
the intel_pstate driver in particular) and there are some updates in
the generic power domains framework, cpuidle, tools and a couple of
other places.
One thing worth mentioning is that the intel_pstate's sysfs interface
has been reworked to be more consistent with the general expectations
of the cpufreq core and less confusing, hopefully for the better.
Also, we have a new cpufreq driver for Tegra186 and new hardware
support in intel_pstata and the Mediatek cpufreq driver.
Apart from that, the AnalyzeSuspend utility for system suspend
profiling gets a companion called AnalyzeBoot for the analogous
profiling of system boot and they both go into one place under
tools/power/pm-graph/.
The rest is mostly fixes, cleanups and code reorganization.
Specifics:
- Rework the intel_pstate driver's sysfs interface to make it more
straightforward and more intuitive (Rafael Wysocki).
- Make intel_pstate support all processors which advertise HWP
(hardware-managed P-states) to the kernel in all operation modes
and make it use the load-based P-state selection algorithm on a
wider range of systems in the active mode (Rafael Wysocki).
- Add cpufreq driver for Tegra186 (Mikko Perttunen).
- Add support for Gemini Lake SoCs to intel_pstate (David Box).
- Add support for MT8176 and MT817x to the Mediatek cpufreq driver
and clean up that driver a bit (Daniel Kurtz).
- Clean up intel_pstate and optimize it slightly (Rafael Wysocki).
- Update the schedutil cpufreq governor, mostly to fix a couple of
issues with it related to specific workloads, and rework its sysfs
tunable and initialization a bit (Rafael Wysocki, Viresh Kumar).
- Fix minor issues in the imx6q, dbx500 and qoriq cpufreq drivers
(Christophe Jaillet, Irina Tirdea, Leonard Crestez, Viresh Kumar,
YuanTian Tang).
- Add file patterns for cpufreq DT bindings to MAINTAINERS (Geert
Uytterhoeven).
- Add support for "always on" power domains to the genpd (generic
power domains) framework and clean up that code somewhat (Ulf
Hansson, Lina Iyer, Viresh Kumar).
- Fix minor issues in the powernv cpuidle driver and clean it up
(Anton Blanchard, Gautham Shenoy).
- Move the AnalyzeSuspend utility under tools/power/pm-graph/ and add
an analogous boot-profiling utility called AnalyzeBoot to it (Todd
Brandt).
- Add rk3328 support to the rockchip-io AVS (Adaptive Voltage
Scaling) driver (David Wu).
- Fix minor issues in the cpuidle core, the intel_pstate_tracer
utility, the devfreq framework and the PM core documentation
(Chanwoo Choi, Doug Smythies, Johan Hovold, Marcin Nowakowski)"
* tag 'pm-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (56 commits)
PM / runtime: Document autosuspend-helper side effects
PM / runtime: Fix autosuspend documentation
tools: power: pm-graph: Package makefile and man pages
tools: power: pm-graph: AnalyzeBoot v2.0
tools: power: pm-graph: AnalyzeSuspend v4.6
cpufreq: Add Tegra186 cpufreq driver
cpufreq: imx6q: Fix error handling code
cpufreq: imx6q: Set max suspend_freq to avoid changes during suspend
cpufreq: imx6q: Fix handling EPROBE_DEFER from regulator
cpuidle: powernv: Avoid a branch in the core snooze_loop() loop
cpuidle: powernv: Don't continually set thread priority in snooze_loop()
cpuidle: powernv: Don't bounce between low and very low thread priority
cpuidle: cpuidle-cps: remove unused variable
tools/power/x86/intel_pstate_tracer: Adjust directory ownership
cpufreq: schedutil: Use policy-dependent transition delays
cpufreq: schedutil: Reduce frequencies slower
PM / devfreq: Move struct devfreq_governor to devfreq directory
PM / Domains: Ignore domain-idle-states that are not compatible
cpufreq: intel_pstate: Add support for Gemini Lake
powernv-cpuidle: Validate DT property array size
...
Pull cgroup updates from Tejun Heo:
"Nothing major. Two notable fixes are Li's second stab at fixing the
long-standing race condition in the mount path and suppression of
spurious warning from cgroup_get(). All other changes are trivial"
* 'for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: mark cgroup_get() with __maybe_unused
cgroup: avoid attaching a cgroup root to two different superblocks, take 2
cgroup: fix spurious warnings on cgroup_is_dead() from cgroup_sk_alloc()
cgroup: move cgroup_subsys_state parent field for cache locality
cpuset: Remove cpuset_update_active_cpus()'s parameter.
cgroup: switch to BUG_ON()
cgroup: drop duplicate header nsproxy.h
kernel: convert css_set.refcount from atomic_t to refcount_t
kernel: convert cgroup_namespace.count from atomic_t to refcount_t
Pull workqueue update from Tejun Heo:
"One trivial patch to use setup_deferrable_timer() instead of
open-coding the initialization"
* 'for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: use setup_deferrable_timer
a590b90d47 ("cgroup: fix spurious warnings on cgroup_is_dead() from
cgroup_sk_alloc()") converted most cgroup_get() usages to
cgroup_get_live() leaving cgroup_sk_alloc() the sole user of
cgroup_get(). When !CONFIG_SOCK_CGROUP_DATA, this ends up triggering
unused warning for cgroup_get().
Silence the warning by adding __maybe_unused to cgroup_get().
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20170501145340.17e8ef86@canb.auug.org.au
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull block layer updates from Jens Axboe:
- Add BFQ IO scheduler under the new blk-mq scheduling framework. BFQ
was initially a fork of CFQ, but subsequently changed to implement
fairness based on B-WF2Q+, a modified variant of WF2Q. BFQ is meant
to be used on desktop type single drives, providing good fairness.
From Paolo.
- Add Kyber IO scheduler. This is a full multiqueue aware scheduler,
using a scalable token based algorithm that throttles IO based on
live completion IO stats, similary to blk-wbt. From Omar.
- A series from Jan, moving users to separately allocated backing
devices. This continues the work of separating backing device life
times, solving various problems with hot removal.
- A series of updates for lightnvm, mostly from Javier. Includes a
'pblk' target that exposes an open channel SSD as a physical block
device.
- A series of fixes and improvements for nbd from Josef.
- A series from Omar, removing queue sharing between devices on mostly
legacy drivers. This helps us clean up other bits, if we know that a
queue only has a single device backing. This has been overdue for
more than a decade.
- Fixes for the blk-stats, and improvements to unify the stats and user
windows. This both improves blk-wbt, and enables other users to
register a need to receive IO stats for a device. From Omar.
- blk-throttle improvements from Shaohua. This provides a scalable
framework for implementing scalable priotization - particularly for
blk-mq, but applicable to any type of block device. The interface is
marked experimental for now.
- Bucketized IO stats for IO polling from Stephen Bates. This improves
efficiency of polled workloads in the presence of mixed block size
IO.
- A few fixes for opal, from Scott.
- A few pulls for NVMe, including a lot of fixes for NVMe-over-fabrics.
From a variety of folks, mostly Sagi and James Smart.
- A series from Bart, improving our exposed info and capabilities from
the blk-mq debugfs support.
- A series from Christoph, cleaning up how handle WRITE_ZEROES.
- A series from Christoph, cleaning up the block layer handling of how
we track errors in a request. On top of being a nice cleanup, it also
shrinks the size of struct request a bit.
- Removal of mg_disk and hd (sorry Linus) by Christoph. The former was
never used by platforms, and the latter has outlived it's usefulness.
- Various little bug fixes and cleanups from a wide variety of folks.
* 'for-4.12/block' of git://git.kernel.dk/linux-block: (329 commits)
block: hide badblocks attribute by default
blk-mq: unify hctx delay_work and run_work
block: add kblock_mod_delayed_work_on()
blk-mq: unify hctx delayed_run_work and run_work
nbd: fix use after free on module unload
MAINTAINERS: bfq: Add Paolo as maintainer for the BFQ I/O scheduler
blk-mq-sched: alloate reserved tags out of normal pool
mtip32xx: use runtime tag to initialize command header
scsi: Implement blk_mq_ops.show_rq()
blk-mq: Add blk_mq_ops.show_rq()
blk-mq: Show operation, cmd_flags and rq_flags names
blk-mq: Make blk_flags_show() callers append a newline character
blk-mq: Move the "state" debugfs attribute one level down
blk-mq: Unregister debugfs attributes earlier
blk-mq: Only unregister hctxs for which registration succeeded
blk-mq-debugfs: Rename functions for registering and unregistering the mq directory
blk-mq: Let blk_mq_debugfs_register() look up the queue name
blk-mq: Register <dev>/queue/mq after having registered <dev>/queue
ide-pm: always pass 0 error to ide_complete_rq in ide_do_devset
ide-pm: always pass 0 error to __blk_end_request_all
..
llvm 4.0 and above generates the code like below:
....
440: (b7) r1 = 15
441: (05) goto pc+73
515: (79) r6 = *(u64 *)(r10 -152)
516: (bf) r7 = r10
517: (07) r7 += -112
518: (bf) r2 = r7
519: (0f) r2 += r1
520: (71) r1 = *(u8 *)(r8 +0)
521: (73) *(u8 *)(r2 +45) = r1
....
and the verifier complains "R2 invalid mem access 'inv'" for insn #521.
This is because verifier marks register r2 as unknown value after #519
where r2 is a stack pointer and r1 holds a constant value.
Teach verifier to recognize "stack_ptr + imm" and
"stack_ptr + reg with const val" as valid stack_ptr with new offset.
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The x86 conversion to the generic GUP code included a small change which causes
crashes and data corruption in the pmem code - not good.
The root cause is that the /dev/pmem driver code implicitly relies on the x86
get_user_pages() implementation doing a get_page() on the page refcount, because
get_page() does a get_zone_device_page() which properly refcounts pmem's separate
page struct arrays that are not present in the regular page struct structures.
(The pmem driver does this because it can cover huge memory areas.)
But the x86 conversion to the generic GUP code changed the get_page() to
page_cache_get_speculative() which is faster but doesn't do the
get_zone_device_page() call the pmem code relies on.
One way to solve the regression would be to change the generic GUP code to use
get_page(), but that would slow things down a bit and punish other generic-GUP
using architectures for an x86-ism they did not care about. (Arguably the pmem
driver was probably not working reliably for them: but nvdimm is an Intel
feature, so non-x86 exposure is probably still limited.)
So restructure the pmem code's interface with the MM instead: get rid of the
get/put_zone_device_page() distinction, integrate put_zone_device_page() into
__put_page() and and restructure the pmem completion-wait and teardown machinery:
Kirill points out that the calls to {get,put}_dev_pagemap() can be
removed from the mm fast path if we take a single get_dev_pagemap()
reference to signify that the page is alive and use the final put of the
page to drop that reference.
This does require some care to make sure that any waits for the
percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
since it now maintains its own elevated reference.
This speeds up things while also making the pmem refcounting more robust going
forward.
Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/149339998297.24933.1129582806028305912.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit bfb0b80db5 ("cgroup: avoid attaching a cgroup root to two
different superblocks") is broken. Now we try to fix the race by
delaying the initialization of cgroup root refcnt until a superblock
has been allocated.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Andrei Vagin <avagin@virtuozzo.com>
Tested-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Hannes rightfully spotted that the bpf_lock doesn't need to be
irqsave variant. We never perform any such updates where this
would be necessary (neither right now nor in future), therefore
relax this further.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
cgroup_get() expected to be called only on live cgroups and triggers
warning on a dead cgroup; however, cgroup_sk_alloc() may be called
while cloning a socket which is left in an empty and removed cgroup
and thus may legitimately duplicate its reference on a dead cgroup.
This currently triggers the following warning spuriously.
WARNING: CPU: 14 PID: 0 at kernel/cgroup.c:490 cgroup_get+0x55/0x60
...
[<ffffffff8107e123>] __warn+0xd3/0xf0
[<ffffffff8107e20e>] warn_slowpath_null+0x1e/0x20
[<ffffffff810ff465>] cgroup_get+0x55/0x60
[<ffffffff81106061>] cgroup_sk_alloc+0x51/0xe0
[<ffffffff81761beb>] sk_clone_lock+0x2db/0x390
[<ffffffff817cce06>] inet_csk_clone_lock+0x16/0xc0
[<ffffffff817e8173>] tcp_create_openreq_child+0x23/0x4b0
[<ffffffff818601a1>] tcp_v6_syn_recv_sock+0x91/0x670
[<ffffffff817e8b16>] tcp_check_req+0x3a6/0x4e0
[<ffffffff81861ba3>] tcp_v6_rcv+0x693/0xa00
[<ffffffff81837429>] ip6_input_finish+0x59/0x3e0
[<ffffffff81837cb2>] ip6_input+0x32/0xb0
[<ffffffff81837387>] ip6_rcv_finish+0x57/0xa0
[<ffffffff81837ac8>] ipv6_rcv+0x318/0x4d0
[<ffffffff817778c7>] __netif_receive_skb_core+0x2d7/0x9a0
[<ffffffff81777fa6>] __netif_receive_skb+0x16/0x70
[<ffffffff81778023>] netif_receive_skb_internal+0x23/0x80
[<ffffffff817787d8>] napi_gro_frags+0x208/0x270
[<ffffffff8168a9ec>] mlx4_en_process_rx_cq+0x74c/0xf40
[<ffffffff8168b270>] mlx4_en_poll_rx_cq+0x30/0x90
[<ffffffff81778b30>] net_rx_action+0x210/0x350
[<ffffffff8188c426>] __do_softirq+0x106/0x2c7
[<ffffffff81082bad>] irq_exit+0x9d/0xa0 [<ffffffff8188c0e4>] do_IRQ+0x54/0xd0
[<ffffffff8188a63f>] common_interrupt+0x7f/0x7f <EOI>
[<ffffffff8173d7e7>] cpuidle_enter+0x17/0x20
[<ffffffff810bdfd9>] cpu_startup_entry+0x2a9/0x2f0
[<ffffffff8103edd1>] start_secondary+0xf1/0x100
This patch renames the existing cgroup_get() with the dead cgroup
warning to cgroup_get_live() after cgroup_kn_lock_live() and
introduces the new cgroup_get() which doesn't check whether the cgroup
is live or dead.
All existing cgroup_get() users except for cgroup_sk_alloc() are
converted to use cgroup_get_live().
Fixes: d979a39d72 ("cgroup: duplicate cgroup reference when cloning sockets")
Cc: stable@vger.kernel.org # v4.5+
Cc: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Chris Mason <clm@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
irq_time_read() returns the irqtime minus the ksoftirqd time. This
is necessary because irq_time_read() is used to substract the IRQ time
from the sum_exec_runtime of a task. If we were to include the softirq
time of ksoftirqd, this task would substract its own CPU time everytime
it updates ksoftirqd->sum_exec_runtime which would therefore never
progress.
But this behaviour got broken by:
a499a5a14d ("sched/cputime: Increment kcpustat directly on irqtime account")
... which now includes ksoftirqd softirq time in the time returned by
irq_time_read().
This has resulted in wrong ksoftirqd cputime reported to userspace
through /proc/stat and thus "top" not showing ksoftirqd when it should
after intense networking load.
ksoftirqd->stime happens to be correct but it gets scaled down by
sum_exec_runtime through task_cputime_adjusted().
To fix this, just account the strict IRQ time in a separate counter and
use it to report the IRQ time.
Reported-and-tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1493129448-5356-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When iterating through a map, we need to find a key that does not exist
in the map so map_get_next_key will give us the first key of the map.
This often requires a lot of guessing in production systems.
This patch makes map_get_next_key return the first key when the key
pointer in the parameter is NULL.
Signed-off-by: Teng Qin <qinteng@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that also the last in-tree user of the xdp_adjust_head bit has
been removed, we can remove the flag from struct bpf_prog altogether.
This, at the same time, also makes sure that any future driver for
XDP comes with bpf_xdp_adjust_head() support right away.
A rejection based on this flag would also mean that tail calls
couldn't be used with such driver as per c2002f9837 ("bpf: fix
checking xdp_adjust_head on tail calls") fix, thus lets not allow
for it in the first place.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull irq fix from Thomas Gleixner:
"The (hopefully) final fix for the irq affinity spreading logic"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/affinity: Fix calculating vectors to assign
Both conflict were simple overlapping changes.
In the kaweth case, Eric Dumazet's skb_cow() bug fix overlapped the
conversion of the driver in net-next to use in-netdev stats.
Signed-off-by: David S. Miller <davem@davemloft.net>
We are not supposed to add new entries to this thing
any more.
Thanks to Eric Dumazet for noticing this.
Signed-off-by: David S. Miller <davem@davemloft.net>
Constants used for tuning are generally a bad idea, especially as hardware
changes over time. Replace the constant 2 jiffies with sysctl variable
netdev_budget_usecs to enable sysadmins to tune the softirq processing.
Also document the variable.
For example, a very fast machine might tune this to 1000 microseconds,
while my regression testing 486DX-25 needs it to be 4000 microseconds on
a nearly idle network to prevent time_squeeze from being incremented.
Version 2: changed jiffies to microseconds for predictable units.
Signed-off-by: Matthew Whitehead <tedheadster@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Per Dan's static checker warning, the code that returns NULL was removed
in 2010, so this patch updates the comments and fixes the code
assumptions.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
One is a race condition when enabling the snapshot function probe
trigger. It enables the probe before allocating the snapshot, and
if the probe triggers first, it stops tracing with a warning that
the snapshot buffer was not allocated.
The seconds is that the snapshot file should show how to use it when
it is empty. But a bug fix from long ago broke the "is empty" test
and the snapshot file no longer displays the help message.
-----BEGIN PGP SIGNATURE-----
iQExBAABCAAbBQJY+L3dFBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
DyQH/j4ZoRhc+XziMw7iJxNvDfptT9XFawqTKDdYJ3nMsFp+40bzlfYah94b1YYQ
YTLnvlxtiYUo1rifOnsdY913IKLc1wtO/a/S8/qqUJ1+7ik46zgaPYqNQlvM6clV
xoJQ6+c631SbJ3KuhadvXTABvzF4Qc1w0/f81lzGgYE8IB2VxiWeYZDMVxe5r2oM
A0seve9C5Wps39m/kcFHSVMZwpk6s7gZL7ERcME4dOewJpQ7b0ufWXMsBssD0bMx
G0ihBdfeM6TzXSTtrnLzU9eZaUtfh37olpvjpJzdIUUqwVpSrxOKmLcsYCIeNs3f
YuS54g7kEsDqLxGJvkC0UBou2rU=
=DQC3
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.11-rc5-5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull two more ftrace fixes from Steven Rostedt:
"While continuing my development, I uncovered two more small bugs.
One is a race condition when enabling the snapshot function probe
trigger. It enables the probe before allocating the snapshot, and if
the probe triggers first, it stops tracing with a warning that the
snapshot buffer was not allocated.
The seconds is that the snapshot file should show how to use it when
it is empty. But a bug fix from long ago broke the "is empty" test and
the snapshot file no longer displays the help message"
* tag 'trace-v4.11-rc5-5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ring-buffer: Have ring_buffer_iter_empty() return true when empty
tracing: Allocate the snapshot buffer before enabling probe
A function in kernel/bpf/syscall.c which got a bug fix in 'net'
was moved to kernel/bpf/verifier.c in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
The vectors_per_node is calculated from the remaining available vectors.
The current vector starts after pre_vectors, so we need to subtract that
from the current to properly account for the number of remaining vectors
to assign.
Fixes: 3412386b53 ("irq/affinity: Fix extra vecs calculation")
Reported-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Link: http://lkml.kernel.org/r/1492645870-13019-1-git-send-email-keith.busch@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
timer_migration sysctl acts as a boolean switch, so the allowed values
should be restricted to 0 and 1.
Add the necessary extra fields to the sysctl table entry to enforce that.
[ tglx: Rewrote changelog ]
Signed-off-by: Myungho Jung <mhjungk@gmail.com>
Link: http://lkml.kernel.org/r/1492640690-3550-1-git-send-email-mhjungk@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
I noticed that reading the snapshot file when it is empty no longer gives a
status. It suppose to show the status of the snapshot buffer as well as how
to allocate and use it. For example:
># cat snapshot
# tracer: nop
#
#
# * Snapshot is allocated *
#
# Snapshot commands:
# echo 0 > snapshot : Clears and frees snapshot buffer
# echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.
# Takes a snapshot of the main buffer.
# echo 2 > snapshot : Clears snapshot buffer (but does not allocate or free)
# (Doesn't have to be '2' works with any number that
# is not a '0' or '1')
But instead it just showed an empty buffer:
># cat snapshot
# tracer: nop
#
# entries-in-buffer/entries-written: 0/0 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
What happened was that it was using the ring_buffer_iter_empty() function to
see if it was empty, and if it was, it showed the status. But that function
was returning false when it was empty. The reason was that the iter header
page was on the reader page, and the reader page was empty, but so was the
buffer itself. The check only tested to see if the iter was on the commit
page, but the commit page was no longer pointing to the reader page, but as
all pages were empty, the buffer is also.
Cc: stable@vger.kernel.org
Fixes: 651e22f270 ("ring-buffer: Always reset iterator to reader page")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Currently the snapshot trigger enables the probe and then allocates the
snapshot. If the probe triggers before the allocation, it could cause the
snapshot to fail and turn tracing off. It's best to allocate the snapshot
buffer first, and then enable the trigger. If something goes wrong in the
enabling of the trigger, the snapshot buffer is still allocated, but it can
also be freed by the user by writting zero into the snapshot buffer file.
Also add a check of the return status of alloc_snapshot().
Cc: stable@vger.kernel.org
Fixes: 77fd5c15e3 ("tracing: Add snapshot trigger to function probes")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pull sparc fixes from David Miller:
"Two Sparc bug fixes from Daniel Jordan and Nitin Gupta"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
sparc64: Fix hugepage page table free
sparc64: Use LOCKDEP_SMALL, not PROVE_LOCKING_SMALL
Pull networking fixes from David Miller:
1) BPF tail call handling bug fixes from Daniel Borkmann.
2) Fix allowance of too many rx queues in sfc driver, from Bert
Kenward.
3) Non-loopback ipv6 packets claiming src of ::1 should be dropped,
from Florian Westphal.
4) Statistics requests on KSZ9031 can crash, fix from Grygorii
Strashko.
5) TX ring handling fixes in mediatek driver, from Sean Wang.
6) ip_ra_control can deadlock, fix lock acquisition ordering to fix,
from Cong WANG.
7) Fix use after free in ip_recv_error(), from Willem de Buijn.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
bpf: fix checking xdp_adjust_head on tail calls
bpf: fix cb access in socket filter programs on tail calls
ipv6: drop non loopback packets claiming to originate from ::1
net: ethernet: mediatek: fix inconsistency of port number carried in TXD
net: ethernet: mediatek: fix inconsistency between TXD and the used buffer
net: phy: micrel: fix crash when statistic requested for KSZ9031 phy
net: vrf: Fix setting NLM_F_EXCL flag when adding l3mdev rule
net: thunderx: Fix set_max_bgx_per_node for 81xx rgx
net-timestamp: avoid use-after-free in ip_recv_error
ipv4: fix a deadlock in ip_ra_control
sfc: limit the number of receive queues
CONFIG_PROVE_LOCKING_SMALL shrinks the memory usage of lockdep so the
kernel text, data, and bss fit in the required 32MB limit, but this
option is not set for every config that enables lockdep.
A 4.10 kernel fails to boot with the console output
Kernel: Using 8 locked TLB entries for main kernel image.
hypervisor_tlb_lock[2000000:0:8000000071c007c3:1]: errors with f
Program terminated
with these config options
CONFIG_LOCKDEP=y
CONFIG_LOCK_STAT=y
CONFIG_PROVE_LOCKING=n
To fix, rename CONFIG_PROVE_LOCKING_SMALL to CONFIG_LOCKDEP_SMALL, and
enable this option with CONFIG_LOCKDEP=y so we get the reduced memory
usage every time lockdep is turned on.
Tested that CONFIG_LOCKDEP_SMALL is set to 'y' if and only if
CONFIG_LOCKDEP is set to 'y'. When other lockdep-related config options
that select CONFIG_LOCKDEP are enabled (e.g. CONFIG_LOCK_STAT or
CONFIG_PROVE_LOCKING), verified that CONFIG_LOCKDEP_SMALL is also
enabled.
Fixes: e6b5f1be7a ("config: Adding the new config parameter CONFIG_PROVE_LOCKING_SMALL for sparc")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Babu Moger <babu.moger@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a pid filter to function tracing in an instance, and then freeing
the instance.
-----BEGIN PGP SIGNATURE-----
iQExBAABCAAbBQJY9hO7FBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
qBgIAJv+IH1zQTHqFn4gOtIkHJ0kxjTr9mzz4S5SgnHDMaCKOHTpuste02RmCvfo
J+6F//bw3eM9CpEcQg/t41aFagXs+g3x1HmD0PN7Y1fKHXQ5xDdpjPpOsgprrx7q
dvGLg4bolv6KaNMTJmJ8LhwPXJGMEqnbY6Ypz3qbnsziSeXe1zcrQKNA88ySJoh0
V6QV9XPWNkPO4AknnqD88oZvJhz/H/fQuJYQZNBoTomD6SG3f7mYW1bxyoWc08yW
W+Rg/YddGHk6Mmkqy0BaCPBjKjGiq20h9DOvLU6CFR0Gt4ZQ7sVZczYN4NkjEn7H
qdFcqaHNSkjxs0JFvbWToIu4D8w=
=Gv/C
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.11-rc5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull ftrace fix from Steven Rostedt:
"Namhyung Kim discovered a use after free bug. It has to do with adding
a pid filter to function tracing in an instance, and then freeing the
instance"
* tag 'trace-v4.11-rc5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Fix function pid filter on instances
When function tracer has a pid filter, it adds a probe to sched_switch
to track if current task can be ignored. The probe checks the
ftrace_ignore_pid from current tr to filter tasks. But it misses to
delete the probe when removing an instance so that it can cause a crash
due to the invalid tr pointer (use-after-free).
This is easily reproducible with the following:
# cd /sys/kernel/debug/tracing
# mkdir instances/buggy
# echo $$ > instances/buggy/set_ftrace_pid
# rmdir instances/buggy
============================================================================
BUG: KASAN: use-after-free in ftrace_filter_pid_sched_switch_probe+0x3d/0x90
Read of size 8 by task kworker/0:1/17
CPU: 0 PID: 17 Comm: kworker/0:1 Tainted: G B 4.11.0-rc3 #198
Call Trace:
dump_stack+0x68/0x9f
kasan_object_err+0x21/0x70
kasan_report.part.1+0x22b/0x500
? ftrace_filter_pid_sched_switch_probe+0x3d/0x90
kasan_report+0x25/0x30
__asan_load8+0x5e/0x70
ftrace_filter_pid_sched_switch_probe+0x3d/0x90
? fpid_start+0x130/0x130
__schedule+0x571/0xce0
...
To fix it, use ftrace_clear_pids() to unregister the probe. As
instance_rmdir() already updated ftrace codes, it can just free the
filter safely.
Link: http://lkml.kernel.org/r/20170417024430.21194-2-namhyung@kernel.org
Fixes: 0c8916c342 ("tracing: Add rmdir to remove multibuffer instances")
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Commit 17bedab272 ("bpf: xdp: Allow head adjustment in XDP prog")
added the xdp_adjust_head bit to the BPF prog in order to tell drivers
that the program that is to be attached requires support for the XDP
bpf_xdp_adjust_head() helper such that drivers not supporting this
helper can reject the program. There are also drivers that do support
the helper, but need to check for xdp_adjust_head bit in order to move
packet metadata prepended by the firmware away for making headroom.
For these cases, the current check for xdp_adjust_head bit is insufficient
since there can be cases where the program itself does not use the
bpf_xdp_adjust_head() helper, but tail calls into another program that
uses bpf_xdp_adjust_head(). As such, the xdp_adjust_head bit is still
set to 0. Since the first program has no control over which program it
calls into, we need to assume that bpf_xdp_adjust_head() helper is used
upon tail calls. Thus, for the very same reasons in cb_access, set the
xdp_adjust_head bit to 1 when the main program uses tail calls.
Fixes: 17bedab272 ("bpf: xdp: Allow head adjustment in XDP prog")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>