Commit Graph

24582 Commits

Author SHA1 Message Date
Linus Torvalds 5db6db0d40 Merge branch 'work.uaccess' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull uaccess unification updates from Al Viro:
 "This is the uaccess unification pile. It's _not_ the end of uaccess
  work, but the next batch of that will go into the next cycle. This one
  mostly takes copy_from_user() and friends out of arch/* and gets the
  zero-padding behaviour in sync for all architectures.

  Dealing with the nocache/writethrough mess is for the next cycle;
  fortunately, that's x86-only. Same for cleanups in iov_iter.c (I am
  sold on access_ok() in there, BTW; just not in this pile), same for
  reducing __copy_... callsites, strn*... stuff, etc. - there will be a
  pile about as large as this one in the next merge window.

  This one sat in -next for weeks. -3KLoC"

* 'work.uaccess' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (96 commits)
  HAVE_ARCH_HARDENED_USERCOPY is unconditional now
  CONFIG_ARCH_HAS_RAW_COPY_USER is unconditional now
  m32r: switch to RAW_COPY_USER
  hexagon: switch to RAW_COPY_USER
  microblaze: switch to RAW_COPY_USER
  get rid of padding, switch to RAW_COPY_USER
  ia64: get rid of copy_in_user()
  ia64: sanitize __access_ok()
  ia64: get rid of 'segment' argument of __do_{get,put}_user()
  ia64: get rid of 'segment' argument of __{get,put}_user_check()
  ia64: add extable.h
  powerpc: get rid of zeroing, switch to RAW_COPY_USER
  esas2r: don't open-code memdup_user()
  alpha: fix stack smashing in old_adjtimex(2)
  don't open-code kernel_setsockopt()
  mips: switch to RAW_COPY_USER
  mips: get rid of tail-zeroing in primitives
  mips: make copy_from_user() zero tail explicitly
  mips: clean and reorder the forest of macros...
  mips: consolidate __invoke_... wrappers
  ...
2017-05-01 14:41:04 -07:00
Linus Torvalds 0e285e9088 Power management updates for v4.12-rc1
- Rework the intel_pstate driver's sysfs interface to make it
    more straightforward and more intuitive (Rafael Wysocki).
 
  - Make intel_pstate support all processors which advertise HWP
    (hardware-managed P-states) to the kernel in all operation modes
    and make it use the load-based P-state selection algorithm on a
    wider range of systems in the active mode (Rafael Wysocki).
 
  - Add cpufreq driver for Tegra186 (Mikko Perttunen).
 
  - Add support for Gemini Lake SoCs to intel_pstate (David Box).
 
  - Add support for MT8176 and MT817x to the Mediatek cpufreq driver
    and clean up that driver a bit (Daniel Kurtz).
 
  - Clean up intel_pstate and optimize it slightly (Rafael Wysocki).
 
  - Update the schedutil cpufreq governor, mostly to fix a couple of
    issues with it related to specific workloads, and rework its sysfs
    tunable and initialization a bit (Rafael Wysocki, Viresh Kumar).
 
  - Fix minor issues in the imx6q, dbx500 and qoriq cpufreq drivers
    (Christophe Jaillet, Irina Tirdea, Leonard Crestez, Viresh Kumar,
    YuanTian Tang).
 
  - Add file patterns for cpufreq DT bindings to MAINTAINERS (Geert
    Uytterhoeven).
 
  - Add support for "always on" power domains to the genpd (generic
    power domains) framework and clean up that code somewhat (Ulf
    Hansson, Lina Iyer, Viresh Kumar).
 
  - Fix minor issues in the powernv cpuidle driver and clean it up
    (Anton Blanchard, Gautham Shenoy).
 
  - Move the AnalyzeSuspend utility under tools/power/pm-graph/ and
    add an analogous boot-profiling utility called AnalyzeBoot to it
    (Todd Brandt).
 
  - Add rk3328 support to the rockchip-io AVS (Adaptive Voltage
    Scaling) driver (David Wu).
 
  - Fix minor issues in the cpuidle core, the intel_pstate_tracer
    utility, the devfreq framework and the PM core documentation
    (Chanwoo Choi, Doug Smythies, Johan Hovold, Marcin Nowakowski).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJZB5gGAAoJEILEb/54YlRx+78QAJRu+xAA9KtW2+loNyV8iBOB
 EFmQLrvz9jCDyYWsHE5huA1k6EVu5QE74HBfgDn4od9s1VqU1zWdEjqKYiaMwlCt
 EHxYCZ4YKeF31O3P3CtearBz9IXrckRx/XZ3F1jRsGGWooWv7o3U6PPN9iREmCzi
 9dB2j0UD4lCwrnpsDMrJ0GqLu4agn9pcIKDtu4VfszVwYtza0vOQvvlgg1fQS1jX
 BnNfaxN0lmpSlxDjtWfM//hfLzEWK8NlHiKWJFPnWFxJIAX1QL2QKnznF/Tqi5N5
 el9tQXCTRujlD7BLyl6FdsaowbiUUEcMqeh2k01Vz20WSmNDHIpfrV9MtzJ7biUK
 /DopyShUrpgJclKNF7BARJAJc19+PMkv3HMnOnsnhOsBNBpJjsL6FZPA95MMjI0o
 xmRQkixl31NWIMk60EIC6PaMLsxjhAWjiYi5D93bzkhnJTnQswvNb4ROPG+X2FCU
 6YgBogsYKkqp93TTJ49OFXIvu3o7NwOyEQQW8mnNY8ffaFdWuGzOX4HkOoKHCMTD
 rT0qT/2q+7LPK87YwTPIVtpGVltnCr/SVI/FtAGXPysyghu2Z+4GGP4eh2+pSAUj
 7Dqxdw3q7zs8ou6LOThTyNJrR+N/w1JPloprBleJR3TNcJjOy/SBjfp7yL0uopIK
 j5exr76ImVq7zzjyrYoa
 =6M3e
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "This time the majority of changes go to the cpufreq subsystem (and to
  the intel_pstate driver in particular) and there are some updates in
  the generic power domains framework, cpuidle, tools and a couple of
  other places.

  One thing worth mentioning is that the intel_pstate's sysfs interface
  has been reworked to be more consistent with the general expectations
  of the cpufreq core and less confusing, hopefully for the better.
  Also, we have a new cpufreq driver for Tegra186 and new hardware
  support in intel_pstata and the Mediatek cpufreq driver.

  Apart from that, the AnalyzeSuspend utility for system suspend
  profiling gets a companion called AnalyzeBoot for the analogous
  profiling of system boot and they both go into one place under
  tools/power/pm-graph/.

  The rest is mostly fixes, cleanups and code reorganization.

  Specifics:

   - Rework the intel_pstate driver's sysfs interface to make it more
     straightforward and more intuitive (Rafael Wysocki).

   - Make intel_pstate support all processors which advertise HWP
     (hardware-managed P-states) to the kernel in all operation modes
     and make it use the load-based P-state selection algorithm on a
     wider range of systems in the active mode (Rafael Wysocki).

   - Add cpufreq driver for Tegra186 (Mikko Perttunen).

   - Add support for Gemini Lake SoCs to intel_pstate (David Box).

   - Add support for MT8176 and MT817x to the Mediatek cpufreq driver
     and clean up that driver a bit (Daniel Kurtz).

   - Clean up intel_pstate and optimize it slightly (Rafael Wysocki).

   - Update the schedutil cpufreq governor, mostly to fix a couple of
     issues with it related to specific workloads, and rework its sysfs
     tunable and initialization a bit (Rafael Wysocki, Viresh Kumar).

   - Fix minor issues in the imx6q, dbx500 and qoriq cpufreq drivers
     (Christophe Jaillet, Irina Tirdea, Leonard Crestez, Viresh Kumar,
     YuanTian Tang).

   - Add file patterns for cpufreq DT bindings to MAINTAINERS (Geert
     Uytterhoeven).

   - Add support for "always on" power domains to the genpd (generic
     power domains) framework and clean up that code somewhat (Ulf
     Hansson, Lina Iyer, Viresh Kumar).

   - Fix minor issues in the powernv cpuidle driver and clean it up
     (Anton Blanchard, Gautham Shenoy).

   - Move the AnalyzeSuspend utility under tools/power/pm-graph/ and add
     an analogous boot-profiling utility called AnalyzeBoot to it (Todd
     Brandt).

   - Add rk3328 support to the rockchip-io AVS (Adaptive Voltage
     Scaling) driver (David Wu).

   - Fix minor issues in the cpuidle core, the intel_pstate_tracer
     utility, the devfreq framework and the PM core documentation
     (Chanwoo Choi, Doug Smythies, Johan Hovold, Marcin Nowakowski)"

* tag 'pm-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (56 commits)
  PM / runtime: Document autosuspend-helper side effects
  PM / runtime: Fix autosuspend documentation
  tools: power: pm-graph: Package makefile and man pages
  tools: power: pm-graph: AnalyzeBoot v2.0
  tools: power: pm-graph: AnalyzeSuspend v4.6
  cpufreq: Add Tegra186 cpufreq driver
  cpufreq: imx6q: Fix error handling code
  cpufreq: imx6q: Set max suspend_freq to avoid changes during suspend
  cpufreq: imx6q: Fix handling EPROBE_DEFER from regulator
  cpuidle: powernv: Avoid a branch in the core snooze_loop() loop
  cpuidle: powernv: Don't continually set thread priority in snooze_loop()
  cpuidle: powernv: Don't bounce between low and very low thread priority
  cpuidle: cpuidle-cps: remove unused variable
  tools/power/x86/intel_pstate_tracer: Adjust directory ownership
  cpufreq: schedutil: Use policy-dependent transition delays
  cpufreq: schedutil: Reduce frequencies slower
  PM / devfreq: Move struct devfreq_governor to devfreq directory
  PM / Domains: Ignore domain-idle-states that are not compatible
  cpufreq: intel_pstate: Add support for Gemini Lake
  powernv-cpuidle: Validate DT property array size
  ...
2017-05-01 14:09:46 -07:00
Linus Torvalds 9410091dd5 Merge branch 'for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "Nothing major. Two notable fixes are Li's second stab at fixing the
  long-standing race condition in the mount path and suppression of
  spurious warning from cgroup_get(). All other changes are trivial"

* 'for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: mark cgroup_get() with __maybe_unused
  cgroup: avoid attaching a cgroup root to two different superblocks, take 2
  cgroup: fix spurious warnings on cgroup_is_dead() from cgroup_sk_alloc()
  cgroup: move cgroup_subsys_state parent field for cache locality
  cpuset: Remove cpuset_update_active_cpus()'s parameter.
  cgroup: switch to BUG_ON()
  cgroup: drop duplicate header nsproxy.h
  kernel: convert css_set.refcount from atomic_t to refcount_t
  kernel: convert cgroup_namespace.count from atomic_t to refcount_t
2017-05-01 13:52:24 -07:00
Linus Torvalds ad1490bcd2 Merge branch 'for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue update from Tejun Heo:
 "One trivial patch to use setup_deferrable_timer() instead of
  open-coding the initialization"

* 'for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: use setup_deferrable_timer
2017-05-01 13:49:27 -07:00
Jiri Kosina a0841609f6 Merge branches 'for-4.12/upstream' and 'for-4.12/klp-hybrid-consistency-model' into for-linus 2017-05-01 21:49:28 +02:00
Tejun Heo 310b4816a5 cgroup: mark cgroup_get() with __maybe_unused
a590b90d47 ("cgroup: fix spurious warnings on cgroup_is_dead() from
cgroup_sk_alloc()") converted most cgroup_get() usages to
cgroup_get_live() leaving cgroup_sk_alloc() the sole user of
cgroup_get().  When !CONFIG_SOCK_CGROUP_DATA, this ends up triggering
unused warning for cgroup_get().

Silence the warning by adding __maybe_unused to cgroup_get().

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20170501145340.17e8ef86@canb.auug.org.au
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-05-01 15:24:14 -04:00
Linus Torvalds 694752922b Merge branch 'for-4.12/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:

 - Add BFQ IO scheduler under the new blk-mq scheduling framework. BFQ
   was initially a fork of CFQ, but subsequently changed to implement
   fairness based on B-WF2Q+, a modified variant of WF2Q. BFQ is meant
   to be used on desktop type single drives, providing good fairness.
   From Paolo.

 - Add Kyber IO scheduler. This is a full multiqueue aware scheduler,
   using a scalable token based algorithm that throttles IO based on
   live completion IO stats, similary to blk-wbt. From Omar.

 - A series from Jan, moving users to separately allocated backing
   devices. This continues the work of separating backing device life
   times, solving various problems with hot removal.

 - A series of updates for lightnvm, mostly from Javier. Includes a
   'pblk' target that exposes an open channel SSD as a physical block
   device.

 - A series of fixes and improvements for nbd from Josef.

 - A series from Omar, removing queue sharing between devices on mostly
   legacy drivers. This helps us clean up other bits, if we know that a
   queue only has a single device backing. This has been overdue for
   more than a decade.

 - Fixes for the blk-stats, and improvements to unify the stats and user
   windows. This both improves blk-wbt, and enables other users to
   register a need to receive IO stats for a device. From Omar.

 - blk-throttle improvements from Shaohua. This provides a scalable
   framework for implementing scalable priotization - particularly for
   blk-mq, but applicable to any type of block device. The interface is
   marked experimental for now.

 - Bucketized IO stats for IO polling from Stephen Bates. This improves
   efficiency of polled workloads in the presence of mixed block size
   IO.

 - A few fixes for opal, from Scott.

 - A few pulls for NVMe, including a lot of fixes for NVMe-over-fabrics.
   From a variety of folks, mostly Sagi and James Smart.

 - A series from Bart, improving our exposed info and capabilities from
   the blk-mq debugfs support.

 - A series from Christoph, cleaning up how handle WRITE_ZEROES.

 - A series from Christoph, cleaning up the block layer handling of how
   we track errors in a request. On top of being a nice cleanup, it also
   shrinks the size of struct request a bit.

 - Removal of mg_disk and hd (sorry Linus) by Christoph. The former was
   never used by platforms, and the latter has outlived it's usefulness.

 - Various little bug fixes and cleanups from a wide variety of folks.

* 'for-4.12/block' of git://git.kernel.dk/linux-block: (329 commits)
  block: hide badblocks attribute by default
  blk-mq: unify hctx delay_work and run_work
  block: add kblock_mod_delayed_work_on()
  blk-mq: unify hctx delayed_run_work and run_work
  nbd: fix use after free on module unload
  MAINTAINERS: bfq: Add Paolo as maintainer for the BFQ I/O scheduler
  blk-mq-sched: alloate reserved tags out of normal pool
  mtip32xx: use runtime tag to initialize command header
  scsi: Implement blk_mq_ops.show_rq()
  blk-mq: Add blk_mq_ops.show_rq()
  blk-mq: Show operation, cmd_flags and rq_flags names
  blk-mq: Make blk_flags_show() callers append a newline character
  blk-mq: Move the "state" debugfs attribute one level down
  blk-mq: Unregister debugfs attributes earlier
  blk-mq: Only unregister hctxs for which registration succeeded
  blk-mq-debugfs: Rename functions for registering and unregistering the mq directory
  blk-mq: Let blk_mq_debugfs_register() look up the queue name
  blk-mq: Register <dev>/queue/mq after having registered <dev>/queue
  ide-pm: always pass 0 error to ide_complete_rq in ide_do_devset
  ide-pm: always pass 0 error to __blk_end_request_all
  ..
2017-05-01 10:39:57 -07:00
Yonghong Song 332270fdc8 bpf: enhance verifier to understand stack pointer arithmetic
llvm 4.0 and above generates the code like below:
....
440: (b7) r1 = 15
441: (05) goto pc+73
515: (79) r6 = *(u64 *)(r10 -152)
516: (bf) r7 = r10
517: (07) r7 += -112
518: (bf) r2 = r7
519: (0f) r2 += r1
520: (71) r1 = *(u8 *)(r8 +0)
521: (73) *(u8 *)(r2 +45) = r1
....
and the verifier complains "R2 invalid mem access 'inv'" for insn #521.
This is because verifier marks register r2 as unknown value after #519
where r2 is a stack pointer and r1 holds a constant value.

Teach verifier to recognize "stack_ptr + imm" and
"stack_ptr + reg with const val" as valid stack_ptr with new offset.

Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-01 11:40:23 -04:00
Steven Rostedt (VMware) 73a757e631 ring-buffer: Return reader page back into existing ring buffer
When reading the ring buffer for consuming, it is optimized for splice,
where a page is taken out of the ring buffer (zero copy) and sent to the
reading consumer. When the read is finished with the page, it calls
ring_buffer_free_read_page(), which simply frees the page. The next time the
reader needs to get a page from the ring buffer, it must call
ring_buffer_alloc_read_page() which allocates and initializes a reader page
for the ring buffer to be swapped into the ring buffer for a new filled page
for the reader.

The problem is that there's no reason to actually free the page when it is
passed back to the ring buffer. It can hold it off and reuse it for the next
iteration. This completely removes the interaction with the page_alloc
mechanism.

Using the trace-cmd utility to record all events (causing trace-cmd to
require reading lots of pages from the ring buffer, and calling
ring_buffer_alloc/free_read_page() several times), and also assigning a
stack trace trigger to the mm_page_alloc event, we can see how many times
the ring_buffer_alloc_read_page() needed to allocate a page for the ring
buffer.

Before this change:

  # trace-cmd record -e all -e mem_page_alloc -R stacktrace sleep 1
  # trace-cmd report |grep ring_buffer_alloc_read_page | wc -l
  9968

After this change:

  # trace-cmd record -e all -e mem_page_alloc -R stacktrace sleep 1
  # trace-cmd report |grep ring_buffer_alloc_read_page | wc -l
  4

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-05-01 10:26:40 -04:00
Dan Williams 7138970383 mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash
The x86 conversion to the generic GUP code included a small change which causes
crashes and data corruption in the pmem code - not good.

The root cause is that the /dev/pmem driver code implicitly relies on the x86
get_user_pages() implementation doing a get_page() on the page refcount, because
get_page() does a get_zone_device_page() which properly refcounts pmem's separate
page struct arrays that are not present in the regular page struct structures.
(The pmem driver does this because it can cover huge memory areas.)

But the x86 conversion to the generic GUP code changed the get_page() to
page_cache_get_speculative() which is faster but doesn't do the
get_zone_device_page() call the pmem code relies on.

One way to solve the regression would be to change the generic GUP code to use
get_page(), but that would slow things down a bit and punish other generic-GUP
using architectures for an x86-ism they did not care about. (Arguably the pmem
driver was probably not working reliably for them: but nvdimm is an Intel
feature, so non-x86 exposure is probably still limited.)

So restructure the pmem code's interface with the MM instead: get rid of the
get/put_zone_device_page() distinction, integrate put_zone_device_page() into
__put_page() and and restructure the pmem completion-wait and teardown machinery:

Kirill points out that the calls to {get,put}_dev_pagemap() can be
removed from the mm fast path if we take a single get_dev_pagemap()
reference to signify that the page is alive and use the final put of the
page to drop that reference.

This does require some care to make sure that any waits for the
percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
since it now maintains its own elevated reference.

This speeds up things while also making the pmem refcounting more robust going
forward.

Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/149339998297.24933.1129582806028305912.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-01 09:15:53 +02:00
Zefan Li 9732adc5d6 cgroup: avoid attaching a cgroup root to two different superblocks, take 2
Commit bfb0b80db5 ("cgroup: avoid attaching a cgroup root to two
different superblocks") is broken.  Now we try to fix the race by
delaying the initialization of cgroup root refcnt until a superblock
has been allocated.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Andrei Vagin <avagin@virtuozzo.com>
Tested-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-04-28 18:04:54 -04:00
Rafael J. Wysocki 0807ee0f52 Merge branch 'pm-cpufreq'
* pm-cpufreq: (37 commits)
  cpufreq: Add Tegra186 cpufreq driver
  cpufreq: imx6q: Fix error handling code
  cpufreq: imx6q: Set max suspend_freq to avoid changes during suspend
  cpufreq: imx6q: Fix handling EPROBE_DEFER from regulator
  cpufreq: schedutil: Use policy-dependent transition delays
  cpufreq: schedutil: Reduce frequencies slower
  cpufreq: intel_pstate: Add support for Gemini Lake
  cpufreq: intel_pstate: Eliminate intel_pstate_get_min_max()
  cpufreq: intel_pstate: Do not walk policy->cpus
  cpufreq: intel_pstate: Introduce pid_in_use()
  cpufreq: intel_pstate: Drop struct cpu_defaults
  cpufreq: intel_pstate: Move cpu_defaults definitions
  cpufreq: intel_pstate: Add update_util callback to pstate_funcs
  cpufreq: intel_pstate: Use different utilization update callbacks
  cpufreq: intel_pstate: Modify check in intel_pstate_update_status()
  cpufreq: intel_pstate: Drop driver_registered variable
  cpufreq: intel_pstate: Skip unnecessary PID resets on init
  cpufreq: intel_pstate: Set HWP sampling interval once
  cpufreq: intel_pstate: Clean up intel_pstate_busy_pid_reset()
  cpufreq: intel_pstate: Fold intel_pstate_reset_all_pid() into the caller
  ...
2017-04-28 23:14:00 +02:00
Rafael J. Wysocki 2addac72af Merge schedutil governor updates for v4.12. 2017-04-28 23:13:33 +02:00
Hannes Frederic Sowa d24f7c7fb9 bpf: bpf_lock on kallsysms doesn't need to be irqsave
Hannes rightfully spotted that the bpf_lock doesn't need to be
irqsave variant. We never perform any such updates where this
would be necessary (neither right now nor in future), therefore
relax this further.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-28 15:48:14 -04:00
Tejun Heo a590b90d47 cgroup: fix spurious warnings on cgroup_is_dead() from cgroup_sk_alloc()
cgroup_get() expected to be called only on live cgroups and triggers
warning on a dead cgroup; however, cgroup_sk_alloc() may be called
while cloning a socket which is left in an empty and removed cgroup
and thus may legitimately duplicate its reference on a dead cgroup.
This currently triggers the following warning spuriously.

 WARNING: CPU: 14 PID: 0 at kernel/cgroup.c:490 cgroup_get+0x55/0x60
 ...
  [<ffffffff8107e123>] __warn+0xd3/0xf0
  [<ffffffff8107e20e>] warn_slowpath_null+0x1e/0x20
  [<ffffffff810ff465>] cgroup_get+0x55/0x60
  [<ffffffff81106061>] cgroup_sk_alloc+0x51/0xe0
  [<ffffffff81761beb>] sk_clone_lock+0x2db/0x390
  [<ffffffff817cce06>] inet_csk_clone_lock+0x16/0xc0
  [<ffffffff817e8173>] tcp_create_openreq_child+0x23/0x4b0
  [<ffffffff818601a1>] tcp_v6_syn_recv_sock+0x91/0x670
  [<ffffffff817e8b16>] tcp_check_req+0x3a6/0x4e0
  [<ffffffff81861ba3>] tcp_v6_rcv+0x693/0xa00
  [<ffffffff81837429>] ip6_input_finish+0x59/0x3e0
  [<ffffffff81837cb2>] ip6_input+0x32/0xb0
  [<ffffffff81837387>] ip6_rcv_finish+0x57/0xa0
  [<ffffffff81837ac8>] ipv6_rcv+0x318/0x4d0
  [<ffffffff817778c7>] __netif_receive_skb_core+0x2d7/0x9a0
  [<ffffffff81777fa6>] __netif_receive_skb+0x16/0x70
  [<ffffffff81778023>] netif_receive_skb_internal+0x23/0x80
  [<ffffffff817787d8>] napi_gro_frags+0x208/0x270
  [<ffffffff8168a9ec>] mlx4_en_process_rx_cq+0x74c/0xf40
  [<ffffffff8168b270>] mlx4_en_poll_rx_cq+0x30/0x90
  [<ffffffff81778b30>] net_rx_action+0x210/0x350
  [<ffffffff8188c426>] __do_softirq+0x106/0x2c7
  [<ffffffff81082bad>] irq_exit+0x9d/0xa0 [<ffffffff8188c0e4>] do_IRQ+0x54/0xd0
  [<ffffffff8188a63f>] common_interrupt+0x7f/0x7f <EOI>
  [<ffffffff8173d7e7>] cpuidle_enter+0x17/0x20
  [<ffffffff810bdfd9>] cpu_startup_entry+0x2a9/0x2f0
  [<ffffffff8103edd1>] start_secondary+0xf1/0x100

This patch renames the existing cgroup_get() with the dead cgroup
warning to cgroup_get_live() after cgroup_kn_lock_live() and
introduces the new cgroup_get() which doesn't check whether the cgroup
is live or dead.

All existing cgroup_get() users except for cgroup_sk_alloc() are
converted to use cgroup_get_live().

Fixes: d979a39d72 ("cgroup: duplicate cgroup reference when cloning sockets")
Cc: stable@vger.kernel.org # v4.5+
Cc: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Chris Mason <clm@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-04-28 15:28:20 -04:00
Frederic Weisbecker 25e2d8c1b9 sched/cputime: Fix ksoftirqd cputime accounting regression
irq_time_read() returns the irqtime minus the ksoftirqd time. This
is necessary because irq_time_read() is used to substract the IRQ time
from the sum_exec_runtime of a task. If we were to include the softirq
time of ksoftirqd, this task would substract its own CPU time everytime
it updates ksoftirqd->sum_exec_runtime which would therefore never
progress.

But this behaviour got broken by:

  a499a5a14d ("sched/cputime: Increment kcpustat directly on irqtime account")

... which now includes ksoftirqd softirq time in the time returned by
irq_time_read().

This has resulted in wrong ksoftirqd cputime reported to userspace
through /proc/stat and thus "top" not showing ksoftirqd when it should
after intense networking load.

ksoftirqd->stime happens to be correct but it gets scaled down by
sum_exec_runtime through task_cputime_adjusted().

To fix this, just account the strict IRQ time in a separate counter and
use it to report the IRQ time.

Reported-and-tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1493129448-5356-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-27 09:08:26 +02:00
David S. Miller b1513c3531 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 22:39:08 -04:00
Teng Qin 8fe4592438 bpf: map_get_next_key to return first key on NULL
When iterating through a map, we need to find a key that does not exist
in the map so map_get_next_key will give us the first key of the map.
This often requires a lot of guessing in production systems.

This patch makes map_get_next_key return the first key when the key
pointer in the parameter is NULL.

Signed-off-by: Teng Qin <qinteng@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 11:57:45 -04:00
Naveen N. Rao 1758618827 kallsyms: Use bounded strnchr() when parsing string
When parsing for the <module:name> format, we use strchr() to look for
the separator, when we know that the module name can't be longer than
MODULE_NAME_LEN. Enforce the same using strnchr().

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Jessica Yu <jeyu@redhat.com>
2017-04-24 14:07:28 -07:00
Daniel Borkmann e390b55d5a bpf: make bpf_xdp_adjust_head support mandatory
Now that also the last in-tree user of the xdp_adjust_head bit has
been removed, we can remove the flag from struct bpf_prog altogether.

This, at the same time, also makes sure that any future driver for
XDP comes with bpf_xdp_adjust_head() support right away.

A rejection based on this flag would also mean that tail calls
couldn't be used with such driver as per c2002f9837 ("bpf: fix
checking xdp_adjust_head on tail calls") fix, thus lets not allow
for it in the first place.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24 16:18:10 -04:00
Linus Torvalds fa8d7cdc84 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Thomas Gleixner:
 "The (hopefully) final fix for the irq affinity spreading logic"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq/affinity: Fix calculating vectors to assign
2017-04-23 12:48:05 -07:00
Eric W. Biederman 6c478ae920 signal: Make kill_proc_info static
There are no users outside of signal.c so make the function static so
the compiler and other developers have that information.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-04-21 22:46:25 -05:00
David S. Miller fb796707d7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Both conflict were simple overlapping changes.

In the kaweth case, Eric Dumazet's skb_cow() bug fix overlapped the
conversion of the driver in net-next to use in-netdev stats.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21 20:23:53 -07:00
Eric W. Biederman cad4ea546b rlimit: Properly call security_task_setrlimit
Modify do_prlimit to call security_task_setrlimit passing the task
whose rlimit we are changing not the tsk->group_leader.

In general this should not matter as the lsms implementing
security_task_setrlimit apparmor and selinux both examine the
task->cred to see what should be allowed on the destination task.

That task->cred is shared between tasks created with CLONE_THREAD
unless thread keyrings are in play, in which case both apparmor and
selinux create duplicate security contexts.

So the only time when it will matter which thread is passed to
security_task_setrlimit is if one of the threads of a process performs
an operation that changes only it's credentials.  At which point if a
thread has done that we don't want to hide that information from the
lsms.

So fix the call of security_task_setrlimit.  With the removal
of tsk->group_leader this makes the code slightly faster,
more comprehensible and maintainable.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-04-21 16:08:19 -05:00
David S. Miller 1f4407e254 net: Remove NET_CORE_BUDGET_USECS from sysctl binary interface.
We are not supposed to add new entries to this thing
any more.

Thanks to Eric Dumazet for noticing this.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21 15:59:52 -04:00
Matthew Whitehead 7acf8a1e8a Replace 2 jiffies with sysctl netdev_budget_usecs to enable softirq tuning
Constants used for tuning are generally a bad idea, especially as hardware
changes over time. Replace the constant 2 jiffies with sysctl variable
netdev_budget_usecs to enable sysadmins to tune the softirq processing.
Also document the variable.

For example, a very fast machine might tune this to 1000 microseconds,
while my regression testing 486DX-25 needs it to be 4000 microseconds on
a nearly idle network to prevent time_squeeze from being incremented.

Version 2: changed jiffies to microseconds for predictable units.

Signed-off-by: Matthew Whitehead <tedheadster@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21 13:22:34 -04:00
Jason A. Donenfeld 69b348449b padata: get_next is never NULL
Per Dan's static checker warning, the code that returns NULL was removed
in 2010, so this patch updates the comments and fixes the code
assumptions.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2017-04-21 20:30:46 +08:00
Steven Rostedt (VMware) dcc19d2809 tracing/ftrace: Allow for instances to trigger their own stacktrace probes
Have the stacktrace function trigger probe trigger stack traces within the
instance that they were added to in the set_ftrace_filter.

 ># cd /sys/kernel/debug/tracing
 ># mkdir instances/foo
 ># cd instances/foo
 ># echo schedule:stacktrace:1 > set_ftrace_filter
 ># cat trace
 # tracer: nop
 #
 # entries-in-buffer/entries-written: 1/1   #P:4
 #
 #                              _-----=> irqs-off
 #                             / _----=> need-resched
 #                            | / _---=> hardirq/softirq
 #                            || / _--=> preempt-depth
 #                            ||| /     delay
 #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
 #              | |       |   ||||       |         |
           <idle>-0     [001] .N.2   202.585010: <stack trace>
  =>
  => schedule
  => schedule_preempt_disabled
  => do_idle
  => cpu_startup_entry
  => start_secondary
  => verify_cpu

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:49 -04:00
Steven Rostedt (VMware) 2290f2c589 tracing/ftrace: Allow for the traceonoff probe be unique to instances
Have the traceon/off function probe triggers affect only the instance they
are set in. This required making the trace_on/off accessible for other files
in the tracing directory.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:48 -04:00
Steven Rostedt (VMware) cab5037950 tracing/ftrace: Enable snapshot function trigger to work with instances
Modify the snapshot probe trigger to work with instances. This way the
snapshot function trigger will only affect the instance that it is added to
in the set_ftrace_filter file.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:48 -04:00
Steven Rostedt (VMware) d2afd57a4b tracing/ftrace: Allow instances to have their own function probes
Pass around the local trace_array that is the descriptor for tracing
instances, when enabling and disabling probes. This by default sets the
enable/disable of event probe triggers to work with instances.

The other probes will need some more work to get them working with
instances.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:47 -04:00
Steven Rostedt (VMware) 6e4443199e tracing/ftrace: Add a better way to pass data via the probe functions
With the redesign of the registration and execution of the function probes
(triggers), data can now be passed from the setup of the probe to the probe
callers that are specific to the trace_array it is on. Although, all probes
still only affect the toplevel trace array, this change will allow for
instances to have their own probes separated from other instances and the
top array.

That is, something like the stacktrace probe can be set to trace only in an
instance and not the toplevel trace array. This isn't implement yet, but
this change sets the ground work for the change.

When a probe callback is triggered (someone writes the probe format into
set_ftrace_filter), it calls register_ftrace_function_probe() passing in
init_data that will be used to initialize the probe. Then for every matching
function, register_ftrace_function_probe() will call the probe_ops->init()
function with the init data that was passed to it, as well as an address to
a place holder that is associated with the probe and the instance. The first
occurrence will have a NULL in the pointer. The init() function will then
initialize it. If other probes are added, or more functions are part of the
probe, the place holder will be passed to the init() function with the place
holder data that it was initialized to the last time.

Then this place_holder is passed to each of the other probe_ops functions,
where it can be used in the function callback. When the probe_ops free()
function is called, it can be called either with the rip of the function
that is being removed from the probe, or zero, indicating that there are no
more functions attached to the probe, and the place holder is about to be
freed. This gives the probe_ops a way to free the data it assigned to the
place holder if it was allocade during the first init call.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:46 -04:00
Steven Rostedt (VMware) 7b60f3d876 ftrace: Dynamically create the probe ftrace_ops for the trace_array
In order to eventually have each trace_array instance have its own unique
set of function probes (triggers), the trace array needs to hold the ops and
the filters for the probes.

This is the first step to accomplish this. Instead of having the private
data of the probe ops point to the trace_array, create a separate list that
the trace_array holds. There's only one private_data for a probe, we need
one per trace_array. The probe ftrace_ops will be dynamically created for
each instance, instead of being static.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:46 -04:00
Steven Rostedt (VMware) b5f081b563 tracing: Pass the trace_array into ftrace_probe_ops functions
Pass the trace_array associated to a ftrace_probe_ops into the probe_ops
func(), init() and free() functions. The trace_array is the descriptor that
describes a tracing instance. This will help create the infrastructure that
will allow having function probes unique to tracing instances.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:45 -04:00
Steven Rostedt (VMware) 04ec7bb642 tracing: Have the trace_array hold the list of registered func probes
Add a link list to the trace_array to hold func probes that are registered.
Currently, all function probes are the same for all instances as it was
before, that is, only the top level trace_array holds the function probes.
But this lays the ground work to have function probes be attached to
individual instances, and having the event trigger only affect events in the
given instance. But that work is still to be done.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:45 -04:00
Steven Rostedt (VMware) 8d70725e45 ftrace: If the hash for a probe fails to update then free what was initialized
If the ftrace_hash_move_and_update_ops() fails, and an ops->free() function
exists, then it needs to be called on all the ops that were added by this
registration.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:44 -04:00
Steven Rostedt (VMware) eee8ded131 ftrace: Have the function probes call their own function
Now that the function probes have their own ftrace_ops, there's no reason to
continue using the ftrace_func_hash to find which probe to call in the
function callback. The ops that is passed in to the function callback is
part of the probe_ops to call.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:43 -04:00
Steven Rostedt (VMware) 1ec3a81a0c ftrace: Have each function probe use its own ftrace_ops
Have the function probes have their own ftrace_ops, and remove the
trace_probe_ops. This simplifies some of the ftrace infrastructure code.

Individual entries for each function is still allocated for the use of the
output for set_ftrace_filter, but they will be removed soon too.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:43 -04:00
Steven Rostedt (VMware) d3d532d798 ftrace: Have unregister_ftrace_function_probe_func() return a value
Currently unregister_ftrace_function_probe_func() is a void function. It
does not give any feedback if an error occurred or no item was found to
remove and nothing was done.

Change it to return status and success if it removed something. Also update
the callers to return that feedback to the user.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:42 -04:00
Steven Rostedt (VMware) e16b35ddb8 ftrace: Add helper function ftrace_hash_move_and_update_ops()
The processes of updating a ops filter_hash is a bit complex, and requires
setting up an old hash to perform the update. This is done exactly the same
in two locations for the same reasons. Create a helper function that does it
in one place.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:42 -04:00
Steven Rostedt (VMware) 1a48df0041 ftrace: Remove data field from ftrace_func_probe structure
No users of the function probes uses the data field anymore. Remove it, and
change the init function to take a void *data parameter instead of a
void **data, because the init will just get the data that the registering
function was received, and there's no state after it is called.

The other functions for ftrace_probe_ops still take the data parameter, but
it will currently only be passed NULL. It will stay as a parameter for
future data to be passed to these functions.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:41 -04:00
Steven Rostedt (VMware) 02b77e2afb ftrace: Remove printing of data in showing of a function probe
None of the probe users uses the data field anymore of the entry. They all
have their own print() function. Remove showing the data field in the
generic function as the data field will be going away.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:40 -04:00
Steven Rostedt (VMware) 78f78e07d5 ftrace: Remove unused unregister_ftrace_function_probe_all() function
There are no users of unregister_ftrace_function_probe_all(). The only probe
function that is used is unregister_ftrace_function_probe_func(). Rename the
internal static function __unregister_ftrace_function_probe() to
unregister_ftrace_function_probe_func() and make it global.

Also remove the PROBE_TEST_FUNC as it would be always set.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:40 -04:00
Steven Rostedt (VMware) 0fe7e7e3f8 ftrace: Remove unused unregister_ftrace_function_probe() function
Nothing calls unregister_ftrace_function_probe(). Remove it as well as the
flag PROBE_TEST_DATA, as this function was the only one to set it.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:39 -04:00
Steven Rostedt (VMware) fe014e24b6 ftrace: Convert the rest of the function trigger over to the mapping functions
As the data pointer for individual ips will soon be removed and no longer
passed to the callback function probe handlers, convert the rest of the function
trigger counters over to the new ftrace_func_mapper helper functions.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:39 -04:00
Steven Rostedt (VMware) 1a93f8bd19 tracing: Have the snapshot trigger use the mapping helper functions
As the data pointer for individual ips will soon be removed and no longer
passed to the callback function probe handlers, convert the snapshot
trigger counter over to the new ftrace_func_mapper helper functions.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:38 -04:00
Steven Rostedt (VMware) 41794f1907 ftrace: Added ftrace_func_mapper for function probe triggers
In order to move the ops to the function probes directly, they need a way to
map function ips to their own data without depending on the infrastructure
of the function probes, as the data field will be going away.

New helper functions are added that are based on the ftrace_hash code.
ftrace_func_mapper functions are there to let the probes map ips to their
data. These can be allocated by the probe ops, and referenced in the
function callbacks.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:37 -04:00
Steven Rostedt (VMware) bca6c8d048 ftrace: Pass probe ops to probe function
In preparation to cleaning up the probe function registration code, the
"data" parameter will eventually be removed from the probe->func() call.
Instead it will receive its own "ops" function, in which it can set up its
own data that it needs to map.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:37 -04:00
Steven Rostedt (VMware) e51a989679 ftrace: Remove unused "flags" field from struct ftrace_func_probe
Nothing uses "flags" in the ftrace_func_probe descriptor. Remove it.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:36 -04:00
Steven Rostedt (VMware) 92a68fa047 ftrace: Move the function commands into the tracing directory
As nothing outside the tracing directory uses the function command mechanism,
I'm moving the prototypes out of the include/linux/ftrace.h and into the
local kernel/trace/trace.h header. I plan on making them hook to the
trace_array structure which is local to kernel/trace, and I do not want to
expose it to the rest of the kernel. This requires that the command functions
must also be local to tracing. But luckily nothing else uses them.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:33 -04:00
Linus Torvalds 160062e190 While continuing my development, I uncovered two more small bugs.
One is a race condition when enabling the snapshot function probe
 trigger. It enables the probe before allocating the snapshot, and
 if the probe triggers first, it stops tracing with a warning that
 the snapshot buffer was not allocated.
 
 The seconds is that the snapshot file should show how to use it when
 it is empty. But a bug fix from long ago broke the "is empty" test
 and the snapshot file no longer displays the help message.
 -----BEGIN PGP SIGNATURE-----
 
 iQExBAABCAAbBQJY+L3dFBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
 DyQH/j4ZoRhc+XziMw7iJxNvDfptT9XFawqTKDdYJ3nMsFp+40bzlfYah94b1YYQ
 YTLnvlxtiYUo1rifOnsdY913IKLc1wtO/a/S8/qqUJ1+7ik46zgaPYqNQlvM6clV
 xoJQ6+c631SbJ3KuhadvXTABvzF4Qc1w0/f81lzGgYE8IB2VxiWeYZDMVxe5r2oM
 A0seve9C5Wps39m/kcFHSVMZwpk6s7gZL7ERcME4dOewJpQ7b0ufWXMsBssD0bMx
 G0ihBdfeM6TzXSTtrnLzU9eZaUtfh37olpvjpJzdIUUqwVpSrxOKmLcsYCIeNs3f
 YuS54g7kEsDqLxGJvkC0UBou2rU=
 =DQC3
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.11-rc5-5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull two more ftrace fixes from Steven Rostedt:
 "While continuing my development, I uncovered two more small bugs.

  One is a race condition when enabling the snapshot function probe
  trigger. It enables the probe before allocating the snapshot, and if
  the probe triggers first, it stops tracing with a warning that the
  snapshot buffer was not allocated.

  The seconds is that the snapshot file should show how to use it when
  it is empty. But a bug fix from long ago broke the "is empty" test and
  the snapshot file no longer displays the help message"

* tag 'trace-v4.11-rc5-5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ring-buffer: Have ring_buffer_iter_empty() return true when empty
  tracing: Allocate the snapshot buffer before enabling probe
2017-04-20 12:30:10 -07:00
Christoph Hellwig caf7df1227 block: remove the errors field from struct request
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:16:10 -06:00
Christoph Hellwig cee4b7ce3f blktrace: remove the unused block_rq_abort tracepoint
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:16:10 -06:00
David S. Miller 7b9f6da175 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
A function in kernel/bpf/syscall.c which got a bug fix in 'net'
was moved to kernel/bpf/verifier.c in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-20 10:35:33 -04:00
Thomas Gleixner 7a258ff04f Merge branch 'linus' into irq/core
Pick up upstream fixes to avoid conflicts with pending patches.
2017-04-20 16:05:13 +02:00
Keith Busch b72f8051f3 genirq/affinity: Fix calculating vectors to assign
The vectors_per_node is calculated from the remaining available vectors.
The current vector starts after pre_vectors, so we need to subtract that
from the current to properly account for the number of remaining vectors
to assign.

Fixes: 3412386b53 ("irq/affinity: Fix extra vecs calculation")
Reported-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Link: http://lkml.kernel.org/r/1492645870-13019-1-git-send-email-keith.busch@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-20 16:03:09 +02:00
Naveen N. Rao 290e307076 powerpc/kprobes: Fix handling of function offsets on ABIv2
commit 239aeba764 ("perf powerpc: Fix kprobe and kretprobe handling with
kallsyms on ppc64le") changed how we use the offset field in struct kprobe on
ABIv2. perf now offsets from the global entry point if an offset is specified
and otherwise chooses the local entry point.

Fix the same in kernel for kprobe API users. We do this by extending
kprobe_lookup_name() to accept an additional parameter to indicate the offset
specified with the kprobe registration. If offset is 0, we return the local
function entry and return the global entry point otherwise.

With:
  # cd /sys/kernel/debug/tracing/
  # echo "p _do_fork" >> kprobe_events
  # echo "p _do_fork+0x10" >> kprobe_events

before this patch:
  # cat ../kprobes/list
  c0000000000d0748  k  _do_fork+0x8    [DISABLED]
  c0000000000d0758  k  _do_fork+0x18    [DISABLED]
  c0000000000412b0  k  kretprobe_trampoline+0x0    [OPTIMIZED]

and after:
  # cat ../kprobes/list
  c0000000000d04c8  k  _do_fork+0x8    [DISABLED]
  c0000000000d04d0  k  _do_fork+0x10    [DISABLED]
  c0000000000412b0  k  kretprobe_trampoline+0x0    [OPTIMIZED]

Acked-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-20 23:18:55 +10:00
Naveen N. Rao 49e0b4658f kprobes: Convert kprobe_lookup_name() to a function
The macro is now pretty long and ugly on powerpc. In the light of further
changes needed here, convert it to a __weak variant to be over-ridden with a
nicer looking function.

Suggested-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-20 23:18:54 +10:00
Masami Hiramatsu a460246c70 kprobes: Skip preparing optprobe if the probe is ftrace-based
Skip preparing optprobe if the probe is ftrace-based, since anyway, it
must not be optimized (or already optimized by ftrace).

Tested-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-20 23:18:54 +10:00
Myungho Jung b94bf594cf timer/sysclt: Restrict timer migration sysctl values to 0 and 1
timer_migration sysctl acts as a boolean switch, so the allowed values
should be restricted to 0 and 1.

Add the necessary extra fields to the sysctl table entry to enforce that.

[ tglx: Rewrote changelog ]

Signed-off-by: Myungho Jung <mhjungk@gmail.com>
Link: http://lkml.kernel.org/r/1492640690-3550-1-git-send-email-mhjungk@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-20 14:56:59 +02:00
Steven Rostedt (VMware) 78f7a45dac ring-buffer: Have ring_buffer_iter_empty() return true when empty
I noticed that reading the snapshot file when it is empty no longer gives a
status. It suppose to show the status of the snapshot buffer as well as how
to allocate and use it. For example:

 ># cat snapshot
 # tracer: nop
 #
 #
 # * Snapshot is allocated *
 #
 # Snapshot commands:
 # echo 0 > snapshot : Clears and frees snapshot buffer
 # echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.
 #                      Takes a snapshot of the main buffer.
 # echo 2 > snapshot : Clears snapshot buffer (but does not allocate or free)
 #                      (Doesn't have to be '2' works with any number that
 #                       is not a '0' or '1')

But instead it just showed an empty buffer:

 ># cat snapshot
 # tracer: nop
 #
 # entries-in-buffer/entries-written: 0/0   #P:4
 #
 #                              _-----=> irqs-off
 #                             / _----=> need-resched
 #                            | / _---=> hardirq/softirq
 #                            || / _--=> preempt-depth
 #                            ||| /     delay
 #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
 #              | |       |   ||||       |         |

What happened was that it was using the ring_buffer_iter_empty() function to
see if it was empty, and if it was, it showed the status. But that function
was returning false when it was empty. The reason was that the iter header
page was on the reader page, and the reader page was empty, but so was the
buffer itself. The check only tested to see if the iter was on the commit
page, but the commit page was no longer pointing to the reader page, but as
all pages were empty, the buffer is also.

Cc: stable@vger.kernel.org
Fixes: 651e22f270 ("ring-buffer: Always reset iterator to reader page")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-19 21:23:47 -04:00
Steven Rostedt (VMware) df62db5be2 tracing: Allocate the snapshot buffer before enabling probe
Currently the snapshot trigger enables the probe and then allocates the
snapshot. If the probe triggers before the allocation, it could cause the
snapshot to fail and turn tracing off. It's best to allocate the snapshot
buffer first, and then enable the trigger. If something goes wrong in the
enabling of the trigger, the snapshot buffer is still allocated, but it can
also be freed by the user by writting zero into the snapshot buffer file.

Also add a check of the return status of alloc_snapshot().

Cc: stable@vger.kernel.org
Fixes: 77fd5c15e3 ("tracing: Add snapshot trigger to function probes")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-19 14:19:08 -04:00
Dave Airlie 856ee92e86 Linux 4.11-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJY881cAAoJEHm+PkMAQRiGG4UH+wa2z6Qet36Uc4nXFZuSMYrO
 ErUWs1QpTDDv4a+LE4fgyMvM3j9XqtpfQLy1n70jfD14IqPBhHe4gytasAf+8lg1
 YvddFx0Yl3sygVu3dDBNigWeVDbfwepW59coN0vI5nrMo+wrei8aVIWcFKOxdMuO
 n72u9vuhrkEnLJuQk7SF+t4OQob9McXE3s7QgyRopmlKhKo7mh8On7K2BRI5uluL
 t0j5kZM0a43EUT5rq9xR8f5pgtyfTMG/FO2MuzZn43MJcZcyfmnOP/cTSIvAKA5U
 1i12lxlokYhURNUe+S6jm8A47TrqSRSJxaQJZRlfGJksZ0LJa8eUaLDCviBQEoE=
 =6QWZ
 -----END PGP SIGNATURE-----

Merge tag 'v4.11-rc7' into drm-next

Backmerge Linux 4.11-rc7 from Linus tree, to fix some
conflicts that were causing problems with the rerere cache
in drm-tip.
2017-04-19 11:07:14 +10:00
Linus Torvalds 005882e53d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc
Pull sparc fixes from David Miller:
 "Two Sparc bug fixes from Daniel Jordan and Nitin Gupta"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
  sparc64: Fix hugepage page table free
  sparc64: Use LOCKDEP_SMALL, not PROVE_LOCKING_SMALL
2017-04-18 13:56:51 -07:00
Linus Torvalds 40d9018eb7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) BPF tail call handling bug fixes from Daniel Borkmann.

 2) Fix allowance of too many rx queues in sfc driver, from Bert
    Kenward.

 3) Non-loopback ipv6 packets claiming src of ::1 should be dropped,
    from Florian Westphal.

 4) Statistics requests on KSZ9031 can crash, fix from Grygorii
    Strashko.

 5) TX ring handling fixes in mediatek driver, from Sean Wang.

 6) ip_ra_control can deadlock, fix lock acquisition ordering to fix,
    from Cong WANG.

 7) Fix use after free in ip_recv_error(), from Willem de Buijn.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  bpf: fix checking xdp_adjust_head on tail calls
  bpf: fix cb access in socket filter programs on tail calls
  ipv6: drop non loopback packets claiming to originate from ::1
  net: ethernet: mediatek: fix inconsistency of port number carried in TXD
  net: ethernet: mediatek: fix inconsistency between TXD and the used buffer
  net: phy: micrel: fix crash when statistic requested for KSZ9031 phy
  net: vrf: Fix setting NLM_F_EXCL flag when adding l3mdev rule
  net: thunderx: Fix set_max_bgx_per_node for 81xx rgx
  net-timestamp: avoid use-after-free in ip_recv_error
  ipv4: fix a deadlock in ip_ra_control
  sfc: limit the number of receive queues
2017-04-18 13:24:42 -07:00
Daniel Jordan 395102db44 sparc64: Use LOCKDEP_SMALL, not PROVE_LOCKING_SMALL
CONFIG_PROVE_LOCKING_SMALL shrinks the memory usage of lockdep so the
kernel text, data, and bss fit in the required 32MB limit, but this
option is not set for every config that enables lockdep.

A 4.10 kernel fails to boot with the console output

    Kernel: Using 8 locked TLB entries for main kernel image.
    hypervisor_tlb_lock[2000000:0:8000000071c007c3:1]: errors with f
    Program terminated

with these config options

    CONFIG_LOCKDEP=y
    CONFIG_LOCK_STAT=y
    CONFIG_PROVE_LOCKING=n

To fix, rename CONFIG_PROVE_LOCKING_SMALL to CONFIG_LOCKDEP_SMALL, and
enable this option with CONFIG_LOCKDEP=y so we get the reduced memory
usage every time lockdep is turned on.

Tested that CONFIG_LOCKDEP_SMALL is set to 'y' if and only if
CONFIG_LOCKDEP is set to 'y'.  When other lockdep-related config options
that select CONFIG_LOCKDEP are enabled (e.g. CONFIG_LOCK_STAT or
CONFIG_PROVE_LOCKING), verified that CONFIG_LOCKDEP_SMALL is also
enabled.

Fixes: e6b5f1be7a ("config: Adding the new config parameter CONFIG_PROVE_LOCKING_SMALL for sparc")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Babu Moger <babu.moger@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-18 13:11:07 -07:00
Steven Rostedt (VMware) ec19b85913 ftrace: Move the probe function into the tracing directory
As nothing outside the tracing directory uses the function probes mechanism,
I'm moving the prototypes out of the include/linux/ftrace.h and into the
local kernel/trace/trace.h header. I plan on making them hook to the
trace_array structure which is local to kernel/trace, and I do not want to
expose it to the rest of the kernel. This requires that the probe functions
must also be local to tracing. But luckily nothing else uses them.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-18 13:49:59 -04:00
Linus Torvalds 0bad6d7e93 Namhyung Kim discovered a use after free bug. It has to do with adding
a pid filter to function tracing in an instance, and then freeing
 the instance.
 -----BEGIN PGP SIGNATURE-----
 
 iQExBAABCAAbBQJY9hO7FBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
 qBgIAJv+IH1zQTHqFn4gOtIkHJ0kxjTr9mzz4S5SgnHDMaCKOHTpuste02RmCvfo
 J+6F//bw3eM9CpEcQg/t41aFagXs+g3x1HmD0PN7Y1fKHXQ5xDdpjPpOsgprrx7q
 dvGLg4bolv6KaNMTJmJ8LhwPXJGMEqnbY6Ypz3qbnsziSeXe1zcrQKNA88ySJoh0
 V6QV9XPWNkPO4AknnqD88oZvJhz/H/fQuJYQZNBoTomD6SG3f7mYW1bxyoWc08yW
 W+Rg/YddGHk6Mmkqy0BaCPBjKjGiq20h9DOvLU6CFR0Gt4ZQ7sVZczYN4NkjEn7H
 qdFcqaHNSkjxs0JFvbWToIu4D8w=
 =Gv/C
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.11-rc5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull ftrace fix from Steven Rostedt:
 "Namhyung Kim discovered a use after free bug. It has to do with adding
  a pid filter to function tracing in an instance, and then freeing the
  instance"

* tag 'trace-v4.11-rc5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace: Fix function pid filter on instances
2017-04-18 09:31:51 -07:00
Baoquan He f51b17c8d9 boot/param: Move next_arg() function to lib/cmdline.c for later reuse
next_arg() will be used to parse boot parameters in the x86/boot/compressed code,
so move it to lib/cmdline.c for better code reuse.

No change in functionality.

Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
Cc: Jens Axboe <axboe@fb.com>
Cc: Jessica Yu <jeyu@redhat.com>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Larry Finger <Larry.Finger@lwfinger.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dan.j.williams@intel.com
Cc: dave.jiang@intel.com
Cc: dyoung@redhat.com
Cc: keescook@chromium.org
Cc: zijun_hu <zijun_hu@htc.com>
Link: http://lkml.kernel.org/r/1492436099-4017-2-git-send-email-bhe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-18 10:37:13 +02:00
Namhyung Kim 1e10486ffe ftrace: Add 'function-fork' trace option
The function-fork option is same as event-fork that it tracks task
fork/exit and set the pid filter properly.  This can be useful if user
wants to trace selected tasks including their children only.

Link: http://lkml.kernel.org/r/20170417024430.21194-3-namhyung@kernel.org

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-17 17:13:00 -04:00
Namhyung Kim d879d0b8c1 ftrace: Fix function pid filter on instances
When function tracer has a pid filter, it adds a probe to sched_switch
to track if current task can be ignored.  The probe checks the
ftrace_ignore_pid from current tr to filter tasks.  But it misses to
delete the probe when removing an instance so that it can cause a crash
due to the invalid tr pointer (use-after-free).

This is easily reproducible with the following:

  # cd /sys/kernel/debug/tracing
  # mkdir instances/buggy
  # echo $$ > instances/buggy/set_ftrace_pid
  # rmdir instances/buggy

  ============================================================================
  BUG: KASAN: use-after-free in ftrace_filter_pid_sched_switch_probe+0x3d/0x90
  Read of size 8 by task kworker/0:1/17
  CPU: 0 PID: 17 Comm: kworker/0:1 Tainted: G    B           4.11.0-rc3  #198
  Call Trace:
   dump_stack+0x68/0x9f
   kasan_object_err+0x21/0x70
   kasan_report.part.1+0x22b/0x500
   ? ftrace_filter_pid_sched_switch_probe+0x3d/0x90
   kasan_report+0x25/0x30
   __asan_load8+0x5e/0x70
   ftrace_filter_pid_sched_switch_probe+0x3d/0x90
   ? fpid_start+0x130/0x130
   __schedule+0x571/0xce0
   ...

To fix it, use ftrace_clear_pids() to unregister the probe.  As
instance_rmdir() already updated ftrace codes, it can just free the
filter safely.

Link: http://lkml.kernel.org/r/20170417024430.21194-2-namhyung@kernel.org

Fixes: 0c8916c342 ("tracing: Add rmdir to remove multibuffer instances")
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-17 16:44:23 -04:00
Daniel Borkmann c2002f9837 bpf: fix checking xdp_adjust_head on tail calls
Commit 17bedab272 ("bpf: xdp: Allow head adjustment in XDP prog")
added the xdp_adjust_head bit to the BPF prog in order to tell drivers
that the program that is to be attached requires support for the XDP
bpf_xdp_adjust_head() helper such that drivers not supporting this
helper can reject the program. There are also drivers that do support
the helper, but need to check for xdp_adjust_head bit in order to move
packet metadata prepended by the firmware away for making headroom.

For these cases, the current check for xdp_adjust_head bit is insufficient
since there can be cases where the program itself does not use the
bpf_xdp_adjust_head() helper, but tail calls into another program that
uses bpf_xdp_adjust_head(). As such, the xdp_adjust_head bit is still
set to 0. Since the first program has no control over which program it
calls into, we need to assume that bpf_xdp_adjust_head() helper is used
upon tail calls. Thus, for the very same reasons in cb_access, set the
xdp_adjust_head bit to 1 when the main program uses tail calls.

Fixes: 17bedab272 ("bpf: xdp: Allow head adjustment in XDP prog")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 15:51:57 -04:00
Daniel Borkmann 6b1bb01bcc bpf: fix cb access in socket filter programs on tail calls
Commit ff936a04e5 ("bpf: fix cb access in socket filter programs")
added a fix for socket filter programs such that in i) AF_PACKET the
20 bytes of skb->cb[] area gets zeroed before use in order to not leak
data, and ii) socket filter programs attached to TCP/UDP sockets need
to save/restore these 20 bytes since they are also used by protocol
layers at that time.

The problem is that bpf_prog_run_save_cb() and bpf_prog_run_clear_cb()
only look at the actual attached program to determine whether to zero
or save/restore the skb->cb[] parts. There can be cases where the
actual attached program does not access the skb->cb[], but the program
tail calls into another program which does access this area. In such
a case, the zero or save/restore is currently not performed.

Since the programs we tail call into are unknown at verification time
and can dynamically change, we need to assume that whenever the attached
program performs a tail call, that later programs could access the
skb->cb[], and therefore we need to always set cb_access to 1.

Fixes: ff936a04e5 ("bpf: fix cb access in socket filter programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 15:51:57 -04:00
Steven Rostedt (VMware) b980b117c9 tracing: Have the trace_event benchmark thread call cond_resched_rcu_qs()
The trace_event benchmark thread runs in kernel space in an infinite loop
while also calling cond_resched() in case anything else wants to schedule
in. Unfortunately, on a PREEMPT kernel, that makes it a nop, in which case,
this will never voluntarily schedule. That will cause synchronize_rcu_tasks()
to forever block on this thread, while it is running.

This is exactly what cond_resched_rcu_qs() is for. Use that instead.

Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-17 15:21:19 -04:00
Martin KaFai Lau 695ba2651a bpf: lru: Lower the PERCPU_NR_SCANS from 16 to 4
After doing map_perf_test with a much bigger
BPF_F_NO_COMMON_LRU map, the perf report shows a
lot of time spent in rotating the inactive list (i.e.
__bpf_lru_list_rotate_inactive):
> map_perf_test 32 8 10000 1000000 | awk '{sum += $3}END{print sum}'
19644783 (19M/s)
> map_perf_test 32 8 10000000 10000000 |  awk '{sum += $3}END{print sum}'
6283930 (6.28M/s)

By inactive, it usually means the element is not in cache.  Hence,
there is a need to tune the PERCPU_NR_SCANS value.

This patch finds a better number of elements to
scan during each list rotation.  The PERCPU_NR_SCANS (which
is defined the same as PERCPU_FREE_TARGET) decreases
from 16 elements to 4 elements.  This change only
affects the BPF_F_NO_COMMON_LRU map.

The test_lru_dist does not show meaningful difference
between 16 and 4.  Our production L4 load balancer which uses
the LRU map for conntrack-ing also shows little change in cache
hit rate.  Since both benchmark and production data show no
cache-hit difference, PERCPU_NR_SCANS is lowered from 16 to 4.
We can consider making it configurable if we find a usecase
later that shows another value works better and/or use
a different rotation strategy.

After this change:
> map_perf_test 32 8 10000000 10000000 |  awk '{sum += $3}END{print sum}'
9240324 (9.2M/s)

i.e. 6.28M/s -> 9.2M/s

The test_lru_dist has not shown meaningful difference:
> test_lru_dist zipf.100k.a1_01.out 4000 1:
nr_misses: 31575 (Before) vs 31566 (After)

> test_lru_dist zipf.100k.a0_01.out 40000 1
nr_misses: 67036 (Before) vs 67031 (After)

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 13:55:52 -04:00
Rafael J. Wysocki 1b72e7fd30 cpufreq: schedutil: Use policy-dependent transition delays
Make the schedutil governor take the initial (default) value of the
rate_limit_us sysfs attribute from the (new) transition_delay_us
policy parameter (to be set by the scaling driver).

That will allow scaling drivers to make schedutil use smaller default
values of rate_limit_us and reduce the default average time interval
between consecutive frequency changes.

Make intel_pstate set transition_delay_us to 500.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2017-04-17 18:37:27 +02:00
Steven Rostedt (VMware) fcdc712579 ftrace: Fix indexing of t_hash_start() from t_next()
t_hash_start() does not increment *pos, where as t_next() must. But when
t_next() does increment *pos, it must still pass in the original *pos to
t_hash_start() otherwise it will skip the first instance:

 # cd /sys/kernel/debug/tracing
 # echo schedule:traceoff > set_ftrace_filter
 # echo do_IRQ:traceoff > set_ftrace_filter
 # echo call_rcu > set_ftrace_filter
 # cat set_ftrace_filter
call_rcu
schedule:traceoff:unlimited
do_IRQ:traceoff:unlimited

The above called t_hash_start() from t_start() as there was only one
function (call_rcu), but if we add another function:

 # echo xfrm_policy_destroy_rcu >> set_ftrace_filter
 # cat set_ftrace_filter
call_rcu
xfrm_policy_destroy_rcu
do_IRQ:traceoff:unlimited

The "schedule:traceoff" disappears.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-17 10:22:29 -04:00
Eric W. Biederman 01a2197485 posix-timers: Correct sanity check in posix_cpu_nsleep
CPUCLOCK_PID(which_clock) is a pid value from userspace so compare it
against task_pid_vnr, not current->pid.  As task_pid_vnr is in the tasks
pid value in the tasks pid namespace, and current->pid is in the
initial pid namespace.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-04-16 23:42:50 -05:00
Josh Poimboeuf 77f8f39a2e livepatch: add missing printk newlines
Add missing newlines to some pr_err() strings.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Jessica Yu <jeyu@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-04-16 22:48:05 +02:00
Linus Torvalds 11c994d9a5 Merge branch 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo:
 "Unfortunately, the commit to fix the cgroup mount race in the previous
  pull request can lead to hangs.

  The original bug has been around for a while and isn't too likely to
  be triggered in usual use cases. Revert the commit for now"

* 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  Revert "cgroup: avoid attaching a cgroup root to two different superblocks"
2017-04-16 11:48:10 -07:00
Linus Torvalds 48538861b9 While rewriting the function probe code, I stumbled over a long standing
bug. This bug has been there sinc function tracing was added way back
 when. But my new development depends on this bug being fixed, and it
 should be fixed regardless as it causes ftrace to disable itself when
 triggered, and a reboot is required to enable it again.
 
 The bug is that the function probe does not disable itself properly
 if there's another probe of its type still enabled. For example:
 
      # cd /sys/kernel/debug/tracing
      # echo schedule:traceoff > set_ftrace_filter
      # echo do_IRQ:traceoff > set_ftrace_filter
      # echo \!do_IRQ:traceoff > /debug/tracing/set_ftrace_filter
      # echo do_IRQ:traceoff > set_ftrace_filter
 
 The above registers two traceoff probes (one for schedule and one for
 do_IRQ, and then removes do_IRQ. But since there still exists one for
 schedule, it is not done properly. When adding do_IRQ back, the breakage
 in the accounting is noticed by the ftrace self tests, and it causes
 a warning and disables ftrace.
 -----BEGIN PGP SIGNATURE-----
 
 iQExBAABCAAbBQJY8ovvFBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
 nkAH/jfsXUWIbZ6J0A7+nmGiBdIVwLwG0ZOJClcxjnCSpsNs+FO/0w6ragtIYCi2
 Km+0s/slA5GOddG4Miga/dhtxGhDosyXnxqC+4GmD0maqJGLweJLbmiQ1xhra0hr
 XGDI+SXHM/n22zVkFEbkGXgxMvOHeR+X/sREZo3XmoXRLbc1QVtTEe/8TdlLXwE5
 5Fs07xSQqx4TS7oBxIjipHnbHL/gIktEo0HiEmq73++r42MztIMYZPoV+cXuim37
 C6xO4PxfPN0aRh9W5gdiMnbv6lummVBNQXwpMya0vTbxz/9WeUex8c+lcInQUJgA
 FhQWKaCGyi0UK4Pa2Pz/Dmxuti0=
 =LYLo
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.11-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull ftrace fix from Steven Rostedt:
 "While rewriting the function probe code, I stumbled over a long
  standing bug. This bug has been there sinc function tracing was added
  way back when. But my new development depends on this bug being fixed,
  and it should be fixed regardless as it causes ftrace to disable
  itself when triggered, and a reboot is required to enable it again.

  The bug is that the function probe does not disable itself properly if
  there's another probe of its type still enabled. For example:

     # cd /sys/kernel/debug/tracing
     # echo schedule:traceoff > set_ftrace_filter
     # echo do_IRQ:traceoff > set_ftrace_filter
     # echo \!do_IRQ:traceoff > /debug/tracing/set_ftrace_filter
     # echo do_IRQ:traceoff > set_ftrace_filter

  The above registers two traceoff probes (one for schedule and one for
  do_IRQ, and then removes do_IRQ.

  But since there still exists one for schedule, it is not done
  properly. When adding do_IRQ back, the breakage in the accounting is
  noticed by the ftrace self tests, and it causes a warning and disables
  ftrace"

* tag 'trace-v4.11-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace: Fix removing of second function probe
2017-04-16 10:01:34 -07:00
Tejun Heo 330c418638 Revert "cgroup: avoid attaching a cgroup root to two different superblocks"
This reverts commit bfb0b80db5.

Andrei reports CRIU test hangs with the patch applied.  The bug fixed
by the patch isn't too likely to trigger in actual uses.  Revert the
patch for now.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Andrei Vagin <avagin@virtuozzo.com>
Link: http://lkml.kernel.org/r/20170414232737.GC20350@outlook.office365.com
2017-04-16 23:17:37 +09:00
David S. Miller 6b6cbc1471 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts were simply overlapping changes.  In the net/ipv4/route.c
case the code had simply moved around a little bit and the same fix
was made in both 'net' and 'net-next'.

In the net/sched/sch_generic.c case a fix in 'net' happened at
the same time that a new argument was added to qdisc_hash_add().

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-15 21:16:30 -04:00
Steven Rostedt (VMware) acceb72e90 ftrace: Fix removing of second function probe
When two function probes are added to set_ftrace_filter, and then one of
them is removed, the update to the function locations is not performed, and
the record keeping of the function states are corrupted, and causes an
ftrace_bug() to occur.

This is easily reproducable by adding two probes, removing one, and then
adding it back again.

 # cd /sys/kernel/debug/tracing
 # echo schedule:traceoff > set_ftrace_filter
 # echo do_IRQ:traceoff > set_ftrace_filter
 # echo \!do_IRQ:traceoff > /debug/tracing/set_ftrace_filter
 # echo do_IRQ:traceoff > set_ftrace_filter

Causes:
 ------------[ cut here ]------------
 WARNING: CPU: 2 PID: 1098 at kernel/trace/ftrace.c:2369 ftrace_get_addr_curr+0x143/0x220
 Modules linked in: [...]
 CPU: 2 PID: 1098 Comm: bash Not tainted 4.10.0-test+ #405
 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
 Call Trace:
  dump_stack+0x68/0x9f
  __warn+0x111/0x130
  ? trace_irq_work_interrupt+0xa0/0xa0
  warn_slowpath_null+0x1d/0x20
  ftrace_get_addr_curr+0x143/0x220
  ? __fentry__+0x10/0x10
  ftrace_replace_code+0xe3/0x4f0
  ? ftrace_int3_handler+0x90/0x90
  ? printk+0x99/0xb5
  ? 0xffffffff81000000
  ftrace_modify_all_code+0x97/0x110
  arch_ftrace_update_code+0x10/0x20
  ftrace_run_update_code+0x1c/0x60
  ftrace_run_modify_code.isra.48.constprop.62+0x8e/0xd0
  register_ftrace_function_probe+0x4b6/0x590
  ? ftrace_startup+0x310/0x310
  ? debug_lockdep_rcu_enabled.part.4+0x1a/0x30
  ? update_stack_state+0x88/0x110
  ? ftrace_regex_write.isra.43.part.44+0x1d3/0x320
  ? preempt_count_sub+0x18/0xd0
  ? mutex_lock_nested+0x104/0x800
  ? ftrace_regex_write.isra.43.part.44+0x1d3/0x320
  ? __unwind_start+0x1c0/0x1c0
  ? _mutex_lock_nest_lock+0x800/0x800
  ftrace_trace_probe_callback.isra.3+0xc0/0x130
  ? func_set_flag+0xe0/0xe0
  ? __lock_acquire+0x642/0x1790
  ? __might_fault+0x1e/0x20
  ? trace_get_user+0x398/0x470
  ? strcmp+0x35/0x60
  ftrace_trace_onoff_callback+0x48/0x70
  ftrace_regex_write.isra.43.part.44+0x251/0x320
  ? match_records+0x420/0x420
  ftrace_filter_write+0x2b/0x30
  __vfs_write+0xd7/0x330
  ? do_loop_readv_writev+0x120/0x120
  ? locks_remove_posix+0x90/0x2f0
  ? do_lock_file_wait+0x160/0x160
  ? __lock_is_held+0x93/0x100
  ? rcu_read_lock_sched_held+0x5c/0xb0
  ? preempt_count_sub+0x18/0xd0
  ? __sb_start_write+0x10a/0x230
  ? vfs_write+0x222/0x240
  vfs_write+0xef/0x240
  SyS_write+0xab/0x130
  ? SyS_read+0x130/0x130
  ? trace_hardirqs_on_caller+0x182/0x280
  ? trace_hardirqs_on_thunk+0x1a/0x1c
  entry_SYSCALL_64_fastpath+0x18/0xad
 RIP: 0033:0x7fe61c157c30
 RSP: 002b:00007ffe87890258 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
 RAX: ffffffffffffffda RBX: ffffffff8114a410 RCX: 00007fe61c157c30
 RDX: 0000000000000010 RSI: 000055814798f5e0 RDI: 0000000000000001
 RBP: ffff8800c9027f98 R08: 00007fe61c422740 R09: 00007fe61ca53700
 R10: 0000000000000073 R11: 0000000000000246 R12: 0000558147a36400
 R13: 00007ffe8788f160 R14: 0000000000000024 R15: 00007ffe8788f15c
  ? trace_hardirqs_off_caller+0xc0/0x110
 ---[ end trace 99fa09b3d9869c2c ]---
 Bad trampoline accounting at: ffffffff81cc3b00 (do_IRQ+0x0/0x150)

Cc: stable@vger.kernel.org
Fixes: 59df055f19 ("ftrace: trace different functions with a different tracer")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-15 17:04:37 -04:00
Darren Hart (VMware) 38fcd06e9b futex: Clarify mark_wake_futex memory barrier usage
Clarify the scenario described in mark_wake_futex requiring the
smp_store_release(). Update the comment to explicitly refer to the
plist_del now under __unqueue_futex() (previously plist_del was in the
same function as the comment).

Signed-off-by: Darren Hart (VMware) <dvhart@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170414223138.GA4222@fury
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-15 16:03:46 +02:00
Hans de Goede 382bd4de61 genirq: Use irqd_get_trigger_type to compare the trigger type for shared IRQs
When requesting a shared irq with IRQF_TRIGGER_NONE then the irqaction
flags get filled with the trigger type from the irq_data:

        if (!(new->flags & IRQF_TRIGGER_MASK))
                new->flags |= irqd_get_trigger_type(&desc->irq_data);

On the first setup_irq() the trigger type in irq_data is NONE when the
above code executes, then the irq is started up for the first time and
then the actual trigger type gets established, but that's too late to fix
up new->flags.

When then a second user of the irq requests the irq with IRQF_TRIGGER_NONE
its irqaction's triggertype gets set to the actual trigger type and the
following check fails:

        if (!((old->flags ^ new->flags) & IRQF_TRIGGER_MASK))

Resulting in the request_irq failing with -EBUSY even though both
users requested the irq with IRQF_SHARED | IRQF_TRIGGER_NONE

Fix this by comparing the new irqaction's trigger type to the trigger type
stored in the irq_data which correctly reflects the actual trigger type
being used for the irq.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Link: http://lkml.kernel.org/r/20170415100831.17073-1-hdegoede@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-15 15:42:43 +02:00
Thomas Gleixner d1b25279c1 irqchip updates for v4.12
- Unify gemini and moxa irqchips under the faraday banner
 - Extend mtk-sysirq to deal with multiple MMIO regions
 - ACPI/IORT support for GICv3 ITS platform MSI
 - ACPI support for mbigen
 - Add mtk-cirq wakeup interrupt controller driver
 - Atmel aic5 suspend support
 - Allow GPCv2 to be probed both as an irqchip and a device
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCAAzFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAljyDtkVHG1hcmMuenlu
 Z2llckBhcm0uY29tAAoJECPQ0LrRPXpDqHEQAMgpZf+hrhuSDsJbs776RY0YrBlz
 nn3PRxlnJOu4hhDGKrjVokYUujTgA9MNUPUQn5Oj5CYKy+LCSo/jofr0c4Ko3xYx
 w8fLEp5X3Kvbhb9dzpFVJtgd8ldkKn1+U+bxpaagp8p+MGBMQo2IcAoAbKXTQMKn
 KzZkbN0EtXSU5pTxAtT9yWmWyL0av4rEXKIVx3Jl0vX/kCH47+Utu6aouDjyPbow
 bN8Vcq+BYe1WUltzNBqcde8KU4zfhtpneiNsZFxyqJRP/FW2d8V5/NMNa97b2T66
 s8bfYjAESmXM0vaz+Mi+ayO0qggTMjw/3dZxBeO2w9cdmN3eMm1TAwsHox5dKVSL
 HXptPM1PXEtJWhb2nO2hkfpveunAqPw6Khmi/U0+s//rtEJjaKuM/W8VtGwi9KnE
 VM6TEPPHM2u3gd/OdyuSC5cFwi6SrI3t78W4jzJ2R/kFS5MnMn2+5LTYKMtewkt0
 iafWE2Gy8GBybzDyp+k5COR1BRSfsjXheCMgkmgaPy1hZApK2A45CcAPzjAE8UPZ
 VwxOI+NXR7qORwqwecRUqGycv/1rZzBTLl/ygiMRM1MiLwtd1wbLpro8y+7tCOrO
 FMP5zGnfi/0raL8Uhc/riOGH1uM6LVblV+zGuTGOsia5dvk7Ga28AjVk6xzkTJjd
 cAmvuqnKDypRrCjp
 =kWx7
 -----END PGP SIGNATURE-----

Merge tag 'irqchip-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core

Pull irqchip updates from Marc Zyngier

 - Unify gemini and moxa irqchips under the faraday banner
 - Extend mtk-sysirq to deal with multiple MMIO regions
 - ACPI/IORT support for GICv3 ITS platform MSI
 - ACPI support for mbigen
 - Add mtk-cirq wakeup interrupt controller driver
 - Atmel aic5 suspend support
 - Allow GPCv2 to be probed both as an irqchip and a device
2017-04-15 15:31:11 +02:00
Thomas Gleixner 0e8d6a9336 workqueue: Provide work_on_cpu_safe()
work_on_cpu() is not protected against CPU hotplug. For code which requires
to be either executed on an online CPU or to fail if the CPU is not
available the callsite would have to protect against CPU hotplug.

Provide a function which does get/put_online_cpus() around the call to
work_on_cpu() and fails the call with -ENODEV if the target CPU is not
online.

Preparatory patch to convert several racy task affinity manipulations.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Len Brown <lenb@kernel.org>
Link: http://lkml.kernel.org/r/20170412201042.262610721@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-15 12:20:53 +02:00
Linus Torvalds 7e703eccf0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Things seem to be settling down as far as networking is concerned,
  let's hope this trend continues...

   1) Add iov_iter_revert() and use it to fix the behavior of
      skb_copy_datagram_msg() et al., from Al Viro.

   2) Fix the protocol used in the synthetic SKB we cons up for the
      purposes of doing a simulated route lookup for RTM_GETROUTE
      requests. From Florian Larysch.

   3) Don't add noop_qdisc to the per-device qdisc hashes, from Cong
      Wang.

   4) Don't call netdev_change_features with the team lock held, from
      Xin Long.

   5) Revert TCP F-RTO extension to catch more spurious timeouts because
      it interacts very badly with some middle-boxes. From Yuchung
      Cheng.

   6) Fix the loss of error values in l2tp {s,g}etsockopt calls, from
      Guillaume Nault.

   7) ctnetlink uses bit positions where it should be using bit masks,
      fix from Liping Zhang.

   8) Missing RCU locking in netfilter helper code, from Gao Feng.

   9) Avoid double frees and use-after-frees in tcp_disconnect(), from
      Eric Dumazet.

  10) Don't do a changelink before we register the netdevice in
      bridging, from Ido Schimmel.

  11) Lock the ipv6 device address list properly, from Rabin Vincent"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (29 commits)
  netfilter: ipt_CLUSTERIP: Fix wrong conntrack netns refcnt usage
  netfilter: nft_hash: do not dump the auto generated seed
  drivers: net: usb: qmi_wwan: add QMI_QUIRK_SET_DTR for Telit PID 0x1201
  ipv6: Fix idev->addr_list corruption
  net: xdp: don't export dev_change_xdp_fd()
  bridge: netlink: register netdevice before executing changelink
  bridge: implement missing ndo_uninit()
  bpf: reference may_access_skb() from __bpf_prog_run()
  tcp: clear saved_syn in tcp_disconnect()
  netfilter: nf_ct_expect: use proper RCU list traversal/update APIs
  netfilter: ctnetlink: skip dumping expect when nfct_help(ct) is NULL
  netfilter: make it safer during the inet6_dev->addr_list traversal
  netfilter: ctnetlink: make it safer when checking the ct helper name
  netfilter: helper: Add the rcu lock when call __nf_conntrack_helper_find
  netfilter: ctnetlink: using bit to represent the ct event
  netfilter: xt_TCPMSS: add more sanity tests on tcph->doff
  net: tcp: Increase TCP_MIB_OUTRSTS even though fail to alloc skb
  l2tp: don't mask errors in pppol2tp_getsockopt()
  l2tp: don't mask errors in pppol2tp_setsockopt()
  tcp: restrict F-RTO to work-around broken middle-boxes
  ...
2017-04-14 17:38:24 -07:00
Linus Torvalds d295917a47 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
 "The irq department provides:

   - two fixes for the CPU affinity spread infrastructure to prevent
     unbalanced spreading in corner cases which leads to horrible
     performance, because interrupts are rather aggregated than spread

   - add a missing spinlock initializer in the imx-gpcv2 init code"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/irq-imx-gpcv2: Fix spinlock initialization
  irq/affinity: Fix extra vecs calculation
  irq/affinity: Fix CPU spread for unbalanced nodes
2017-04-14 16:57:14 -07:00
Steven Rostedt (VMware) 82cc4fc2e7 ftrace: Fix removing of second function probe
When two function probes are added to set_ftrace_filter, and then one of
them is removed, the update to the function locations is not performed, and
the record keeping of the function states are corrupted, and causes an
ftrace_bug() to occur.

This is easily reproducable by adding two probes, removing one, and then
adding it back again.

 # cd /sys/kernel/debug/tracing
 # echo schedule:traceoff > set_ftrace_filter
 # echo do_IRQ:traceoff > set_ftrace_filter
 # echo \!do_IRQ:traceoff > /debug/tracing/set_ftrace_filter
 # echo do_IRQ:traceoff > set_ftrace_filter

Causes:
 ------------[ cut here ]------------
 WARNING: CPU: 2 PID: 1098 at kernel/trace/ftrace.c:2369 ftrace_get_addr_curr+0x143/0x220
 Modules linked in: [...]
 CPU: 2 PID: 1098 Comm: bash Not tainted 4.10.0-test+ #405
 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
 Call Trace:
  dump_stack+0x68/0x9f
  __warn+0x111/0x130
  ? trace_irq_work_interrupt+0xa0/0xa0
  warn_slowpath_null+0x1d/0x20
  ftrace_get_addr_curr+0x143/0x220
  ? __fentry__+0x10/0x10
  ftrace_replace_code+0xe3/0x4f0
  ? ftrace_int3_handler+0x90/0x90
  ? printk+0x99/0xb5
  ? 0xffffffff81000000
  ftrace_modify_all_code+0x97/0x110
  arch_ftrace_update_code+0x10/0x20
  ftrace_run_update_code+0x1c/0x60
  ftrace_run_modify_code.isra.48.constprop.62+0x8e/0xd0
  register_ftrace_function_probe+0x4b6/0x590
  ? ftrace_startup+0x310/0x310
  ? debug_lockdep_rcu_enabled.part.4+0x1a/0x30
  ? update_stack_state+0x88/0x110
  ? ftrace_regex_write.isra.43.part.44+0x1d3/0x320
  ? preempt_count_sub+0x18/0xd0
  ? mutex_lock_nested+0x104/0x800
  ? ftrace_regex_write.isra.43.part.44+0x1d3/0x320
  ? __unwind_start+0x1c0/0x1c0
  ? _mutex_lock_nest_lock+0x800/0x800
  ftrace_trace_probe_callback.isra.3+0xc0/0x130
  ? func_set_flag+0xe0/0xe0
  ? __lock_acquire+0x642/0x1790
  ? __might_fault+0x1e/0x20
  ? trace_get_user+0x398/0x470
  ? strcmp+0x35/0x60
  ftrace_trace_onoff_callback+0x48/0x70
  ftrace_regex_write.isra.43.part.44+0x251/0x320
  ? match_records+0x420/0x420
  ftrace_filter_write+0x2b/0x30
  __vfs_write+0xd7/0x330
  ? do_loop_readv_writev+0x120/0x120
  ? locks_remove_posix+0x90/0x2f0
  ? do_lock_file_wait+0x160/0x160
  ? __lock_is_held+0x93/0x100
  ? rcu_read_lock_sched_held+0x5c/0xb0
  ? preempt_count_sub+0x18/0xd0
  ? __sb_start_write+0x10a/0x230
  ? vfs_write+0x222/0x240
  vfs_write+0xef/0x240
  SyS_write+0xab/0x130
  ? SyS_read+0x130/0x130
  ? trace_hardirqs_on_caller+0x182/0x280
  ? trace_hardirqs_on_thunk+0x1a/0x1c
  entry_SYSCALL_64_fastpath+0x18/0xad
 RIP: 0033:0x7fe61c157c30
 RSP: 002b:00007ffe87890258 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
 RAX: ffffffffffffffda RBX: ffffffff8114a410 RCX: 00007fe61c157c30
 RDX: 0000000000000010 RSI: 000055814798f5e0 RDI: 0000000000000001
 RBP: ffff8800c9027f98 R08: 00007fe61c422740 R09: 00007fe61ca53700
 R10: 0000000000000073 R11: 0000000000000246 R12: 0000558147a36400
 R13: 00007ffe8788f160 R14: 0000000000000024 R15: 00007ffe8788f15c
  ? trace_hardirqs_off_caller+0xc0/0x110
 ---[ end trace 99fa09b3d9869c2c ]---
 Bad trampoline accounting at: ffffffff81cc3b00 (do_IRQ+0x0/0x150)

Cc: stable@vger.kernel.org
Fixes: 59df055f19 ("ftrace: trace different functions with a different tracer")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-14 17:54:22 -04:00
Deepa Dinamani ad19638463 time: Change k_clock nsleep() to use timespec64
struct timespec is not y2038 safe on 32 bit machines.  Replace uses of
struct timespec with struct timespec64 in the kernel.

The syscall interfaces themselves will be changed in a separate series.

Note that the restart_block parameter for nanosleep has also been left
unchanged and will be part of syscall series noted above.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: y2038@lists.linaro.org
Cc: john.stultz@linaro.org
Cc: arnd@arndb.de
Link: http://lkml.kernel.org/r/1490555058-4603-8-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-14 21:49:56 +02:00
Deepa Dinamani 5f252b3256 time: Change k_clock timer_set() and timer_get() to use timespec64
struct timespec is not y2038 safe on 32 bit machines.  Replace uses of
struct timespec with struct timespec64 in the kernel.

struct itimerspec internally uses struct timespec.  Use struct itimerspec64
which uses struct timespec64.

The syscall interfaces themselves will be changed in a separate series.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: y2038@lists.linaro.org
Cc: john.stultz@linaro.org
Cc: arnd@arndb.de
Link: http://lkml.kernel.org/r/1490555058-4603-7-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-14 21:49:56 +02:00
Deepa Dinamani 0fe6afe383 time: Change k_clock clock_set() to use timespec64
struct timespec is not y2038 safe on 32 bit machines.  Replace uses of
struct timespec with struct timespec64 in the kernel.

The syscall interfaces themselves will be changed in a separate series.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: y2038@lists.linaro.org
Cc: john.stultz@linaro.org
Cc: arnd@arndb.de
Link: http://lkml.kernel.org/r/1490555058-4603-6-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-14 21:49:56 +02:00
Deepa Dinamani d2e3e0ca5d time: Change k_clock clock_getres() to use timespec64
struct timespec is not y2038 safe on 32 bit machines.  Replace uses of
struct timespec with struct timespec64 in the kernel. The syscall
interfaces themselves will be changed in a separate series.

The clock_getres() interface has also been changed to use timespec64 even
though this particular interface is not affected by the y2038 problem. This
helps verification for internal kernel code for y2038 readiness by getting
rid of time_t/ timeval/ timespec completely.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: y2038@lists.linaro.org
Cc: john.stultz@linaro.org
Cc: arnd@arndb.de
Link: http://lkml.kernel.org/r/1490555058-4603-5-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-14 21:49:55 +02:00
Deepa Dinamani 3c9c12f4b4 time: Change k_clock clock_get() to use timespec64
struct timespec is not y2038 safe on 32 bit machines.  Replace uses of
struct timespec with struct timespec64 in the kernel.

The syscall interfaces themselves will be changed in a separate series.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: y2038@lists.linaro.org
Cc: john.stultz@linaro.org
Cc: arnd@arndb.de
Link: http://lkml.kernel.org/r/1490555058-4603-4-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-14 21:49:55 +02:00
Deepa Dinamani d340266e19 time: Change posix clocks ops interfaces to use timespec64
struct timespec is not y2038 safe on 32 bit machines.

The posix clocks apis use struct timespec directly and through struct
itimerspec.

Replace the posix clock interfaces to use struct timespec64 and struct
itimerspec64 instead.  Also fix up their implementations accordingly.

Note that the clock_getres() interface has also been changed to use
timespec64 even though this particular interface is not affected by the
y2038 problem. This helps verification for internal kernel code for y2038
readiness by getting rid of time_t/ timeval/ timespec.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: arnd@arndb.de
Cc: y2038@lists.linaro.org
Cc: netdev@vger.kernel.org
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: john.stultz@linaro.org
Link: http://lkml.kernel.org/r/1490555058-4603-3-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-14 21:49:54 +02:00
Deepa Dinamani 2ac00f17b2 time: Delete do_sys_setimeofday()
struct timespec is not y2038 safe on 32 bit machines and needs to be
replaced with struct timespec64.

do_sys_timeofday() is just a wrapper function.  Replace all calls to this
function with direct calls to do_sys_timeofday64() instead and delete
do_sys_timeofday().

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: y2038@lists.linaro.org
Cc: john.stultz@linaro.org
Cc: arnd@arndb.de
Cc: linux-alpha@vger.kernel.org
Link: http://lkml.kernel.org/r/1490555058-4603-2-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-14 21:49:54 +02:00
Matthias Kaehlcke d170fe7dd9 genirq: Use cpumask_available() for check of cpumask variable
This fixes the following clang warning when CONFIG_CPUMASK_OFFSTACK=n:

kernel/irq/manage.c:839:28: error: address of array
'desc->irq_common_data.affinity' will always evaluate to 'true'
[-Werror,-Wpointer-bool-conversion]

Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Cc: Grant Grundler <grundler@chromium.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg Hackmann <ghackmann@google.com>
Cc: Michael Davidson <md@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20170412182030.83657-2-mka@chromium.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-14 20:49:27 +02:00
Peter Zijlstra 94ffac5d84 futex: Fix small (and harmless looking) inconsistencies
During (post-commit) review Darren spotted a few minor things. One
(harmless AFAICT) type inconsistency and a comment that wasn't as
clear as hoped.

Reported-by: Darren Hart (VMWare) <dvhart@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Darren Hart (VMware) <dvhart@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-14 10:30:07 +02:00