Pull timer updates from Thomas Gleixner:
"A rather largish update for everything time and timer related:
- Cache footprint optimizations for both hrtimers and timer wheel
- Lower the NOHZ impact on systems which have NOHZ or timer migration
disabled at runtime.
- Optimize run time overhead of hrtimer interrupt by making the clock
offset updates smarter
- hrtimer cleanups and removal of restrictions to tackle some
problems in sched/perf
- Some more leap second tweaks
- Another round of changes addressing the 2038 problem
- First step to change the internals of clock event devices by
introducing the necessary infrastructure
- Allow constant folding for usecs/msecs_to_jiffies()
- The usual pile of clockevent/clocksource driver updates
The hrtimer changes contain updates to sched, perf and x86 as they
depend on them plus changes all over the tree to cleanup API changes
and redundant code, which got copied all over the place. The y2038
changes touch s390 to remove the last non 2038 safe code related to
boot/persistant clock"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
clocksource: Increase dependencies of timer-stm32 to limit build wreckage
timer: Minimize nohz off overhead
timer: Reduce timer migration overhead if disabled
timer: Stats: Simplify the flags handling
timer: Replace timer base by a cpu index
timer: Use hlist for the timer wheel hash buckets
timer: Remove FIFO "guarantee"
timers: Sanitize catchup_timer_jiffies() usage
hrtimer: Allow hrtimer::function() to free the timer
seqcount: Introduce raw_write_seqcount_barrier()
seqcount: Rename write_seqcount_barrier()
hrtimer: Fix hrtimer_is_queued() hole
hrtimer: Remove HRTIMER_STATE_MIGRATE
selftest: Timers: Avoid signal deadlock in leap-a-day
timekeeping: Copy the shadow-timekeeper over the real timekeeper last
clockevents: Check state instead of mode in suspend/resume path
selftests: timers: Add leap-second timer edge testing to leap-a-day.c
ntp: Do leapsecond adjustment in adjtimex read path
time: Prevent early expiry of hrtimers[CLOCK_REALTIME] at the leap second edge
ntp: Introduce and use SECS_PER_DAY macro instead of 86400
...
Pull scheduler updates from Ingo Molnar:
"The main changes are:
- lockless wakeup support for futexes and IPC message queues
(Davidlohr Bueso, Peter Zijlstra)
- Replace spinlocks with atomics in thread_group_cputimer(), to
improve scalability (Jason Low)
- NUMA balancing improvements (Rik van Riel)
- SCHED_DEADLINE improvements (Wanpeng Li)
- clean up and reorganize preemption helpers (Frederic Weisbecker)
- decouple page fault disabling machinery from the preemption
counter, to improve debuggability and robustness (David
Hildenbrand)
- SCHED_DEADLINE documentation updates (Luca Abeni)
- topology CPU masks cleanups (Bartosz Golaszewski)
- /proc/sched_debug improvements (Srikar Dronamraju)"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (79 commits)
sched/deadline: Remove needless parameter in dl_runtime_exceeded()
sched: Remove superfluous resetting of the p->dl_throttled flag
sched/deadline: Drop duplicate init_sched_dl_class() declaration
sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target
sched/deadline: Make init_sched_dl_class() __init
sched/deadline: Optimize pull_dl_task()
sched/preempt: Add static_key() to preempt_notifiers
sched/preempt: Fix preempt notifiers documentation about hlist_del() within unsafe iteration
sched/stop_machine: Fix deadlock between multiple stop_two_cpus()
sched/debug: Add sum_sleep_runtime to /proc/<pid>/sched
sched/debug: Replace vruntime with wait_sum in /proc/sched_debug
sched/debug: Properly format runnable tasks in /proc/sched_debug
sched/numa: Only consider less busy nodes as numa balancing destinations
Revert 095bebf61a ("sched/numa: Do not move past the balance point if unbalanced")
sched/fair: Prevent throttling in early pick_next_task_fair()
preempt: Reorganize the notrace definitions a bit
preempt: Use preempt_schedule_context() as the official tracing preemption point
sched: Make preempt_schedule_context() function-tracing safe
x86: Remove cpu_sibling_mask() and cpu_core_mask()
x86: Replace cpu_**_mask() with topology_**_cpumask()
...
Instead of storing a pointer to the per cpu tvec_base we can simply
cache a CPU index in the timer_list and use that to get hold of the
correct per cpu tvec_base. This is only used in lock_timer_base() and
the slightly larger code is peanuts versus the spinlock operation and
the d-cache foot print of the timer wheel.
Aside of that this allows to get rid of following nuisances:
- boot_tvec_base
That statically allocated 4k bss data is just kept around so the
timer has a home when it gets statically initialized. It serves no
other purpose.
With the CPU index we assign the timer to CPU0 at static
initialization time and therefor can avoid the whole boot_tvec_base
dance. That also simplifies the init code, which just can use the
per cpu base.
Before:
text data bss dec hex filename
17491 9201 4160 30852 7884 ../build/kernel/time/timer.o
After:
text data bss dec hex filename
17440 9193 0 26633 6809 ../build/kernel/time/timer.o
- Overloading the base pointer with various flags
The CPU index has enough space to hold the flags (deferrable,
irqsafe) so we can get rid of the extra masking and bit fiddling
with the base pointer.
As a benefit we reduce the size of struct timer_list on 64 bit
machines. 4 - 8 bytes, a size reduction up to 15% per struct timer_list,
which is a real win as we have tons of them embedded in other structs.
This changes also the newly added deferrable printout of the timer
start trace point to capture and print all timer->flags, which allows
us to decode the target cpu of the timer as well.
We might have used bitfields for this, but that would change the
static initializers and the init function for no value to accomodate
big endian bitfields.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Joonwoo Park <joonwoop@codeaurora.org>
Cc: Wenbo Wang <wenbo.wang@memblaze.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Badhri Jagan Sridharan <Badhri@google.com>
Link: http://lkml.kernel.org/r/20150526224511.950084301@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull block layer fixes from Jens Axboe:
"Sending this off now, as I'm not aware of other current bugs, nor do I
expect further fixes before 4.1 final. This contains two fixes:
- a fix for a bdi unregister warning that gets spewed on md, due to a
regression introduced earlier in this cycle. From Neil Brown.
- a fix for a compile warning for NVMe on 32-bit platforms, also a
regression introduced in this cycle. From Arnd Bergmann"
* 'for-linus' of git://git.kernel.dk/linux-block:
NVMe: fix type warning on 32-bit
block: discard bdi_unregister() in favour of bdi_destroy()
Since tracepoints use RCU for protection, they must not be called on
offline cpus. trace_mm_page_pcpu_drain can be called on an offline cpu
in this scenario caught by LOCKDEP:
===============================
[ INFO: suspicious RCU usage. ]
4.1.0-rc1+ #9 Not tainted
-------------------------------
include/trace/events/kmem.h:265 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
RCU used illegally from offline CPU!
rcu_scheduler_active = 1, debug_locks = 1
1 lock held by swapper/5/0:
#0: (&(&zone->lock)->rlock){..-...}, at: [<c0000000002073b0>] .free_pcppages_bulk+0x70/0x920
stack backtrace:
CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.1.0-rc1+ #9
Call Trace:
.dump_stack+0x98/0xd4 (unreliable)
.lockdep_rcu_suspicious+0x108/0x170
.free_pcppages_bulk+0x60c/0x920
.free_hot_cold_page+0x208/0x280
.destroy_context+0x90/0xd0
.__mmdrop+0x58/0x160
.idle_task_exit+0xf0/0x100
.pnv_smp_cpu_kill_self+0x58/0x2c0
.cpu_die+0x34/0x50
.arch_cpu_idle_dead+0x20/0x40
.cpu_startup_entry+0x708/0x7a0
.start_secondary+0x36c/0x3a0
start_secondary_prolog+0x10/0x14
Fix this by converting mm_page_pcpu_drain trace point into
TRACE_EVENT_CONDITION where condition is cpu_online(smp_processor_id())
Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since tracepoints use RCU for protection, they must not be called on
offline cpus. trace_mm_page_free can be called on an offline cpu in this
scenario caught by LOCKDEP:
===============================
[ INFO: suspicious RCU usage. ]
4.1.0-rc1+ #9 Not tainted
-------------------------------
include/trace/events/kmem.h:170 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
RCU used illegally from offline CPU!
rcu_scheduler_active = 1, debug_locks = 1
no locks held by swapper/1/0.
stack backtrace:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.1.0-rc1+ #9
Call Trace:
.dump_stack+0x98/0xd4 (unreliable)
.lockdep_rcu_suspicious+0x108/0x170
.free_pages_prepare+0x494/0x680
.free_hot_cold_page+0x50/0x280
.destroy_context+0x90/0xd0
.__mmdrop+0x58/0x160
.idle_task_exit+0xf0/0x100
.pnv_smp_cpu_kill_self+0x58/0x2c0
.cpu_die+0x34/0x50
.arch_cpu_idle_dead+0x20/0x40
.cpu_startup_entry+0x708/0x7a0
.start_secondary+0x36c/0x3a0
start_secondary_prolog+0x10/0x14
Fix this by converting mm_page_free trace point into TRACE_EVENT_CONDITION
where condition is cpu_online(smp_processor_id())
Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since tracepoints use RCU for protection, they must not be called on
offline cpus. trace_kmem_cache_free can be called on an offline cpu in
this scenario caught by LOCKDEP:
===============================
[ INFO: suspicious RCU usage. ]
4.1.0-rc1+ #9 Not tainted
-------------------------------
include/trace/events/kmem.h:148 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
RCU used illegally from offline CPU!
rcu_scheduler_active = 1, debug_locks = 1
no locks held by swapper/1/0.
stack backtrace:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.1.0-rc1+ #9
Call Trace:
.dump_stack+0x98/0xd4 (unreliable)
.lockdep_rcu_suspicious+0x108/0x170
.kmem_cache_free+0x344/0x4b0
.__mmdrop+0x4c/0x160
.idle_task_exit+0xf0/0x100
.pnv_smp_cpu_kill_self+0x58/0x2c0
.cpu_die+0x34/0x50
.arch_cpu_idle_dead+0x20/0x40
.cpu_startup_entry+0x708/0x7a0
.start_secondary+0x36c/0x3a0
start_secondary_prolog+0x10/0x14
Fix this by converting kmem_cache_free trace point into
TRACE_EVENT_CONDITION where condition is cpu_online(smp_processor_id())
Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
bdi_unregister() now contains very little functionality.
It contains a "WARN_ON" if bdi->dev is NULL. This warning is of no
real consequence as bdi->dev isn't needed by anything else in the function,
and it triggers if
blk_cleanup_queue() -> bdi_destroy()
is called before bdi_unregister, which happens since
Commit: 6cd18e711d ("block: destroy bdi before blockdev is unregistered.")
So this isn't wanted.
It also calls bdi_set_min_ratio(). This needs to be called after
writes through the bdi have all been flushed, and before the bdi is destroyed.
Calling it early is better than calling it late as it frees up a global
resource.
Calling it immediately after bdi_wb_shutdown() in bdi_destroy()
perfectly fits these requirements.
So bdi_unregister() can be discarded with the important content moved to
bdi_destroy(), as can the
writeback_bdi_unregister
event which is already not used.
Reported-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org (v4.0)
Fixes: c4db59d31e ("fs: don't reassign dirty inodes to default_backing_dev_info")
Fixes: 6cd18e711d ("block: destroy bdi before blockdev is unregistered.")
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
The timer_start event now shows whether the timer is
deferrable in case of a low-res timer. The debug_activate
function now includes a deferrable flag while calling
the trace_timer_start event.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Badhri Jagan Sridharan <Badhri@google.com>
[jstultz: Fixed minor whitespace and grammer tweaks
pointed out by Ingo]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Currently people use TASK_INTERRUPTIBLE to idle kthreads and wait for
'work' because TASK_UNINTERRUPTIBLE contributes to the loadavg. Having
all idle kthreads contribute to the loadavg is somewhat silly.
Now mostly this works OK, because kthreads have all their signals
masked. However there's a few sites where this is causing problems and
TASK_UNINTERRUPTIBLE should be used, except for that loadavg issue.
This patch adds TASK_NOLOAD which, when combined with
TASK_UNINTERRUPTIBLE avoids the loadavg accounting.
As most of imagined usage sites are loops where a thread wants to
idle, waiting for work, a helper TASK_IDLE is introduced.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: NeilBrown <neilb@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull fourth vfs update from Al Viro:
"d_inode() annotations from David Howells (sat in for-next since before
the beginning of merge window) + four assorted fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
RCU pathwalk breakage when running into a symlink overmounting something
fix I_DIO_WAKEUP definition
direct-io: only inc/dec inode->i_dio_count for file systems
fs/9p: fix readdir()
VFS: assorted d_backing_inode() annotations
VFS: fs/inode.c helpers: d_inode() annotations
VFS: fs/cachefiles: d_backing_inode() annotations
VFS: fs library helpers: d_inode() annotations
VFS: assorted weird filesystems: d_inode() annotations
VFS: normal filesystems (and lustre): d_inode() annotations
VFS: security/: d_inode() annotations
VFS: security/: d_backing_inode() annotations
VFS: net/: d_inode() annotations
VFS: net/unix: d_backing_inode() annotations
VFS: kernel/: d_inode() annotations
VFS: audit: d_backing_inode() annotations
VFS: Fix up some ->d_inode accesses in the chelsio driver
VFS: Cachefiles should perform fs modifications on the top layer only
VFS: AF_UNIX sockets should call mknod on the top layer only
drivers and updates to existing ones for feature enhancements and bug
fixes. There is more churn than usual in the framework core due to the
change to introduce per-user unique struct clk pointers in 4.0. This
caused several regressions to surface, some of which were sent as fixes
to 4.0. New generic clock drivers were added for GPIO- and PWM-based
clock controllers. Additionally the common clk-divider code recieved
several fixes to the way it rounds rates.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJVNcIIAAoJEKI6nJvDJaTU3a8QAM+fjhDMY5xpI6VIbxZaA2aR
VUofw9/rdAtP1UdwtlSKBvCqpwwqt/U7zlMWU9v+UvTjYdHIf9SIDQoJnd+uEtwL
roz/kNeB7WOVyxwbTJ2B5fjvPSN+mq8Rm8ANDcL8ZOGxxtt2Mip1IWMAlx2XUnwG
tYZhB7EfKzLHZRblOdn2Q4U/4T+KXOFTSO+Gb9o2J0I2sJLI0NRXhcl9Fcoo8KVz
G0ACWa0F1WKsbqzBATnhtYiKkuC3BeiS2eMuTVTlkP+Gd6YQ2f1zWLeBfXEiPGZb
q0p/qTrUFLHbRoJMMuWaUfaBxb8PeUfM6yllxrzvRxPJU25pbj8OW/O5ZAe9xP8G
S17sQ2nhEoWZW9hqbuA39IcLGa6RjT+TD+z3kmXQ9ZvCVDN2Oqqb/4ZNViwAvQq7
t67EfV7hGXty3Q58tS4XE9hHfwY+9YqMDLNIS/ED+hP8rcxTmiLlAIyk+qbT3b0l
Q+375Ar7iCgihPPHYxeM5Qe1+Vsfh4NjR9thdAbT245MB3f90ULb+GNP/izUDOgA
c/Ot6pStVFEUxTol6RlcLb85PugzrkoBOF/8ZLySdMLhALjPwaFcQZ1sFdcKUKlE
tt7sZKQgbbCfqYGS9K264uUfWbdmZh05zhtkH0xUjyQpyIcnrYQsSIIEEnlbYnPp
0D55nooSGROKeud+gyrx
=2LMr
-----END PGP SIGNATURE-----
Merge tag 'clk-for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clock framework updates from Michael Turquette:
"The changes to the common clock framework for 4.0 are mostly new clock
drivers and updates to existing ones for feature enhancements and bug
fixes.
There is more churn than usual in the framework core due to the change
to introduce per-user unique struct clk pointers in 4.0. This caused
several regressions to surface, some of which were sent as fixes to
4.0. New generic clock drivers were added for GPIO- and PWM-based
clock controllers.
Additionally the common clk-divider code recieved several fixes to the
way it rounds rates"
* tag 'clk-for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: (91 commits)
clk: check ->determine/round_rate() return value in clk_calc_new_rates
clk: at91: usb: propagate rate modification to the parent clk
clk: samsung: exynos4: Disable ARMCLK down feature on Exynos4210 SoC
clk: don't use __initconst for non-const arrays
clk: at91: change to using endian agnositc IO
clk: clk-gpio-gate: Fix active low
clk: Add PWM clock driver
clk: Add clock driver for mb86s7x
clk: pxa: pxa3xx: add missing os timer clock
clk: tegra: Use the proper parent for plld_dsi
clk: tegra: Use generic tegra_osc_clk_init() on Tegra114
clk: tegra: Model oscillator as clock
clk: tegra: Add peripheral registers for bank Y
clk: tegra: Register the proper number of resets
clk: tegra: Remove needless initializations
clk: tegra: Use consistent indentation
clk: tegra: Various whitespace cleanups
clk: tegra: Enable HDA to HDMI clocks on Tegra124
clk: tegra: Fix a bunch of sparse warnings
clk: tegra: Fix typo tabel -> table
...
Pull perf updates from Ingo Molnar:
"This update has mostly fixes, but also other bits:
- perf tooling fixes
- PMU driver fixes
- Intel Broadwell PMU driver HW-enablement for LBR callstacks
- a late coming 'perf kmem' tool update that enables it to also
analyze page allocation data. Note, this comes with MM tracepoint
changes that we believe to not break anything: because it changes
the formerly opaque 'struct page *' field that uniquely identifies
pages to 'pfn' which identifies pages uniquely too, but isn't as
opaque and can be used for other purposes as well"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/pt: Fix and clean up error handling in pt_event_add()
perf/x86/intel: Add Broadwell support for the LBR callstack
perf/x86/intel/rapl: Fix energy counter measurements but supporing per domain energy units
perf/x86/intel: Fix Core2,Atom,NHM,WSM cycles:pp events
perf/x86: Fix hw_perf_event::flags collision
perf probe: Fix segfault when probe with lazy_line to file
perf probe: Find compilation directory path for lazy matching
perf probe: Set retprobe flag when probe in address-based alternative mode
perf kmem: Analyze page allocator events also
tracing, mm: Record pfn instead of pointer to struct page
Pull f2fs updates from Jaegeuk Kim:
"New features:
- in-memory extent_cache
- fs_shutdown to test power-off-recovery
- use inline_data to store symlink path
- show f2fs as a non-misc filesystem
Major fixes:
- avoid CPU stalls on sync_dirty_dir_inodes
- fix some power-off-recovery procedure
- fix handling of broken symlink correctly
- fix missing dot and dotdot made by sudden power cuts
- handle wrong data index during roll-forward recovery
- preallocate data blocks for direct_io
... and a bunch of minor bug fixes and cleanups"
* tag 'for-f2fs-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (71 commits)
f2fs: pass checkpoint reason on roll-forward recovery
f2fs: avoid abnormal behavior on broken symlink
f2fs: flush symlink path to avoid broken symlink after POR
f2fs: change 0 to false for bool type
f2fs: do not recover wrong data index
f2fs: do not increase link count during recovery
f2fs: assign parent's i_mode for empty dir
f2fs: add F2FS_INLINE_DOTS to recover missing dot dentries
f2fs: fix mismatching lock and unlock pages for roll-forward recovery
f2fs: fix sparse warnings
f2fs: limit b_size of mapped bh in f2fs_map_bh
f2fs: persist system.advise into on-disk inode
f2fs: avoid NULL pointer dereference in f2fs_xattr_advise_get
f2fs: preallocate fallocated blocks for direct IO
f2fs: enable inline data by default
f2fs: preserve extent info for extent cache
f2fs: initialize extent tree with on-disk extent info of inode
f2fs: introduce __{find,grab}_extent_tree
f2fs: split set_data_blkaddr from f2fs_update_extent_cache
f2fs: enable fast symlink by utilizing inline data
...
This patch adds CP_RECOVERY to remain recovery information for checkpoint.
And, it makes sure writing checkpoint in this case.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add trace events for cma_alloc() and cma_release().
The cma_alloc tracepoint is used both for successful and failed allocations,
in case of allocation failure pfn=-1UL is stored and printed.
Signed-off-by: Stefan Strogin <stefan.strogin@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mpn@google.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
Cc: Thierry Reding <treding@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
that's the bulk of filesystem drivers dealing with inodes of their own
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Merge first patchbomb from Andrew Morton:
- arch/sh updates
- ocfs2 updates
- kernel/watchdog feature
- about half of mm/
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (122 commits)
Documentation: update arch list in the 'memtest' entry
Kconfig: memtest: update number of test patterns up to 17
arm: add support for memtest
arm64: add support for memtest
memtest: use phys_addr_t for physical addresses
mm: move memtest under mm
mm, hugetlb: abort __get_user_pages if current has been oom killed
mm, mempool: do not allow atomic resizing
memcg: print cgroup information when system panics due to panic_on_oom
mm: numa: remove migrate_ratelimited
mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZE
mm: split ET_DYN ASLR from mmap ASLR
s390: redefine randomize_et_dyn for ELF_ET_DYN_BASE
mm: expose arch_mmap_rnd when available
s390: standardize mmap_rnd() usage
powerpc: standardize mmap_rnd() usage
mips: extract logic for mmap_rnd()
arm64: standardize mmap_rnd() usage
x86: standardize mmap_rnd() usage
arm: factor out mmap ASLR into mmap_rnd
...
We would want to use number of page table level to define mm_struct.
Let's expose it as CONFIG_PGTABLE_LEVELS.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
of the TRACE_DEFINE_ENUM() macro that can be used by tracepoints.
Tracepoints have helper functions for the TP_printk() called
__print_symbolic() and __print_flags() that lets a numeric number be
displayed as a a human comprehensible text. What is placed in the
TP_printk() is also shown in the tracepoint format file such that
user space tools like perf and trace-cmd can parse the binary data
and express the values too. Unfortunately, the way the TRACE_EVENT()
macro works, anything placed in the TP_printk() will be shown pretty
much exactly as is. The problem arises when enums are used. That's
because unlike macros, enums will not be changed into their values
by the C pre-processor. Thus, the enum string is exported to the
format file, and this makes it useless for user space tools.
The TRACE_DEFINE_ENUM() solves this by converting the enum strings
in the TP_printk() format into their number, and that is what is
shown to user space. For example, the tracepoint tlb_flush currently
has this in its format file:
__print_symbolic(REC->reason,
{ TLB_FLUSH_ON_TASK_SWITCH, "flush on task switch" },
{ TLB_REMOTE_SHOOTDOWN, "remote shootdown" },
{ TLB_LOCAL_SHOOTDOWN, "local shootdown" },
{ TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" })
After adding:
TRACE_DEFINE_ENUM(TLB_FLUSH_ON_TASK_SWITCH);
TRACE_DEFINE_ENUM(TLB_REMOTE_SHOOTDOWN);
TRACE_DEFINE_ENUM(TLB_LOCAL_SHOOTDOWN);
TRACE_DEFINE_ENUM(TLB_LOCAL_MM_SHOOTDOWN);
Its format file will contain this:
__print_symbolic(REC->reason,
{ 0, "flush on task switch" },
{ 1, "remote shootdown" },
{ 2, "local shootdown" },
{ 3, "local mm shootdown" })
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVLBTuAAoJEEjnJuOKh9ldjHMIALdRS755TXCZGOf0r7O2akOR
wMPeum7C+ae1mH+jCsJKUC0/jUfQKaMt/UxoHlipDgcGg8kD2jtGnGCw4Xlwvdsr
y4rFmcTRSl1mo0zDSsg6ujoupHlVYN0+JPjrd7S3cv/llJoY49zcanNLF7S2XLeM
dZCtWRLWYpBiWO68ai6AqJTnE/eGFIqBI048qb5Eg8dbK243SSeSIf9Ywhb+VsA+
aq6F7cWI/H6j4tbeza8tAN19dcwenDro5EfCDY8ARQHJu1f6Y3+DLf2imjkd6Aiu
JVAoGIjHIpI+djwCZC1u4gi4urjfOqYartrM3Q54tb3YWYqHeNqP2ASI2a4EpYk=
=Ixwt
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"Some clean ups and small fixes, but the biggest change is the addition
of the TRACE_DEFINE_ENUM() macro that can be used by tracepoints.
Tracepoints have helper functions for the TP_printk() called
__print_symbolic() and __print_flags() that lets a numeric number be
displayed as a a human comprehensible text. What is placed in the
TP_printk() is also shown in the tracepoint format file such that user
space tools like perf and trace-cmd can parse the binary data and
express the values too. Unfortunately, the way the TRACE_EVENT()
macro works, anything placed in the TP_printk() will be shown pretty
much exactly as is. The problem arises when enums are used. That's
because unlike macros, enums will not be changed into their values by
the C pre-processor. Thus, the enum string is exported to the format
file, and this makes it useless for user space tools.
The TRACE_DEFINE_ENUM() solves this by converting the enum strings in
the TP_printk() format into their number, and that is what is shown to
user space. For example, the tracepoint tlb_flush currently has this
in its format file:
__print_symbolic(REC->reason,
{ TLB_FLUSH_ON_TASK_SWITCH, "flush on task switch" },
{ TLB_REMOTE_SHOOTDOWN, "remote shootdown" },
{ TLB_LOCAL_SHOOTDOWN, "local shootdown" },
{ TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" })
After adding:
TRACE_DEFINE_ENUM(TLB_FLUSH_ON_TASK_SWITCH);
TRACE_DEFINE_ENUM(TLB_REMOTE_SHOOTDOWN);
TRACE_DEFINE_ENUM(TLB_LOCAL_SHOOTDOWN);
TRACE_DEFINE_ENUM(TLB_LOCAL_MM_SHOOTDOWN);
Its format file will contain this:
__print_symbolic(REC->reason,
{ 0, "flush on task switch" },
{ 1, "remote shootdown" },
{ 2, "local shootdown" },
{ 3, "local mm shootdown" })"
* tag 'trace-v4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (27 commits)
tracing: Add enum_map file to show enums that have been mapped
writeback: Export enums used by tracepoint to user space
v4l: Export enums used by tracepoints to user space
SUNRPC: Export enums in tracepoints to user space
mm: tracing: Export enums in tracepoints to user space
irq/tracing: Export enums in tracepoints to user space
f2fs: Export the enums in the tracepoints to userspace
net/9p/tracing: Export enums in tracepoints to userspace
x86/tlb/trace: Export enums in used by tlb_flush tracepoint
tracing/samples: Update the trace-event-sample.h with TRACE_DEFINE_ENUM()
tracing: Allow for modules to convert their enums to values
tracing: Add TRACE_DEFINE_ENUM() macro to map enums to their values
tracing: Update trace-event-sample with TRACE_SYSTEM_VAR documentation
tracing: Give system name a pointer
brcmsmac: Move each system tracepoints to their own header
iwlwifi: Move each system tracepoints to their own header
mac80211: Move message tracepoints to their own header
tracing: Add TRACE_SYSTEM_VAR to xhci-hcd
tracing: Add TRACE_SYSTEM_VAR to kvm-s390
tracing: Add TRACE_SYSTEM_VAR to intel-sst
...
Pull libata updates from Tejun Heo:
- Hannes's patchset implements support for better error reporting
introduced by the new ATA command spec.
- the deperecated pci_ dma API usages have been replaced by dma_ ones.
- a bunch of hardware specific updates and some cleanups.
* 'for-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
ata: remove deprecated use of pci api
ahci: st: st_configure_oob must be called after IP is clocked.
ahci: st: Update the ahci_st DT documentation
ahci: st: Update the DT example for how to obtain the PHY.
sata_dwc_460ex: indent an if statement
libata: Add tracepoints
libata-eh: Set 'information' field for autosense
libata: Implement support for sense data reporting
libata: Implement NCQ autosense
libata: use status bit definitions in ata_dump_status()
ide,ata: Rename ATA_IDX to ATA_SENSE
libata: whitespace fixes in ata_to_sense_error()
libata: whitespace cleanup in ata_get_cmd_descript()
libata: use READ_LOG_DMA_EXT
libata: remove ATA_FLAG_LOWTAG
sata_dwc_460ex: re-use hsdev->dev instead of dwc_dev
sata_dwc_460ex: move to generic DMA driver
sata_dwc_460ex: join messages back
sata: xgene: add ACPI support for APM X-Gene SATA ports
ata: sata_mv: add proper definitions for LP_PHY_CTL register values
The struct page is opaque for userspace tools, so it'd be better to save
pfn in order to identify page frames.
The textual output of $debugfs/tracing/trace file remains unchanged and
only raw (binary) data format is changed - but thanks to libtraceevent,
userspace tools which deal with the raw data (like perf and trace-cmd)
can parse the format easily. So impact on the userspace will also be
minimal.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Based-on-patch-by: Joonsoo Kim <js1304@gmail.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/1428298576-9785-3-git-send-email-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The enums used in tracepoints for __print_symbolic() do not have their
values shown in the tracepoint format files and this makes it difficult
for user space tools to convert the binary values to the strings they
are to represent.
Add TRACE_DEFINE_ENUM(x) macros to export the enum names to their values
to make the tracing output from user space tools more robust.
Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Enums used by tracepoints for __print_symbolic() are shown in the
tracepoint format files with just their names and not their values.
This makes it difficult for user space tools to know how to convert the
binary data into their string representations.
By adding the use of TRACE_DEFINE_ENUM(), the enum names will be mapped
to their values and shown in the tracing file system to let tools
convert the data as necessary.
Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org
Acked-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The enums used in the tracepoints for __print_symbolic() have their
names shown in the tracepoint format files. User space tools do not know
how to convert those names into their values to be able to convert the
binary data.
Use TRACE_DEFINE_ENUM() to export the enum names to their values for
userspace to do the parsing correctly.
Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The enums used in tracepoints with __print_symbolic() have their
names shown in the tracepoint format files and not their values.
This makes it difficult for user space tools to convert the binary
data to the strings as user space does not know what those enums
are about.
By having them use TRACE_DEFINE_ENUM(), the names of the enums will
be mapped to the values and shown to user space.
Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The enums used by the softirq mapping is what is shown in the output
of the __print_symbolic() and not their values, that are needed
to map them to their strings. Export them to userspace with the
TRACE_DEFINE_ENUM() macro so that user space tools can map the enums
with their values.
Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The tracepoints that use __print_symbolic() use enums as the value
to convert to strings. Unfortunately, the format files for these
tracepoints show the enum name and not their value. This causes some
userspace tools not to know how to convert __print_symbolic() to
their strings.
Add TRACE_DEFINE_ENUM() macros to export the enums used to userspace
to let those tools know what those enum values are.
Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Pankaj Kumar <pankaj.km@samsung.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The tracepoints in the 9p code use a lot of enums for the __print_symbolic()
function. These enums are shown in the tracepoint format files, and user
space tools such as trace-cmd does not have the information to parse it.
Add helper macros to export the enums with TRACE_DEFINE_ENUM().
Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Have the enums used in __print_symbolic() by the trace_tlb_flush()
tracepoint exported to userpace such that they can be parsed by
userspace tools.
Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org
Cc: Dave Hansen <dave@sr71.net>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Several tracepoints use the helper functions __print_symbolic() or
__print_flags() and pass in enums that do the mapping between the
binary data stored and the value to print. This works well for reading
the ASCII trace files, but when the data is read via userspace tools
such as perf and trace-cmd, the conversion of the binary value to a
human string format is lost if an enum is used, as userspace does not
have access to what the ENUM is.
For example, the tracepoint trace_tlb_flush() has:
__print_symbolic(REC->reason,
{ TLB_FLUSH_ON_TASK_SWITCH, "flush on task switch" },
{ TLB_REMOTE_SHOOTDOWN, "remote shootdown" },
{ TLB_LOCAL_SHOOTDOWN, "local shootdown" },
{ TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" })
Which maps the enum values to the strings they represent. But perf and
trace-cmd do no know what value TLB_LOCAL_MM_SHOOTDOWN is, and would
not be able to map it.
With TRACE_DEFINE_ENUM(), developers can place these in the event header
files and ftrace will convert the enums to their values:
By adding:
TRACE_DEFINE_ENUM(TLB_FLUSH_ON_TASK_SWITCH);
TRACE_DEFINE_ENUM(TLB_REMOTE_SHOOTDOWN);
TRACE_DEFINE_ENUM(TLB_LOCAL_SHOOTDOWN);
TRACE_DEFINE_ENUM(TLB_LOCAL_MM_SHOOTDOWN);
$ cat /sys/kernel/debug/tracing/events/tlb/tlb_flush/format
[...]
__print_symbolic(REC->reason,
{ 0, "flush on task switch" },
{ 1, "remote shootdown" },
{ 2, "local shootdown" },
{ 3, "local mm shootdown" })
The above is what userspace expects to see, and tools do not need to
be modified to parse them.
Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org
Cc: Guilherme Cox <cox@computer.org>
Cc: Tony Luck <tony.luck@gmail.com>
Cc: Xie XiuQi <xiexiuqi@huawei.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Normally the compiler will use the same pointer for a string throughout
the file. But there's no guarantee of that happening. Later changes will
require that all events have the same pointer to the system string.
Name the system string and have all events point to it.
Testing this, it did not increases the size of the text, except for the
notes section, which should not harm the real size any.
Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
New code will require TRACE_SYSTEM to be a valid C variable name,
but some tracepoints have TRACE_SYSTEM with '-' and not '_', so
it can not be used. Instead, add a TRACE_SYSTEM_VAR that can
give the tracing infrastructure a unique name for the trace system.
Link: http://lkml.kernel.org/r/20150402142831.GT6023@sirena.org.uk
Acked-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add some tracepoints for ata_qc_issue, ata_qc_complete, and
ata_eh_link_autopsy.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
The tracing events for regmap are confined to the regmap subsystem. It
also requires accessing an internal header. Instead of including the
internal header from a generic file location, move the tracing file
into the regmap directory.
Also rename the regmap tracing header to trace.h, as it is redundant to
keep the regmap.h name when it is in the regmap directory.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Mark Brown <broonie@kernel.org>
It's useful to have tracepoints around operations that change the
hardware state so that we can debug clock hardware performance
and operations. Four basic types of events are supported: on/off
events for enable, disable, prepare, unprepare that only record
an event and a clock name, rate changing events for
clk_set_{min_,max_}rate{_range}(), phase changing events for
clk_set_phase() and parent changing events for clk_set_parent().
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Michael Turquette <mturquette@linaro.org>
This patch adds trace for lookup/update/shrink/destroy ops in rb-tree extent cache.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull lazytime mount option support from Al Viro:
"Lazytime stuff from tytso"
* 'lazytime' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
ext4: add optimization for the lazytime mount option
vfs: add find_inode_nowait() function
vfs: add support for a lazytime mount option
Common: Optional support for adding a small amount of polling on each HLT
instruction executed in the guest (or equivalent for other architectures).
This can improve latency up to 50% on some scenarios (e.g. O_DSYNC writes
or TCP_RR netperf tests). This also has to be enabled manually for now,
but the plan is to auto-tune this in the future.
ARM/ARM64: the highlights are support for GICv3 emulation and dirty page
tracking
s390: several optimizations and bugfixes. Also a first: a feature
exposed by KVM (UUID and long guest name in /proc/sysinfo) before
it is available in IBM's hypervisor! :)
MIPS: Bugfixes.
x86: Support for PML (page modification logging, a new feature in
Broadwell Xeons that speeds up dirty page tracking), nested virtualization
improvements (nested APICv---a nice optimization), usual round of emulation
fixes. There is also a new option to reduce latency of the TSC deadline
timer in the guest; this needs to be tuned manually.
Some commits are common between this pull and Catalin's; I see you
have already included his tree.
ARM has other conflicts where functions are added in the same place
by 3.19-rc and 3.20 patches. These are not large though, and entirely
within KVM.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJU28rkAAoJEL/70l94x66DXqQH/1TDOfJIjW7P2kb0Sw7Fy1wi
cEX1KO/VFxAqc8R0E/0Wb55CXyPjQJM6xBXuFr5cUDaIjQ8ULSktL4pEwXyyv/s5
DBDkN65mriry2w5VuEaRLVcuX9Wy+tqLQXWNkEySfyb4uhZChWWHvKEcgw5SqCyg
NlpeHurYESIoNyov3jWqvBjr4OmaQENyv7t2c6q5ErIgG02V+iCux5QGbphM2IC9
LFtPKxoqhfeB2xFxTOIt8HJiXrZNwflsTejIlCl/NSEiDVLLxxHCxK2tWK/tUXMn
JfLD9ytXBWtNMwInvtFm4fPmDouv2VDyR0xnK2db+/axsJZnbxqjGu1um4Dqbak=
=7gdx
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM update from Paolo Bonzini:
"Fairly small update, but there are some interesting new features.
Common:
Optional support for adding a small amount of polling on each HLT
instruction executed in the guest (or equivalent for other
architectures). This can improve latency up to 50% on some
scenarios (e.g. O_DSYNC writes or TCP_RR netperf tests). This
also has to be enabled manually for now, but the plan is to
auto-tune this in the future.
ARM/ARM64:
The highlights are support for GICv3 emulation and dirty page
tracking
s390:
Several optimizations and bugfixes. Also a first: a feature
exposed by KVM (UUID and long guest name in /proc/sysinfo) before
it is available in IBM's hypervisor! :)
MIPS:
Bugfixes.
x86:
Support for PML (page modification logging, a new feature in
Broadwell Xeons that speeds up dirty page tracking), nested
virtualization improvements (nested APICv---a nice optimization),
usual round of emulation fixes.
There is also a new option to reduce latency of the TSC deadline
timer in the guest; this needs to be tuned manually.
Some commits are common between this pull and Catalin's; I see you
have already included his tree.
Powerpc:
Nothing yet.
The KVM/PPC changes will come in through the PPC maintainers,
because I haven't received them yet and I might end up being
offline for some part of next week"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (130 commits)
KVM: ia64: drop kvm.h from installed user headers
KVM: x86: fix build with !CONFIG_SMP
KVM: x86: emulate: correct page fault error code for NoWrite instructions
KVM: Disable compat ioctl for s390
KVM: s390: add cpu model support
KVM: s390: use facilities and cpu_id per KVM
KVM: s390/CPACF: Choose crypto control block format
s390/kernel: Update /proc/sysinfo file with Extended Name and UUID
KVM: s390: reenable LPP facility
KVM: s390: floating irqs: fix user triggerable endless loop
kvm: add halt_poll_ns module parameter
kvm: remove KVM_MMIO_SIZE
KVM: MIPS: Don't leak FPU/DSP to guest
KVM: MIPS: Disable HTW while in guest
KVM: nVMX: Enable nested posted interrupt processing
KVM: nVMX: Enable nested virtual interrupt delivery
KVM: nVMX: Enable nested apic register virtualization
KVM: nVMX: Make nested control MSRs per-cpu
KVM: nVMX: Enable nested virtualize x2apic mode
KVM: nVMX: Prepare for using hardware MSR bitmap
...
Pull f2fs updates from Jaegeuk Kim:
"Major changes are to:
- add f2fs_io_tracer and F2FS_IOC_GETVERSION
- fix wrong acl assignment from parent
- fix accessing wrong data blocks
- fix wrong condition check for f2fs_sync_fs
- align start block address for direct_io
- add and refactor the readahead flows of FS metadata
- refactor atomic and volatile write policies
But most of patches are for clean-ups and minor bug fixes. Some of
them refactor old code too"
* tag 'for-f2fs-3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (64 commits)
f2fs: use spinlock for segmap_lock instead of rwlock
f2fs: fix accessing wrong indexed data blocks
f2fs: avoid variable length array
f2fs: fix sparse warnings
f2fs: allocate data blocks in advance for f2fs_direct_IO
f2fs: introduce macros to convert bytes and blocks in f2fs
f2fs: call set_buffer_new for get_block
f2fs: check node page contents all the time
f2fs: avoid data offset overflow when lseeking huge file
f2fs: fix to use highmem for pages of newly created directory
f2fs: introduce a batched trim
f2fs: merge {invalidate,release}page for meta/node/data pages
f2fs: show the number of writeback pages in stat
f2fs: keep PagePrivate during releasepage
f2fs: should fail mount when trying to recover data on read-only dev
f2fs: split UMOUNT and FASTBOOT flags
f2fs: avoid write_checkpoint if f2fs is mounted readonly
f2fs: support norecovery mount option
f2fs: fix not to drop mount options when retrying fill_super
f2fs: merge flags in struct f2fs_sb_info
...
Pull backing device changes from Jens Axboe:
"This contains a cleanup of how the backing device is handled, in
preparation for a rework of the life time rules. In this part, the
most important change is to split the unrelated nommu mmap flags from
it, but also removing a backing_dev_info pointer from the
address_space (and inode), and a cleanup of other various minor bits.
Christoph did all the work here, I just fixed an oops with pages that
have a swap backing. Arnd fixed a missing export, and Oleg killed the
lustre backing_dev_info from staging. Last patch was from Al,
unexporting parts that are now no longer needed outside"
* 'for-3.20/bdi' of git://git.kernel.dk/linux-block:
Make super_blocks and sb_lock static
mtd: export new mtd_mmap_capabilities
fs: make inode_to_bdi() handle NULL inode
staging/lustre/llite: get rid of backing_dev_info
fs: remove default_backing_dev_info
fs: don't reassign dirty inodes to default_backing_dev_info
nfs: don't call bdi_unregister
ceph: remove call to bdi_unregister
fs: remove mapping->backing_dev_info
fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
nilfs2: set up s_bdi like the generic mount_bdev code
block_dev: get bdev inode bdi directly from the block device
block_dev: only write bdev inode on close
fs: introduce f_op->mmap_capabilities for nommu mmap support
fs: kill BDI_CAP_SWAP_BACKED
fs: deduplicate noop_backing_dev_info
This time with:
* Generic page-table framework for ARM IOMMUs using the LPAE page-table
format, ARM-SMMU and Renesas IPMMU make use of it already.
* Break out of the IO virtual address allocator from the Intel IOMMU so
that it can be used by other DMA-API implementations too. The first
user will be the ARM64 common DMA-API implementation for IOMMUs
* Device tree support for Renesas IPMMU
* Various fixes and cleanups all over the place
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJU3MJOAAoJECvwRC2XARrjopUP+wachFx8vb00M4hlnlwL6FCn
DyIFkA1n4wL0muPhjcBI+LViEXrSxjr2TYoJEaBg+fiByWWQ1Hefg+KPz331Lo1D
+uo7WiOa1AB3pfkQiUN9IN6xx+o6ivhb3UQPiL4FHjggB/qz+KVxMM9nx0j8o0fQ
D9q6HLFiOIsFkra3xZaSuDGvYUBpcwyfn8FP1HVfvLlg1uxIGDcUJX3qU5UBpj9q
al/lPZ4A7rp+JLApV6WyouPiyVOZKikb5x920KeRNBem7a9fNBdgf+x7QbKpNXa1
5MaT5MarwGe8lJE4wtjOqRtsllhia+A1rg/6JbROPrlGetRFiuIh2sCKLvwOCko/
IjBHSutpaRT1lFoAG0TAnXQlvHRG/58XxOlP3eF613X/p8/cezuUaTyTIwZam9X3
j2GWwbUcBiHTxlu7bQDPz6a7cTf4w6wEALzYl18QrAFv+2LqlCfOo/LSlpStmjrF
kRN8DYaohlTULvmFneSr8rfGsnp5yPgIPvdmqiSwTz/Ih7kYPgfLy6+v6IAHUqZj
0n9oGs8eMqVvSzM2qqmyA9WGuQZRyhNjj4iDwn/he5YMw2kqxUQYGMpLnSu0Oi48
n4PqodtVol64jKLwaHZwyU8u71iyjUC5K9TDot/I2wlSRcTELJhxGh6c1sfDLyrO
u/htIszgKCgFvVrQoEZB
=dwrA
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull IOMMU updates from Joerg Roedel:
"This time with:
- Generic page-table framework for ARM IOMMUs using the LPAE
page-table format, ARM-SMMU and Renesas IPMMU make use of it
already.
- Break out the IO virtual address allocator from the Intel IOMMU so
that it can be used by other DMA-API implementations too. The
first user will be the ARM64 common DMA-API implementation for
IOMMUs
- Device tree support for Renesas IPMMU
- Various fixes and cleanups all over the place"
* tag 'iommu-updates-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (36 commits)
iommu/amd: Convert non-returned local variable to boolean when relevant
iommu: Update my email address
iommu/amd: Use wait_event in put_pasid_state_wait
iommu/amd: Fix amd_iommu_free_device()
iommu/arm-smmu: Avoid build warning
iommu/fsl: Various cleanups
iommu/fsl: Use %pa to print phys_addr_t
iommu/omap: Print phys_addr_t using %pa
iommu: Make more drivers depend on COMPILE_TEST
iommu/ipmmu-vmsa: Fix IOMMU lookup when multiple IOMMUs are registered
iommu: Disable on !MMU builds
iommu/fsl: Remove unused fsl_of_pamu_ids[]
iommu/fsl: Fix section mismatch
iommu/ipmmu-vmsa: Use the ARM LPAE page table allocator
iommu: Fix trace_map() to report original iova and original size
iommu/arm-smmu: add support for iova_to_phys through ATS1PR
iopoll: Introduce memory-mapped IO polling macros
iommu/arm-smmu: don't touch the secure STLBIALL register
iommu/arm-smmu: make use of generic LPAE allocator
iommu: io-pgtable-arm: add non-secure quirk
...
o Several clean ups to the code
One such clean up was to convert to 64 bit time keeping, in the
ring buffer benchmark code.
o Adding of __print_array() helper macro for TRACE_EVENT()
o Updating the sample/trace_events/ to add samples of different ways to
make trace events. Lots of features have been added since the sample
code was made, and these features are mostly unknown. Developers
have been making their own hacks to do things that are already available.
o Performance improvements. Most notably, I found a performance bug where
a waiter that is waiting for a full page from the ring buffer will
see that a full page is not available, and go to sleep. The sched
event caused by it going to sleep would cause it to wake up again.
It would see that there was still not a full page, and go back to sleep
again, and that would wake it up again, until finally it would see a
full page. This change has been marked for stable.
Other improvements include removing global locks from fast paths.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJU3M+GAAoJEEjnJuOKh9ldpWQIAJTUzeVXlU0cf3bVn768VW7e
XS41WHF34l1tNevmKTh6fCPiw8+U0UMGLQt5WKtyaaARsZn2MlefLVuvHPKFlK2w
+qcI4OEVHH97Qgf9HWJSsYgnZaOnOE+TENqnokEgXMimRMuVcd/S4QaGxwJVDcjm
iBF5j2TaG4aGbx4a3J7KueoZ3K+39r3ut15hIGi/IZBZldQ1pt26ytafD/KA3CU3
BLRM2HLttAMsV1ds0EDLgZjSGICVetFcdOmI5Gwj7Qr3KrOTRPYJMNc8NdDL7Js9
v8VhujhFGvcCrhO/IKpVvd9yluz3RCF+Z7ihc+D/+1B3Nsm0PTwN3Fl5J+f89AA=
=u2Mm
-----END PGP SIGNATURE-----
Merge tag 'trace-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"The updates included in this pull request for ftrace are:
o Several clean ups to the code
One such clean up was to convert to 64 bit time keeping, in the
ring buffer benchmark code.
o Adding of __print_array() helper macro for TRACE_EVENT()
o Updating the sample/trace_events/ to add samples of different ways
to make trace events. Lots of features have been added since the
sample code was made, and these features are mostly unknown.
Developers have been making their own hacks to do things that are
already available.
o Performance improvements. Most notably, I found a performance bug
where a waiter that is waiting for a full page from the ring buffer
will see that a full page is not available, and go to sleep. The
sched event caused by it going to sleep would cause it to wake up
again. It would see that there was still not a full page, and go
back to sleep again, and that would wake it up again, until finally
it would see a full page. This change has been marked for stable.
Other improvements include removing global locks from fast paths"
* tag 'trace-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ring-buffer: Do not wake up a splice waiter when page is not full
tracing: Fix unmapping loop in tracing_mark_write
tracing: Add samples of DECLARE_EVENT_CLASS() and DEFINE_EVENT()
tracing: Add TRACE_EVENT_FN example
tracing: Add TRACE_EVENT_CONDITION sample
tracing: Update the TRACE_EVENT fields available in the sample code
tracing: Separate out initializing top level dir from instances
tracing: Make tracing_init_dentry_tr() static
trace: Use 64-bit timekeeping
tracing: Add array printing helper
tracing: Remove newline from trace_printk warning banner
tracing: Use IS_ERR() check for return value of tracing_init_dentry()
tracing: Remove unneeded includes of debugfs.h and fs.h
tracing: Remove taking of trace_types_lock in pipe files
tracing: Add ref count to tracer for when they are being read by pipe
When studying page stealing, I noticed some weird looking decisions in
try_to_steal_freepages(). The first I assume is a bug (Patch 1), the
following two patches were driven by evaluation.
Testing was done with stress-highalloc of mmtests, using the
mm_page_alloc_extfrag tracepoint and postprocessing to get counts of how
often page stealing occurs for individual migratetypes, and what
migratetypes are used for fallbacks. Arguably, the worst case of page
stealing is when UNMOVABLE allocation steals from MOVABLE pageblock.
RECLAIMABLE allocation stealing from MOVABLE allocation is also not ideal,
so the goal is to minimize these two cases.
The evaluation of v2 wasn't always clear win and Joonsoo questioned the
results. Here I used different baseline which includes RFC compaction
improvements from [1]. I found that the compaction improvements reduce
variability of stress-highalloc, so there's less noise in the data.
First, let's look at stress-highalloc configured to do sync compaction,
and how these patches reduce page stealing events during the test. First
column is after fresh reboot, other two are reiterations of test without
reboot. That was all accumulater over 5 re-iterations (so the benchmark
was run 5x3 times with 5 fresh restarts).
Baseline:
3.19-rc4 3.19-rc4 3.19-rc4
5-nothp-1 5-nothp-2 5-nothp-3
Page alloc extfrag event 10264225 8702233 10244125
Extfrag fragmenting 10263271 8701552 10243473
Extfrag fragmenting for unmovable 13595 17616 15960
Extfrag fragmenting unmovable placed with movable 7989 12193 8447
Extfrag fragmenting for reclaimable 658 1840 1817
Extfrag fragmenting reclaimable placed with movable 558 1677 1679
Extfrag fragmenting for movable 10249018 8682096 10225696
With Patch 1:
3.19-rc4 3.19-rc4 3.19-rc4
6-nothp-1 6-nothp-2 6-nothp-3
Page alloc extfrag event 11834954 9877523 9774860
Extfrag fragmenting 11833993 9876880 9774245
Extfrag fragmenting for unmovable 7342 16129 11712
Extfrag fragmenting unmovable placed with movable 4191 10547 6270
Extfrag fragmenting for reclaimable 373 1130 923
Extfrag fragmenting reclaimable placed with movable 302 906 738
Extfrag fragmenting for movable 11826278 9859621 9761610
With Patch 2:
3.19-rc4 3.19-rc4 3.19-rc4
7-nothp-1 7-nothp-2 7-nothp-3
Page alloc extfrag event 4725990 3668793 3807436
Extfrag fragmenting 4725104 3668252 3806898
Extfrag fragmenting for unmovable 6678 7974 7281
Extfrag fragmenting unmovable placed with movable 2051 3829 4017
Extfrag fragmenting for reclaimable 429 1208 1278
Extfrag fragmenting reclaimable placed with movable 369 976 1034
Extfrag fragmenting for movable 4717997 3659070 3798339
With Patch 3:
3.19-rc4 3.19-rc4 3.19-rc4
8-nothp-1 8-nothp-2 8-nothp-3
Page alloc extfrag event 5016183 4700142 3850633
Extfrag fragmenting 5015325 4699613 3850072
Extfrag fragmenting for unmovable 1312 3154 3088
Extfrag fragmenting unmovable placed with movable 1115 2777 2714
Extfrag fragmenting for reclaimable 437 1193 1097
Extfrag fragmenting reclaimable placed with movable 330 969 879
Extfrag fragmenting for movable 5013576 4695266 3845887
In v2 we've seen apparent regression with Patch 1 for unmovable events,
this is now gone, suggesting it was indeed noise. Here, each patch
improves the situation for unmovable events. Reclaimable is improved by
patch 1 and then either the same modulo noise, or perhaps sligtly worse -
a small price for unmovable improvements, IMHO. The number of movable
allocations falling back to other migratetypes is most noisy, but it's
reduced to half at Patch 2 nevertheless. These are least critical as
compaction can move them around.
If we look at success rates, the patches don't affect them, that didn't change.
Baseline:
3.19-rc4 3.19-rc4 3.19-rc4
5-nothp-1 5-nothp-2 5-nothp-3
Success 1 Min 49.00 ( 0.00%) 42.00 ( 14.29%) 41.00 ( 16.33%)
Success 1 Mean 51.00 ( 0.00%) 45.00 ( 11.76%) 42.60 ( 16.47%)
Success 1 Max 55.00 ( 0.00%) 51.00 ( 7.27%) 46.00 ( 16.36%)
Success 2 Min 53.00 ( 0.00%) 47.00 ( 11.32%) 44.00 ( 16.98%)
Success 2 Mean 59.60 ( 0.00%) 50.80 ( 14.77%) 48.20 ( 19.13%)
Success 2 Max 64.00 ( 0.00%) 56.00 ( 12.50%) 52.00 ( 18.75%)
Success 3 Min 84.00 ( 0.00%) 82.00 ( 2.38%) 78.00 ( 7.14%)
Success 3 Mean 85.60 ( 0.00%) 82.80 ( 3.27%) 79.40 ( 7.24%)
Success 3 Max 86.00 ( 0.00%) 83.00 ( 3.49%) 80.00 ( 6.98%)
Patch 1:
3.19-rc4 3.19-rc4 3.19-rc4
6-nothp-1 6-nothp-2 6-nothp-3
Success 1 Min 49.00 ( 0.00%) 44.00 ( 10.20%) 44.00 ( 10.20%)
Success 1 Mean 51.80 ( 0.00%) 46.00 ( 11.20%) 45.80 ( 11.58%)
Success 1 Max 54.00 ( 0.00%) 49.00 ( 9.26%) 49.00 ( 9.26%)
Success 2 Min 58.00 ( 0.00%) 49.00 ( 15.52%) 48.00 ( 17.24%)
Success 2 Mean 60.40 ( 0.00%) 51.80 ( 14.24%) 50.80 ( 15.89%)
Success 2 Max 63.00 ( 0.00%) 54.00 ( 14.29%) 55.00 ( 12.70%)
Success 3 Min 84.00 ( 0.00%) 81.00 ( 3.57%) 79.00 ( 5.95%)
Success 3 Mean 85.00 ( 0.00%) 81.60 ( 4.00%) 79.80 ( 6.12%)
Success 3 Max 86.00 ( 0.00%) 82.00 ( 4.65%) 82.00 ( 4.65%)
Patch 2:
3.19-rc4 3.19-rc4 3.19-rc4
7-nothp-1 7-nothp-2 7-nothp-3
Success 1 Min 50.00 ( 0.00%) 44.00 ( 12.00%) 39.00 ( 22.00%)
Success 1 Mean 52.80 ( 0.00%) 45.60 ( 13.64%) 42.40 ( 19.70%)
Success 1 Max 55.00 ( 0.00%) 46.00 ( 16.36%) 47.00 ( 14.55%)
Success 2 Min 52.00 ( 0.00%) 48.00 ( 7.69%) 45.00 ( 13.46%)
Success 2 Mean 53.40 ( 0.00%) 49.80 ( 6.74%) 48.80 ( 8.61%)
Success 2 Max 57.00 ( 0.00%) 52.00 ( 8.77%) 52.00 ( 8.77%)
Success 3 Min 84.00 ( 0.00%) 81.00 ( 3.57%) 79.00 ( 5.95%)
Success 3 Mean 85.00 ( 0.00%) 82.40 ( 3.06%) 79.60 ( 6.35%)
Success 3 Max 86.00 ( 0.00%) 83.00 ( 3.49%) 80.00 ( 6.98%)
Patch 3:
3.19-rc4 3.19-rc4 3.19-rc4
8-nothp-1 8-nothp-2 8-nothp-3
Success 1 Min 46.00 ( 0.00%) 44.00 ( 4.35%) 42.00 ( 8.70%)
Success 1 Mean 50.20 ( 0.00%) 45.60 ( 9.16%) 44.00 ( 12.35%)
Success 1 Max 52.00 ( 0.00%) 47.00 ( 9.62%) 47.00 ( 9.62%)
Success 2 Min 53.00 ( 0.00%) 49.00 ( 7.55%) 48.00 ( 9.43%)
Success 2 Mean 55.80 ( 0.00%) 50.60 ( 9.32%) 49.00 ( 12.19%)
Success 2 Max 59.00 ( 0.00%) 52.00 ( 11.86%) 51.00 ( 13.56%)
Success 3 Min 84.00 ( 0.00%) 80.00 ( 4.76%) 79.00 ( 5.95%)
Success 3 Mean 85.40 ( 0.00%) 81.60 ( 4.45%) 80.40 ( 5.85%)
Success 3 Max 87.00 ( 0.00%) 83.00 ( 4.60%) 82.00 ( 5.75%)
While there's no improvement here, I consider reduced fragmentation events
to be worth on its own. Patch 2 also seems to reduce scanning for free
pages, and migrations in compaction, suggesting it has somewhat less work
to do:
Patch 1:
Compaction stalls 4153 3959 3978
Compaction success 1523 1441 1446
Compaction failures 2630 2517 2531
Page migrate success 4600827 4943120 5104348
Page migrate failure 19763 16656 17806
Compaction pages isolated 9597640 10305617 10653541
Compaction migrate scanned 77828948 86533283 87137064
Compaction free scanned 517758295 521312840 521462251
Compaction cost 5503 5932 6110
Patch 2:
Compaction stalls 3800 3450 3518
Compaction success 1421 1316 1317
Compaction failures 2379 2134 2201
Page migrate success 4160421 4502708 4752148
Page migrate failure 19705 14340 14911
Compaction pages isolated 8731983 9382374 9910043
Compaction migrate scanned 98362797 96349194 98609686
Compaction free scanned 496512560 469502017 480442545
Compaction cost 5173 5526 5811
As with v2, /proc/pagetypeinfo appears unaffected with respect to numbers
of unmovable and reclaimable pageblocks.
Configuring the benchmark to allocate like THP page fault (i.e. no sync
compaction) gives much noisier results for iterations 2 and 3 after
reboot. This is not so surprising given how [1] offers lower improvements
in this scenario due to less restarts after deferred compaction which
would change compaction pivot.
Baseline:
3.19-rc4 3.19-rc4 3.19-rc4
5-thp-1 5-thp-2 5-thp-3
Page alloc extfrag event 8148965 6227815 6646741
Extfrag fragmenting 8147872 6227130 6646117
Extfrag fragmenting for unmovable 10324 12942 15975
Extfrag fragmenting unmovable placed with movable 5972 8495 10907
Extfrag fragmenting for reclaimable 601 1707 2210
Extfrag fragmenting reclaimable placed with movable 520 1570 2000
Extfrag fragmenting for movable 8136947 6212481 6627932
Patch 1:
3.19-rc4 3.19-rc4 3.19-rc4
6-thp-1 6-thp-2 6-thp-3
Page alloc extfrag event 8345457 7574471 7020419
Extfrag fragmenting 8343546 7573777 7019718
Extfrag fragmenting for unmovable 10256 18535 30716
Extfrag fragmenting unmovable placed with movable 6893 11726 22181
Extfrag fragmenting for reclaimable 465 1208 1023
Extfrag fragmenting reclaimable placed with movable 353 996 843
Extfrag fragmenting for movable 8332825 7554034 6987979
Patch 2:
3.19-rc4 3.19-rc4 3.19-rc4
7-thp-1 7-thp-2 7-thp-3
Page alloc extfrag event 3512847 3020756 2891625
Extfrag fragmenting 3511940 3020185 2891059
Extfrag fragmenting for unmovable 9017 6892 6191
Extfrag fragmenting unmovable placed with movable 1524 3053 2435
Extfrag fragmenting for reclaimable 445 1081 1160
Extfrag fragmenting reclaimable placed with movable 375 918 986
Extfrag fragmenting for movable 3502478 3012212 2883708
Patch 3:
3.19-rc4 3.19-rc4 3.19-rc4
8-thp-1 8-thp-2 8-thp-3
Page alloc extfrag event 3181699 3082881 2674164
Extfrag fragmenting 3180812 3082303 2673611
Extfrag fragmenting for unmovable 1201 4031 4040
Extfrag fragmenting unmovable placed with movable 974 3611 3645
Extfrag fragmenting for reclaimable 478 1165 1294
Extfrag fragmenting reclaimable placed with movable 387 985 1030
Extfrag fragmenting for movable 3179133 3077107 2668277
The improvements for first iteration are clear, the rest is much noisier
and can appear like regression for Patch 1. Anyway, patch 2 rectifies it.
Allocation success rates are again unaffected so there's no point in
making this e-mail any longer.
[1] http://marc.info/?l=linux-mm&m=142166196321125&w=2
This patch (of 3):
When __rmqueue_fallback() is called to allocate a page of order X, it will
find a page of order Y >= X of a fallback migratetype, which is different
from the desired migratetype. With the help of try_to_steal_freepages(),
it may change the migratetype (to the desired one) also of:
1) all currently free pages in the pageblock containing the fallback page
2) the fallback pageblock itself
3) buddy pages created by splitting the fallback page (when Y > X)
These decisions take the order Y into account, as well as the desired
migratetype, with the goal of preventing multiple fallback allocations
that could e.g. distribute UNMOVABLE allocations among multiple
pageblocks.
Originally, decision for 1) has implied the decision for 3). Commit
47118af076 ("mm: mmzone: MIGRATE_CMA migration type added") changed that
(probably unintentionally) so that the buddy pages in case 3) are always
changed to the desired migratetype, except for CMA pageblocks.
Commit fef903efcf ("mm/page_allo.c: restructure free-page stealing code
and fix a bug") did some refactoring and added a comment that the case of
3) is intended. Commit 0cbef29a78 ("mm: __rmqueue_fallback() should
respect pageblock type") removed the comment and tried to restore the
original behavior where 1) implies 3), but due to the previous
refactoring, the result is instead that only 2) implies 3) - and the
conditions for 2) are less frequently met than conditions for 1). This
may increase fragmentation in situations where the code decides to steal
all free pages from the pageblock (case 1)), but then gives back the buddy
pages produced by splitting.
This patch restores the original intended logic where 1) implies 3).
During testing with stress-highalloc from mmtests, this has shown to
decrease the number of events where UNMOVABLE and RECLAIMABLE allocations
steal from MOVABLE pageblocks, which can lead to permanent fragmentation.
In some cases it has increased the number of events when MOVABLE
allocations steal from UNMOVABLE or RECLAIMABLE pageblocks, but these are
fixable by sync compaction and thus less harmful.
Note that evaluation has shown that the behavior introduced by
47118af076 for buddy pages in case 3) is actually even better than the
original logic, so the following patch will introduce it properly once
again. For stable backports of this patch it makes thus sense to only fix
versions containing 0cbef29a78.
[iamjoonsoo.kim@lge.com: tracepoint fix]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@vger.kernel.org> [3.13+ containing 0cbef29a78]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Compaction deferring logic is heavy hammer that block the way to the
compaction. It doesn't consider overall system state, so it could prevent
user from doing compaction falsely. In other words, even if system has
enough range of memory to compact, compaction would be skipped due to
compaction deferring logic. This patch add new tracepoint to understand
work of deferring logic. This will also help to check compaction success
and fail.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is not well analyzed that when/why compaction start/finish or not.
With these new tracepoints, we can know much more about start/finish
reason of compaction. I can find following bug with these tracepoint.
http://www.spinics.net/lists/linux-mm/msg81582.html
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>