This moves the LRU lists from the zone to the node and related data such
as counters, tracing, congestion tracking and writeback tracking.
Unfortunately, due to reclaim and compaction retry logic, it is
necessary to account for the number of LRU pages on both zone and node
logic. Most reclaim logic is based on the node counters but the retry
logic uses the zone counters which do not distinguish inactive and
active sizes. It would be possible to leave the LRU counters on a
per-zone basis but it's a heavier calculation across multiple cache
lines that is much more frequent than the retry checks.
Other than the LRU counters, this is mostly a mechanical patch but note
that it introduces a number of anomalies. For example, the scans are
per-zone but using per-node counters. We also mark a node as congested
when a zone is congested. This causes weird problems that are fixed
later but is easier to review.
In the event that there is excessive overhead on 32-bit systems due to
the nodes being on LRU then there are two potential solutions
1. Long-term isolation of highmem pages when reclaim is lowmem
When pages are skipped, they are immediately added back onto the LRU
list. If lowmem reclaim persisted for long periods of time, the same
highmem pages get continually scanned. The idea would be that lowmem
keeps those pages on a separate list until a reclaim for highmem pages
arrives that splices the highmem pages back onto the LRU. It potentially
could be implemented similar to the UNEVICTABLE list.
That would reduce the skip rate with the potential corner case is that
highmem pages have to be scanned and reclaimed to free lowmem slab pages.
2. Linear scan lowmem pages if the initial LRU shrink fails
This will break LRU ordering but may be preferable and faster during
memory pressure than skipping LRU pages.
Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking updates from David Miller:
1) Unified UDP encapsulation offload methods for drivers, from
Alexander Duyck.
2) Make DSA binding more sane, from Andrew Lunn.
3) Support QCA9888 chips in ath10k, from Anilkumar Kolli.
4) Several workqueue usage cleanups, from Bhaktipriya Shridhar.
5) Add XDP (eXpress Data Path), essentially running BPF programs on RX
packets as soon as the device sees them, with the option to mirror
the packet on TX via the same interface. From Brenden Blanco and
others.
6) Allow qdisc/class stats dumps to run lockless, from Eric Dumazet.
7) Add VLAN support to b53 and bcm_sf2, from Florian Fainelli.
8) Simplify netlink conntrack entry layout, from Florian Westphal.
9) Add ipv4 forwarding support to mlxsw spectrum driver, from Ido
Schimmel, Yotam Gigi, and Jiri Pirko.
10) Add SKB array infrastructure and convert tun and macvtap over to it.
From Michael S Tsirkin and Jason Wang.
11) Support qdisc packet injection in pktgen, from John Fastabend.
12) Add neighbour monitoring framework to TIPC, from Jon Paul Maloy.
13) Add NV congestion control support to TCP, from Lawrence Brakmo.
14) Add GSO support to SCTP, from Marcelo Ricardo Leitner.
15) Allow GRO and RPS to function on macsec devices, from Paolo Abeni.
16) Support MPLS over IPV4, from Simon Horman.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
xgene: Fix build warning with ACPI disabled.
be2net: perform temperature query in adapter regardless of its interface state
l2tp: Correctly return -EBADF from pppol2tp_getname.
net/mlx5_core/health: Remove deprecated create_singlethread_workqueue
net: ipmr/ip6mr: update lastuse on entry change
macsec: ensure rx_sa is set when validation is disabled
tipc: dump monitor attributes
tipc: add a function to get the bearer name
tipc: get monitor threshold for the cluster
tipc: make cluster size threshold for monitoring configurable
tipc: introduce constants for tipc address validation
net: neigh: disallow transition to NUD_STALE if lladdr is unchanged in neigh_update()
MAINTAINERS: xgene: Add driver and documentation path
Documentation: dtb: xgene: Add MDIO node
dtb: xgene: Add MDIO node
drivers: net: xgene: ethtool: Use phy_ethtool_gset and sset
drivers: net: xgene: Use exported functions
drivers: net: xgene: Enable MDIO driver
drivers: net: xgene: Add backward compatibility
drivers: net: phy: xgene: Add MDIO driver
...
Merge updates from Andrew Morton:
- a few misc bits
- ocfs2
- most(?) of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (125 commits)
thp: fix comments of __pmd_trans_huge_lock()
cgroup: remove unnecessary 0 check from css_from_id()
cgroup: fix idr leak for the first cgroup root
mm: memcontrol: fix documentation for compound parameter
mm: memcontrol: remove BUG_ON in uncharge_list
mm: fix build warnings in <linux/compaction.h>
mm, thp: convert from optimistic swapin collapsing to conservative
mm, thp: fix comment inconsistency for swapin readahead functions
thp: update Documentation/{vm/transhuge,filesystems/proc}.txt
shmem: split huge pages beyond i_size under memory pressure
thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE
khugepaged: add support of collapse for tmpfs/shmem pages
shmem: make shmem_inode_info::lock irq-safe
khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
thp: extract khugepaged from mm/huge_memory.c
shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
shmem: add huge pages support
shmem: get_unmapped_area align huge page
shmem: prepare huge= mount option and sysfs knob
mm, rmap: account shmem thp pages
...
To detect whether khugepaged swapin is worthwhile, this patch checks the
amount of young pages. There should be at least half of HPAGE_PMD_NR to
swapin.
Link: http://lkml.kernel.org/r/1468109451-1615-1-git-send-email-ebru.akagunduz@gmail.com
Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Suggested-by: Minchan Kim <minchan@kernel.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch extends khugepaged to support collapse of tmpfs/shmem pages.
We share fair amount of infrastructure with anon-THP collapse.
Few design points:
- First we are looking for VMA which can be suitable for mapping huge
page;
- If the VMA maps shmem file, the rest scan/collapse operations
operates on page cache, not on page tables as in anon VMA case.
- khugepaged_scan_shmem() finds a range which is suitable for huge
page. The scan is lockless and shouldn't disturb system too much.
- once the candidate for collapse is found, collapse_shmem() attempts
to create a huge page:
+ scan over radix tree, making the range point to new huge page;
+ new huge page is not-uptodate, locked and freezed (refcount
is 0), so nobody can touch them until we say so.
+ we swap in pages during the scan. khugepaged_scan_shmem()
filters out ranges with more than khugepaged_max_ptes_swap
swapped out pages. It's HPAGE_PMD_NR/8 by default.
+ old pages are isolated, unmapped and put to local list in case
to be restored back if collapse failed.
- if collapse succeed, we retract pte page tables from VMAs where huge
pages mapping is possible. The huge page will be mapped as PMD on
next minor fault into the range.
Link: http://lkml.kernel.org/r/1466021202-61880-35-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch makes swapin readahead to improve thp collapse rate. When
khugepaged scanned pages, there can be a few of the pages in swap area.
With the patch THP can collapse 4kB pages into a THP when there are up
to max_ptes_swap swap ptes in a 2MB range.
The patch was tested with a test program that allocates 400B of memory,
writes to it, and then sleeps. I force the system to swap out all.
Afterwards, the test program touches the area by writing, it skips a
page in each 20 pages of the area.
Without the patch, system did not swap in readahead. THP rate was %65
of the program of the memory, it did not change over time.
With this patch, after 10 minutes of waiting khugepaged had collapsed
%99 of the program's memory.
[kirill.shutemov@linux.intel.com: trivial cleanup of exit path of the function]
[kirill.shutemov@linux.intel.com: __collapse_huge_page_swapin(): drop unused 'pte' parameter]
[kirill.shutemov@linux.intel.com: do not hold anon_vma lock during swap in]
Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Xie XiuQi <xiexiuqi@huawei.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce a new sysfs integer knob
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap which makes
optimistic check for swapin readahead to increase thp collapse rate.
Before getting swapped out pages to memory, checks them and allows up to a
certain number. It also prints out using tracepoints amount of unmapped
ptes.
[vdavydov@parallels.com: fix scan not aborted on SCAN_EXCEED_SWAP_PTE]
[sfr@canb.auug.org.au: build fix]
Link: http://lkml.kernel.org/r/20160616154503.65806e12@canb.auug.org.au
Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Xie XiuQi <xiexiuqi@huawei.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The per-sb inode writeback list tracks inodes currently under writeback
to facilitate efficient sync processing. In particular, it ensures that
sync only needs to walk through a list of inodes that were cleaned by
the sync.
Add a couple tracepoints to help identify when inodes are added/removed
to and from the writeback lists. Piggyback off of the writeback
lazytime tracepoint template as it already tracks the relevant inode
information.
Link: http://lkml.kernel.org/r/1466594593-6757-3-git-send-email-bfoster@redhat.com
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <dchinner@redhat.com>
cc: Josef Bacik <jbacik@fb.com>
Cc: Holger Hoffstätte <holger.hoffstaette@applied-asynchrony.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull block driver updates from Jens Axboe:
"This branch also contains core changes. I've come to the conclusion
that from 4.9 and forward, I'll be doing just a single branch. We
often have dependencies between core and drivers, and it's hard to
always split them up appropriately without pulling core into drivers
when that happens.
That said, this contains:
- separate secure erase type for the core block layer, from
Christoph.
- set of discard fixes, from Christoph.
- bio shrinking fixes from Christoph, as a followup up to the
op/flags change in the core branch.
- map and append request fixes from Christoph.
- NVMeF (NVMe over Fabrics) code from Christoph. This is pretty
exciting!
- nvme-loop fixes from Arnd.
- removal of ->driverfs_dev from Dan, after providing a
device_add_disk() helper.
- bcache fixes from Bhaktipriya and Yijing.
- cdrom subchannel read fix from Vchannaiah.
- set of lightnvm updates from Wenwei, Matias, Johannes, and Javier.
- set of drbd updates and fixes from Fabian, Lars, and Philipp.
- mg_disk error path fix from Bart.
- user notification for failed device add for loop, from Minfei.
- NVMe in general:
+ NVMe delay quirk from Guilherme.
+ SR-IOV support and command retry limits from Keith.
+ fix for memory-less NUMA node from Masayoshi.
+ use UINT_MAX for discard sectors, from Minfei.
+ cancel IO fixes from Ming.
+ don't allocate unused major, from Neil.
+ error code fixup from Dan.
+ use constants for PSDT/FUSE from James.
+ variable init fix from Jay.
+ fabrics fixes from Ming, Sagi, and Wei.
+ various fixes"
* 'for-4.8/drivers' of git://git.kernel.dk/linux-block: (115 commits)
nvme/pci: Provide SR-IOV support
nvme: initialize variable before logical OR'ing it
block: unexport various bio mapping helpers
scsi/osd: open code blk_make_request
target: stop using blk_make_request
block: simplify and export blk_rq_append_bio
block: ensure bios return from blk_get_request are properly initialized
virtio_blk: use blk_rq_map_kern
memstick: don't allow REQ_TYPE_BLOCK_PC requests
block: shrink bio size again
block: simplify and cleanup bvec pool handling
block: get rid of bio_rw and READA
block: don't ignore -EOPNOTSUPP blkdev_issue_write_same
block: introduce BLKDEV_DISCARD_ZERO to fix zeroout
NVMe: don't allocate unused nvme_major
nvme: avoid crashes when node 0 is memoryless node.
nvme: Limit command retries
loop: Make user notify for adding loop device failed
nvme-loop: fix nvme-loop Kconfig dependencies
nvmet: fix return value check in nvmet_subsys_alloc()
...
Pull core block updates from Jens Axboe:
- the big change is the cleanup from Mike Christie, cleaning up our
uses of command types and modified flags. This is what will throw
some merge conflicts
- regression fix for the above for btrfs, from Vincent
- following up to the above, better packing of struct request from
Christoph
- a 2038 fix for blktrace from Arnd
- a few trivial/spelling fixes from Bart Van Assche
- a front merge check fix from Damien, which could cause issues on
SMR drives
- Atari partition fix from Gabriel
- convert cfq to highres timers, since jiffies isn't granular enough
for some devices these days. From Jan and Jeff
- CFQ priority boost fix idle classes, from me
- cleanup series from Ming, improving our bio/bvec iteration
- a direct issue fix for blk-mq from Omar
- fix for plug merging not involving the IO scheduler, like we do for
other types of merges. From Tahsin
- expose DAX type internally and through sysfs. From Toshi and Yigal
* 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
block: Fix front merge check
block: do not merge requests without consulting with io scheduler
block: Fix spelling in a source code comment
block: expose QUEUE_FLAG_DAX in sysfs
block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
Btrfs: fix comparison in __btrfs_map_block()
block: atari: Return early for unsupported sector size
Doc: block: Fix a typo in queue-sysfs.txt
cfq-iosched: Charge at least 1 jiffie instead of 1 ns
cfq-iosched: Fix regression in bonnie++ rewrite performance
cfq-iosched: Convert slice_resid from u64 to s64
block: Convert fifo_time from ulong to u64
blktrace: avoid using timespec
block/blk-cgroup.c: Declare local symbols static
block/bio-integrity.c: Add #include "blk.h"
block/partition-generic.c: Remove a set-but-not-used variable
block: bio: kill BIO_MAX_SIZE
cfq-iosched: temporarily boost queue priority for idle classes
block: drbd: avoid to use BIO_MAX_SIZE
block: bio: remove BIO_MAX_SECTORS
...
Pull libata updates from Tejun Heo:
"libata saw quite a bit of activities in this cycle:
- SMR drive support still being worked on
- bug fixes and improvements to misc SCSI command emulation
- some low level driver updates"
* 'for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata: (39 commits)
libata-scsi: better style in ata_msense_*()
AHCI: Clear GHC.IS to prevent unexpectly asserting INTx
ata: sata_dwc_460ex: remove redundant dev_err call
ata: define ATA_PROT_* in terms of ATA_PROT_FLAG_*
libata: remove ATA_PROT_FLAG_DATA
libata: remove ata_is_nodata
ata: make lba_{28,48}_ok() use ATA_MAX_SECTORS{,_LBA48}
libata-scsi: minor cleanup for ata_scsi_zbc_out_xlat
libata-scsi: Fix ZBC management out command translation
libata-scsi: Fix translation of REPORT ZONES command
ata: Handle ATA NCQ NO-DATA commands correctly
libata-eh: decode all taskfile protocols
ata: fixup ATA_PROT_NODATA
libsas: use ata_is_ncq() and ata_has_dma() accessors
libata: use ata_is_ncq() accessors
libata: return boolean values from ata_is_*
libata-scsi: avoid repeated calculation of number of TRIM ranges
libata-scsi: reject WRITE SAME (16) with n_block that exceeds limit
libata-scsi: rename ata_msense_ctl_mode() to ata_msense_control()
libata-scsi: fix D_SENSE bit relection in control mode page
...
These two are confusing leftover of the old world order, combining
values of the REQ_OP_ and REQ_ namespaces. For callers that don't
special case we mostly just replace bi_rw with bio_data_dir or
op_is_write, except for the few cases where a switch over the REQ_OP_
values makes more sense. Any check for READA is replaced with an
explicit check for REQ_RAHEAD. Also remove the READA alias for
REQ_RAHEAD.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The recent change to tracepoint napi:napi_poll changed the order of
the parameters that perf scripts sees, the printk was correct. The
problem was that the new parameters (work and budget) were pushed
in front of dev_name.
The new parameters obviously need to be appended to keep backward
compatible.
Fixes: 1db19db7f5 ("net: tracepoint napi:napi_poll add work and budget")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new taskfile protocol ATA_PROT_NCQ_NODATA to handle
ATA NCQ NO-DATA commands correctly.
And fixup ata_scsi_zbc_out_xlat() to use it.
Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Including devlink.h on ARM and probably other 32-bit architectures results in
a harmless warning:
In file included from ../include/trace/define_trace.h:95:0,
from ../include/trace/events/devlink.h:51,
from ../net/core/devlink.c:30:
include/trace/events/devlink.h: In function 'trace_raw_output_devlink_hwmsg':
include/trace/events/devlink.h:42:12: error: format '%lu' expects argument of type 'long unsigned int', but argument 10 has type 'size_t {aka unsigned int}' [-Werror=format=]
The correct format string for 'size_t' is %zu, not %lu, this works on all
architectures.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: e5224f0fe2 ("devlink: add hardware messages tracing facility")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Turned on that driver->owner which is struct module is not available when
modules are disabled. Better to depend on a driver name which is
always available.
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Fixes: e5224f0fe2 ("devlink: add hardware messages tracing facility")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define a tracepoint and allow user to trace messages going to and from
hardware associated with devlink instance.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
An important information for the napi_poll tracepoint is knowing
the work done (packets processed) by the napi_poll() call. Add
both the work done and budget, as they are related.
Handle trace_napi_poll() param change in dropwatch/drop_monitor
and in python perf script netdev-times.py in backward compat way,
as python fortunately supports optional parameter handling.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To avoid confusion between REQ_OP_FLUSH, which is handled by
request_fn drivers, and upper layers requesting the block layer
perform a flush sequence along with possibly a WRITE, this patch
renames REQ_FLUSH to REQ_PREFLUSH.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch drops the compat definition of req_op where it matches
the rq_flag_bits definitions, and drops the related old and compat
code that allowed users to set either the op or flags for the operation.
We also then store the operation in the bi_rw/cmd_flags field similar
to how we used to store the bio ioprio where it sat in the upper bits
of the field.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Have blktrace use the req/bio op accessor to get the REQ_OP.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Separate the op from the rq_flag_bits and have f2fs
set/get the bio using bio_set_op_attrs/bio_op.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
(kvm_stat had nothing to do with QEMU in the first place -- the tool
only interprets debugfs)
- expose per-vm statistics in debugfs and support them in kvm_stat
(KVM always collected per-vm statistics, but they were summarised into
global statistics)
x86:
- fix dynamic APICv (VMX was improperly configured and a guest could
access host's APIC MSRs, CVE-2016-4440)
- minor fixes
ARM changes from Christoffer Dall:
"This set of changes include the new vgic, which is a reimplementation
of our horribly broken legacy vgic implementation. The two
implementations will live side-by-side (with the new being the
configured default) for one kernel release and then we'll remove the
legacy one.
Also fixes a non-critical issue with virtual abort injection to
guests."
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABCAAGBQJXRz0KAAoJEED/6hsPKofosiMIAIHmRI+9I6VMNmQe5vrZKz9/
vt89QGxDJrFQwhEuZovenLEDaY6rMIJNguyvIbPhNuXNHIIPWbe6cO6OPwByqkdo
WI/IIqcAJN/Bpwt4/Y2977A5RwDOwWLkaDs0LrZCEKPCgeh9GWQf+EfyxkDJClhG
uIgbSAU+t+7b05K3c6NbiQT/qCzDTCdl6In6PI/DFSRRkXDaTcopjjp1PmMUSSsR
AM8LGhEzMer+hGKOH7H5TIbN+HFzAPjBuDGcoZt0/w9IpmmS5OMd3ZrZ320cohz8
zZQooRcFrT0ulAe+TilckmRMJdMZ69fyw3nzfqgAKEx+3PaqjKSY/tiEgqqDJHY=
=EEBK
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull second batch of KVM updates from Radim Krčmář:
"General:
- move kvm_stat tool from QEMU repo into tools/kvm/kvm_stat (kvm_stat
had nothing to do with QEMU in the first place -- the tool only
interprets debugfs)
- expose per-vm statistics in debugfs and support them in kvm_stat
(KVM always collected per-vm statistics, but they were summarised
into global statistics)
x86:
- fix dynamic APICv (VMX was improperly configured and a guest could
access host's APIC MSRs, CVE-2016-4440)
- minor fixes
ARM changes from Christoffer Dall:
- new vgic reimplementation of our horribly broken legacy vgic
implementation. The two implementations will live side-by-side
(with the new being the configured default) for one kernel release
and then we'll remove the legacy one.
- fix for a non-critical issue with virtual abort injection to guests"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (70 commits)
tools: kvm_stat: Add comments
tools: kvm_stat: Introduce pid monitoring
KVM: Create debugfs dir and stat files for each VM
MAINTAINERS: Add kvm tools
tools: kvm_stat: Powerpc related fixes
tools: Add kvm_stat man page
tools: Add kvm_stat vm monitor script
kvm:vmx: more complete state update on APICv on/off
KVM: SVM: Add more SVM_EXIT_REASONS
KVM: Unify traced vector format
svm: bitwise vs logical op typo
KVM: arm/arm64: vgic-new: Synchronize changes to active state
KVM: arm/arm64: vgic-new: enable build
KVM: arm/arm64: vgic-new: implement mapped IRQ handling
KVM: arm/arm64: vgic-new: Wire up irqfd injection
KVM: arm/arm64: vgic-new: Add vgic_v2/v3_enable
KVM: arm/arm64: vgic-new: vgic_init: implement map_resources
KVM: arm/arm64: vgic-new: vgic_init: implement vgic_init
KVM: arm/arm64: vgic-new: vgic_init: implement vgic_create
KVM: arm/arm64: vgic-new: vgic_init: implement kvm_vgic_hyp_init
...
Specifically the change from hex to decimal helps correlating events.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull libata ZAC support from Tejun Heo:
"This contains Zone ATA Command support for Shingled Magnetic Recording
devices.
In addition to sending the new commands down to the device, as ZAC
commands depend on getting a lot of responses from the device, piping
up responses is beefed up too. However, it doesn't involve changes to
libata core mechanism or its interaction with upper layers, so I'm not
expecting too many fallouts.
Kudos to Hannes for driving SMR support"
* 'for-4.7-zac' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata: (28 commits)
libata: support host-aware and host-managed ZAC devices
libata: support device-managed ZAC devices
libata: NCQ encapsulation for ZAC MANAGEMENT OUT
libata: Implement ZBC OUT translation
libata: implement ZBC IN translation
libata: fixup ZAC device disabling
libata-scsi: Generate sense code for disabled devices
libata-trace: decode subcommands
libata: Check log page directory before accessing pages
libata: Add command definitions for NCQ Encapsulation for READ LOG DMA EXT
libata: Separate out ata_dev_config_ncq_send_recv()
libata/libsas: Define ATA_CMD_NCQ_NON_DATA
libsas: enable FPDMA SEND/RECEIVE
libata: do not attempt to retrieve sense code twice
libata-scsi: Set information sense field for invalid parameter
libata-scsi: set bit pointer for sense code information
libata-scsi: Set field pointer in sense code
scsi: add scsi_set_sense_field_pointer()
libata: Implement control mode page to select sense format
libata-scsi: generate correct ATA pass-through sense
...
- fs-specific prefix for fscrypto
- fault injection facility
- expose validity bitmaps for user to be aware of fragmentation
- fallocate/rm/preallocation speed up
- use percpu counters
Bug fixes
- some inline_dentry/inline_data bugs
- error handling for atomic/volatile/orphan inodes
- recover broken superblock
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXQPu4AAoJEEAUqH6CSFDSILgP/1dj6fmtytr8c+55EBqXUGpo
M7rS93JTxlmU5BduIo9psJsEquTQoVEmxB/Gjd+ZnI5R6Rp1c/REaP0ba374rEhZ
ecMQh5QqzM1gRNFXrQhWFEL/KtfRqt3T80zebQP7pxFUm/m9NGMLWT43RzQ8AAhr
Y3P0NLdvxA4HAnipKptkPJcGZQlWnL9W/MR+LgsXLXqLDwJHkVu61GcF0y2ibcJM
lEtIRmyH5tg7hP5c5LTw9pKQFHkIZt5cHFLjrJ1x8FSm2TXOcJPbjOrThvcb+NKK
e0O+6R0meH2eMpak+BTkZp2YbPPyXOb1N00j//lmbPjCoJPd4ZuiJ+oRoHUlTxtU
FhO67t0brlDbMFQVRFrtv8VA8M6by+DTAAP3Ffx62I/TJkphKANCSoyQRhlWtxxO
kRU69N7ipnRNxO4WCv40FjaQjSIElCKysP1POazRmAOQm7UFTGT9Nj37+eqUcEPJ
HZ7O61DEHNemb0SMlJ8WSClstt0yUU+2cjRfTPAr2Wd3V8gYbRs0QUg5M2GLgywR
EmiJfpkXse3f/nR8W6g1hganSOXA0AZX+EUibed6VkV3oYemdFbm8OymeEmLmWpM
y2F3D7dPLW7MCoTXJqtwFWdoDwI+zkH4rJaPGTq5TVBRWVU/njX8OvoB47pOvKV1
kccL7zv2PekE1hSDO5WF
=6MSp
-----END PGP SIGNATURE-----
Merge tag 'for-f2fs-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, as Ted pointed out, fscrypto allows one more key prefix
given by filesystem to resolve backward compatibility issues. Other
than that, we've fixed several error handling cases by introducing
a fault injection facility. We've also achieved performance
improvement in some workloads as well as a bunch of bug fixes.
Summary:
Enhancements:
- fs-specific prefix for fscrypto
- fault injection facility
- expose validity bitmaps for user to be aware of fragmentation
- fallocate/rm/preallocation speed up
- use percpu counters
Bug fixes:
- some inline_dentry/inline_data bugs
- error handling for atomic/volatile/orphan inodes
- recover broken superblock"
* tag 'for-f2fs-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (73 commits)
f2fs: fix to update dirty page count correctly
f2fs: flush pending bios right away when error occurs
f2fs: avoid ENOSPC fault in the recovery process
f2fs: make exit_f2fs_fs more clear
f2fs: use percpu_counter for total_valid_inode_count
f2fs: use percpu_counter for alloc_valid_block_count
f2fs: use percpu_counter for # of dirty pages in inode
f2fs: use percpu_counter for page counters
f2fs: use bio count instead of F2FS_WRITEBACK page count
f2fs: manipulate dirty file inodes when DATA_FLUSH is set
f2fs: add fault injection to sysfs
f2fs: no need inc dirty pages under inode lock
f2fs: fix incorrect error path handling in f2fs_move_rehashed_dirents
f2fs: fix i_current_depth during inline dentry conversion
f2fs: correct return value type of f2fs_fill_super
f2fs: fix deadlock when flush inline data
f2fs: avoid f2fs_bug_on during recovery
f2fs: show # of orphan inodes
f2fs: support in batch fzero in dnode page
f2fs: support in batch multi blocks preallocation
...
COMPACT_COMPLETE now means that compaction and free scanner met. This
is not very useful information if somebody just wants to use this
feedback and make any decisions based on that. The current caller might
be a poor guy who just happened to scan tiny portion of the zone and
that could be the reason no suitable pages were compacted. Make sure we
distinguish the full and partial zone walks.
Consumers should treat COMPACT_PARTIAL_SKIPPED as a potential success
and be optimistic in retrying.
The existing users of COMPACT_COMPLETE are conservatively changed to use
COMPACT_PARTIAL_SKIPPED as well but some of them should be probably
reconsidered and only defer the compaction only for COMPACT_COMPLETE
with the new semantic.
This patch shouldn't introduce any functional changes.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
try_to_compact_pages() can currently return COMPACT_SKIPPED even when
the compaction is defered for some zone just because zone DMA is skipped
in 99% of cases due to watermark checks. This makes COMPACT_DEFERRED
basically unusable for the page allocator as a feedback mechanism.
Make sure we distinguish those two states properly and switch their
ordering in the enum. This would mean that the COMPACT_SKIPPED will be
returned only when all eligible zones are skipped.
As a result COMPACT_DEFERRED handling for THP in __alloc_pages_slowpath
will be more precise and we would bail out rather than reclaim.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- x86: miscellaneous fixes, AVIC support (local APIC virtualization,
AMD version)
- s390: polling for interrupts after a VCPU goes to halted state is
now enabled for s390; use hardware provided information about facility
bits that do not need any hypervisor activity, and other fixes for
cpu models and facilities; improve perf output; floating interrupt
controller improvements.
- MIPS: miscellaneous fixes
- PPC: bugfixes only
- ARM: 16K page size support, generic firmware probing layer for
timer and GIC
Christoffer Dall (KVM-ARM maintainer) says:
"There are a few changes in this pull request touching things outside
KVM, but they should all carry the necessary acks and it made the
merge process much easier to do it this way."
though actually the irqchip maintainers' acks didn't make it into the
patches. Marc Zyngier, who is both irqchip and KVM-ARM maintainer,
later acked at http://mid.gmane.org/573351D1.4060303@arm.com
"more formally and for documentation purposes".
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJXPJjyAAoJEL/70l94x66DhioH/j4fwQ0FmfPSM9PArzaFHQdx
LNE3tU4+bobbsy1BJr4DiAaOUQn3DAgwUvGLWXdeLiOXtoWXBiFHKaxlqEsCA6iQ
xcTH1TgfxsVoqGQ6bT9X/2GCx70heYpcWG3f+zqBy7ZfFmQykLAC/HwOr52VQL8f
hUFi3YmTHcnorp0n5Xg+9r3+RBS4D/kTbtdn6+KCLnPJ0RcgNkI3/NcafTemoofw
Tkv8+YYFNvKV13qlIfVqxMa0GwWI3pP6YaNKhaS5XO8Pu16HuuF1JthJsUBDzwBa
RInp8R9MoXgsBYhLpz3jc9vWG7G9yDl5LehsD9KOUGOaFYJ7sQN+QZOusa6jFgA=
=llO5
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"Small release overall.
x86:
- miscellaneous fixes
- AVIC support (local APIC virtualization, AMD version)
s390:
- polling for interrupts after a VCPU goes to halted state is now
enabled for s390
- use hardware provided information about facility bits that do not
need any hypervisor activity, and other fixes for cpu models and
facilities
- improve perf output
- floating interrupt controller improvements.
MIPS:
- miscellaneous fixes
PPC:
- bugfixes only
ARM:
- 16K page size support
- generic firmware probing layer for timer and GIC
Christoffer Dall (KVM-ARM maintainer) says:
"There are a few changes in this pull request touching things
outside KVM, but they should all carry the necessary acks and it
made the merge process much easier to do it this way."
though actually the irqchip maintainers' acks didn't make it into the
patches. Marc Zyngier, who is both irqchip and KVM-ARM maintainer,
later acked at http://mid.gmane.org/573351D1.4060303@arm.com ('more
formally and for documentation purposes')"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (82 commits)
KVM: MTRR: remove MSR 0x2f8
KVM: x86: make hwapic_isr_update and hwapic_irr_update look the same
svm: Manage vcpu load/unload when enable AVIC
svm: Do not intercept CR8 when enable AVIC
svm: Do not expose x2APIC when enable AVIC
KVM: x86: Introducing kvm_x86_ops.apicv_post_state_restore
svm: Add VMEXIT handlers for AVIC
svm: Add interrupt injection via AVIC
KVM: x86: Detect and Initialize AVIC support
svm: Introduce new AVIC VMCB registers
KVM: split kvm_vcpu_wake_up from kvm_vcpu_kick
KVM: x86: Introducing kvm_x86_ops VCPU blocking/unblocking hooks
KVM: x86: Introducing kvm_x86_ops VM init/destroy hooks
KVM: x86: Rename kvm_apic_get_reg to kvm_lapic_get_reg
KVM: x86: Misc LAPIC changes to expose helper functions
KVM: shrink halt polling even more for invalid wakeups
KVM: s390: set halt polling to 80 microseconds
KVM: halt_polling: provide a way to qualify wakeups during poll
KVM: PPC: Book3S HV: Re-enable XICS fast path for irqfd-generated interrupts
kvm: Conditionally register IRQ bypass consumer
...
This patch includes the usual quota of driver updates (bnx2fc, mp3sas,
hpsa, ncr5380, lpfc, hisi_sas, snic, aacraid, megaraid_sas) there's
also a multiqueue update for scsi_debug, assorted bug fixes and a few
other minor updates (refactor of scsi_sg_pools into generic code, alua
and VPD updates, and struct timeval conversions).
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABAgAGBQJXO8W0AAoJEDeqqVYsXL0MW24H/jGWwfjsDUiSsLwbLca6DWu8
ZCWZ7rSZ27CApwGPgZGpLvUg+vpW8Ykm2zdeBnlZ6ScXS+dT3uo/PHsnemsTextj
6glQNIOFY0Ja2GwkkN00M6IZQhTJ628cqJKIEJxC68lIw16wiOwjZaK68GMrusDO
Sl062rkuLR6Jb2T+YoT/sD8jQfWlSj2V9e9rqJoS/rIbS6B+hUipuybz2yQ2yK2u
XFc30yal9oVz1fHEoh2O8aqckW3/iskukVXVuZ0MQzT/lV/bm9I6AnWVHw7d0Yhp
ZELjXpjx5M2Z/d8k0Wvx1e25oL/ERwa96yLnTvRcqyF5Yt1EgAhT+jKvo4pnGr8=
=L6y/
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"First round of SCSI updates for the 4.6+ merge window.
This batch includes the usual quota of driver updates (bnx2fc, mp3sas,
hpsa, ncr5380, lpfc, hisi_sas, snic, aacraid, megaraid_sas). There's
also a multiqueue update for scsi_debug, assorted bug fixes and a few
other minor updates (refactor of scsi_sg_pools into generic code, alua
and VPD updates, and struct timeval conversions)"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (138 commits)
mpt3sas: Used "synchronize_irq()"API to synchronize timed-out IO & TMs
mpt3sas: Set maximum transfer length per IO to 4MB for VDs
mpt3sas: Updating mpt3sas driver version to 13.100.00.00
mpt3sas: Fix initial Reference tag field for 4K PI drives.
mpt3sas: Handle active cable exception event
mpt3sas: Update MPI header to 2.00.42
Revert "lpfc: Delete unnecessary checks before the function call mempool_destroy"
eata_pio: missing break statement
hpsa: Fix type ZBC conditional checks
scsi_lib: Decode T10 vendor IDs
scsi_dh_alua: do not fail for unknown VPD identification
scsi_debug: use locally assigned naa
scsi_debug: uuid for lu name
scsi_debug: vpd and mode page work
scsi_debug: add multiple queue support
bfa: fix bfa_fcb_itnim_alloc() error handling
megaraid_sas: Downgrade two success messages to info
cxlflash: Fix to resolve dead-lock during EEH recovery
scsi_debug: rework resp_report_luns
scsi_debug: use pdt constants
...
Pull networking updates from David Miller:
"Highlights:
1) Support SPI based w5100 devices, from Akinobu Mita.
2) Partial Segmentation Offload, from Alexander Duyck.
3) Add GMAC4 support to stmmac driver, from Alexandre TORGUE.
4) Allow cls_flower stats offload, from Amir Vadai.
5) Implement bpf blinding, from Daniel Borkmann.
6) Optimize _ASYNC_ bit twiddling on sockets, unless the socket is
actually using FASYNC these atomics are superfluous. From Eric
Dumazet.
7) Run TCP more preemptibly, also from Eric Dumazet.
8) Support LED blinking, EEPROM dumps, and rxvlan offloading in mlx5e
driver, from Gal Pressman.
9) Allow creating ppp devices via rtnetlink, from Guillaume Nault.
10) Improve BPF usage documentation, from Jesper Dangaard Brouer.
11) Support tunneling offloads in qed, from Manish Chopra.
12) aRFS offloading in mlx5e, from Maor Gottlieb.
13) Add RFS and RPS support to SCTP protocol, from Marcelo Ricardo
Leitner.
14) Add MSG_EOR support to TCP, this allows controlling packet
coalescing on application record boundaries for more accurate
socket timestamp sampling. From Martin KaFai Lau.
15) Fix alignment of 64-bit netlink attributes across the board, from
Nicolas Dichtel.
16) Per-vlan stats in bridging, from Nikolay Aleksandrov.
17) Several conversions of drivers to ethtool ksettings, from Philippe
Reynes.
18) Checksum neutral ILA in ipv6, from Tom Herbert.
19) Factorize all of the various marvell dsa drivers into one, from
Vivien Didelot
20) Add VF support to qed driver, from Yuval Mintz"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1649 commits)
Revert "phy dp83867: Fix compilation with CONFIG_OF_MDIO=m"
Revert "phy dp83867: Make rgmii parameters optional"
r8169: default to 64-bit DMA on recent PCIe chips
phy dp83867: Make rgmii parameters optional
phy dp83867: Fix compilation with CONFIG_OF_MDIO=m
bpf: arm64: remove callee-save registers use for tmp registers
asix: Fix offset calculation in asix_rx_fixup() causing slow transmissions
switchdev: pass pointer to fib_info instead of copy
net_sched: close another race condition in tcf_mirred_release()
tipc: fix nametable publication field in nl compat
drivers: net: Don't print unpopulated net_device name
qed: add support for dcbx.
ravb: Add missing free_irq() calls to ravb_close()
qed: Remove a stray tab
net: ethernet: fec-mpc52xx: use phy_ethtool_{get|set}_link_ksettings
net: ethernet: fec-mpc52xx: use phydev from struct net_device
bpf, doc: fix typo on bpf_asm descriptions
stmmac: hardware TX COE doesn't work when force_thresh_dma_mode is set
net: ethernet: fs-enet: use phy_ethtool_{get|set}_link_ksettings
net: ethernet: fs-enet: use phydev from struct net_device
...
Pull parallel filesystem directory handling update from Al Viro.
This is the main parallel directory work by Al that makes the vfs layer
able to do lookup and readdir in parallel within a single directory.
That's a big change, since this used to be all protected by the
directory inode mutex.
The inode mutex is replaced by an rwsem, and serialization of lookups of
a single name is done by a "in-progress" dentry marker.
The series begins with xattr cleanups, and then ends with switching
filesystems over to actually doing the readdir in parallel (switching to
the "iterate_shared()" that only takes the read lock).
A more detailed explanation of the process from Al Viro:
"The xattr work starts with some acl fixes, then switches ->getxattr to
passing inode and dentry separately. This is the point where the
things start to get tricky - that got merged into the very beginning
of the -rc3-based #work.lookups, to allow untangling the
security_d_instantiate() mess. The xattr work itself proceeds to
switch a lot of filesystems to generic_...xattr(); no complications
there.
After that initial xattr work, the series then does the following:
- untangle security_d_instantiate()
- convert a bunch of open-coded lookup_one_len_unlocked() to calls of
that thing; one such place (in overlayfs) actually yields a trivial
conflict with overlayfs fixes later in the cycle - overlayfs ended
up switching to a variant of lookup_one_len_unlocked() sans the
permission checks. I would've dropped that commit (it gets
overridden on merge from #ovl-fixes in #for-next; proper resolution
is to use the variant in mainline fs/overlayfs/super.c), but I
didn't want to rebase the damn thing - it was fairly late in the
cycle...
- some filesystems had managed to depend on lookup/lookup exclusion
for *fs-internal* data structures in a way that would break if we
relaxed the VFS exclusion. Fixing hadn't been hard, fortunately.
- core of that series - parallel lookup machinery, replacing
->i_mutex with rwsem, making lookup_slow() take it only shared. At
that point lookups happen in parallel; lookups on the same name
wait for the in-progress one to be done with that dentry.
Surprisingly little code, at that - almost all of it is in
fs/dcache.c, with fs/namei.c changes limited to lookup_slow() -
making it use the new primitive and actually switching to locking
shared.
- parallel readdir stuff - first of all, we provide the exclusion on
per-struct file basis, same as we do for read() vs lseek() for
regular files. That takes care of most of the needed exclusion in
readdir/readdir; however, these guys are trickier than lookups, so
I went for switching them one-by-one. To do that, a new method
'->iterate_shared()' is added and filesystems are switched to it
as they are either confirmed to be OK with shared lock on directory
or fixed to be OK with that. I hope to kill the original method
come next cycle (almost all in-tree filesystems are switched
already), but it's still not quite finished.
- several filesystems get switched to parallel readdir. The
interesting part here is dealing with dcache preseeding by readdir;
that needs minor adjustment to be safe with directory locked only
shared.
Most of the filesystems doing that got switched to in those
commits. Important exception: NFS. Turns out that NFS folks, with
their, er, insistence on VFS getting the fuck out of the way of the
Smart Filesystem Code That Knows How And What To Lock(tm) have
grown the locking of their own. They had their own homegrown
rwsem, with lookup/readdir/atomic_open being *writers* (sillyunlink
is the reader there). Of course, with VFS getting the fuck out of
the way, as requested, the actual smarts of the smart filesystem
code etc. had become exposed...
- do_last/lookup_open/atomic_open cleanups. As the result, open()
without O_CREAT locks the directory only shared. Including the
->atomic_open() case. Backmerge from #for-linus in the middle of
that - atomic_open() fix got brought in.
- then comes NFS switch to saner (VFS-based ;-) locking, killing the
homegrown "lookup and readdir are writers" kinda-sorta rwsem. All
exclusion for sillyunlink/lookup is done by the parallel lookups
mechanism. Exclusion between sillyunlink and rmdir is a real rwsem
now - rmdir being the writer.
Result: NFS lookups/readdirs/O_CREAT-less opens happen in parallel
now.
- the rest of the series consists of switching a lot of filesystems
to parallel readdir; in a lot of cases ->llseek() gets simplified
as well. One backmerge in there (again, #for-linus - rockridge
fix)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (74 commits)
ext4: switch to ->iterate_shared()
hfs: switch to ->iterate_shared()
hfsplus: switch to ->iterate_shared()
hostfs: switch to ->iterate_shared()
hpfs: switch to ->iterate_shared()
hpfs: handle allocation failures in hpfs_add_pos()
gfs2: switch to ->iterate_shared()
f2fs: switch to ->iterate_shared()
afs: switch to ->iterate_shared()
befs: switch to ->iterate_shared()
befs: constify stuff a bit
isofs: switch to ->iterate_shared()
get_acorn_filename(): deobfuscate a bit
btrfs: switch to ->iterate_shared()
logfs: no need to lock directory in lseek
switch ecryptfs to ->iterate_shared
9p: switch to ->iterate_shared()
fat: switch to ->iterate_shared()
romfs, squashfs: switch to ->iterate_shared()
more trivial ->iterate_shared conversions
...
- Add TRACE support to be able to debug request flow
- Extend/improve reset support for (e)MMC
- Convert MMC pwrseq to platform device drivers
- Use IDA for indexes
- Some additional minor improvements
MMC host:
- sdhci: Re-factoring, clean-ups and improvements
- sdhci-acpi|pci: Use MMC_CAP_AGGRESSIVE_PM for Broxton
- omap/omap_hsmmc: Convert to use dma_request_chan()
- usdhi6rol0: Add support for UHS modes
- sh_mmcif: Update runtime PM support
- tmio: Wolfram Sang steps in as maintainer
- tmio: Add UHS-I mode support
- sh_mobile_sdhi: Add UHS-I mode support
- tmio/sdhi: Re-factoring, clean-ups and improvements
- dw_mmc: Re-factoring and clean-ups
- davinci: Convert to use dma_request_chan()
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXObOWAAoJEP4mhCVzWIwp3P4QANEb2z7NgUOw3DTti87r05gj
N3PNafNIn7EjrtuBenaVNZUkGnjnzVanNYEMArGFIdeVhJ/ZCCJY1fUOK161NUmO
1zRCOSufD9mRmhNtKb7jKu1YboXPRyKDaPVBTSSVrQPBw699tALGHCyAFvgFFKPD
RvTPHSvH1vTy0VF50/ao/vl1ci89nxp6PBG/5xe1rorBHH1CYaiPgWtniMqc09Ix
LiAO8Ox7fNd4WgK1tO56xJgEN2WA+Pbqy/7UabO+OjXoAMbPmO/l8vP0/9MqlBaX
WZyDVwusQ9VhyDMsQ6tpZa6k8G3u3LOeolZWHKQqHpJYbNuwP+szh4gdJRb838CC
AIz9UWC35ERn7yYD0aL5ok0TQhf4NJhJZibbGT2zNtnUVaSJnrJsqNtQOtEVLI9v
KxzSiKsAAC0fGpyvze3/yU4JXc1yJd8EXm1iakF5KYBimC+wzVRqQmuDUPrLjTG5
iypctu+yqb2OXmKbedsCruJ7nnLYAcGFKAaUSvCxn7AO4e44YEU7VIeWdC+NO6+s
vf9HNfKwiorw2mkYNcfnJgTjzqXhimOp+94WAOUBMhi1w+OZ1TUlSriTyBbK3s1G
rb4I37T7oLZIpDitfvmra9ORqNyUr0AlG3728BScN/Rc3731uEIBRd11h32hUoXk
b8a9ORVfHZHMrv5+5T0N
=89kT
-----END PGP SIGNATURE-----
Merge tag 'mmc-v4.7' of git://git.linaro.org/people/ulf.hansson/mmc
Pull MMC updates from Ulf Hansson:
"MMC core:
- Add TRACE support to be able to debug request flow
- Extend/improve reset support for (e)MMC
- Convert MMC pwrseq to platform device drivers
- Use IDA for indexes
- Some additional minor improvements
MMC host:
- sdhci: Re-factoring, clean-ups and improvements
- sdhci-acpi|pci: Use MMC_CAP_AGGRESSIVE_PM for Broxton
- omap/omap_hsmmc: Convert to use dma_request_chan()
- usdhi6rol0: Add support for UHS modes
- sh_mmcif: Update runtime PM support
- tmio: Wolfram Sang steps in as maintainer
- tmio: Add UHS-I mode support
- sh_mobile_sdhi: Add UHS-I mode support
- tmio/sdhi: Re-factoring, clean-ups and improvements
- dw_mmc: Re-factoring and clean-ups
- davinci: Convert to use dma_request_chan()"
* tag 'mmc-v4.7' of git://git.linaro.org/people/ulf.hansson/mmc: (99 commits)
mmc: mmc: Fix partition switch timeout for some eMMCs
mmc: sh_mobile_sdhi: enable SDIO IRQs for RCar Gen3
mmc: sdio: fall back to SDIO 1.0 for broken 1.1 cards
mmc: sdhci-st: correct name of sd-uhs-sdr50 property
MAINTAINERS: update entry for TMIO MMC driver
mmc: block: improve logging of handling emmc timeouts
mmc: sdhci: removed unneeded function wrappers
mmc: core: remove the invalid message in mmc_select_timing
mmc: core: fix using wrong io voltage if mmc_select_hs200 fails
mmc: sdhci-of-arasan: fix set_clock when a phy is supported
mmc: omap: Use dma_request_chan() for requesting DMA channel
mmc: mmc: Attempt to flush cache before reset
mmc: sh_mobile_sdhi: check return value when changing clk
mmc: sh_mobile_sdhi: only change the clock on RCar Gen2+
mmc: tmio/sdhi: introduce flag for RCar 2+ specific features
mmc: sh_mobile_sdhi: make clk_update function more compact
mmc: omap_hsmmc: Use dma_request_chan() for requesting DMA channel
mmc: sdhci-of-at91: add presets setup
mmc: usdhi6rol0: add pinctrl to set pin drive strength
mmc: usdhi6rol0: add support for UHS modes
...
Some wakeups should not be considered a sucessful poll. For example on
s390 I/O interrupts are usually floating, which means that _ALL_ CPUs
would be considered runnable - letting all vCPUs poll all the time for
transactional like workload, even if one vCPU would be enough.
This can result in huge CPU usage for large guests.
This patch lets architectures provide a way to qualify wakeups if they
should be considered a good/bad wakeups in regard to polls.
For s390 the implementation will fence of halt polling for anything but
known good, single vCPU events. The s390 implementation for floating
interrupts does a wakeup for one vCPU, but the interrupt will be delivered
by whatever CPU checks first for a pending interrupt. We prefer the
woken up CPU by marking the poll of this CPU as "good" poll.
This code will also mark several other wakeup reasons like IPI or
expired timers as "good". This will of course also mark some events as
not sucessful. As KVM on z runs always as a 2nd level hypervisor,
we prefer to not poll, unless we are really sure, though.
This patch successfully limits the CPU usage for cases like uperf 1byte
transactional ping pong workload or wakeup heavy workload like OLTP
while still providing a proper speedup.
This also introduced a new vcpu stat "halt_poll_no_tuning" that marks
wakeups that are considered not good for polling.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Radim Krčmář <rkrcmar@redhat.com> (for an earlier version)
Cc: David Matlack <dmatlack@google.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
[Rename config symbol. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch introduces reserve_new_blocks to make preallocation of multi
blocks as in batch operation, so it can avoid lots of redundant
operation, result in better performance.
In virtual machine, with rotational device:
time fallocate -l 32G /mnt/f2fs/file
Before:
real 0m4.584s
user 0m0.000s
sys 0m4.580s
After:
real 0m0.292s
user 0m0.000s
sys 0m0.272s
In x86, with SSD:
time fallocate -l 500G $MNT/testfile
Before : 24.758 s
After : 1.604 s
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix bugs and add performance numbers measured in x86.]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
ZAC drives implement a 'ZAC Management Out' command template,
which maps onto the ZBC OUT command.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
ZAC drives implement a 'ZAC Management In' command template,
which maps onto the ZBC IN command.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Some commands like FPDMA RECEIVE or NCQ NON DATA can encapsulate
other commands to NCQ transport. So decode the subcmds, too.
Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Define the NCQ NON DATA command and update libsas to handle it
correctly.
Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This patch provides some tracepoints for the lifecycle of a mmc request
from starting to completion to help with performance analysis of MMC
subsystem.
Signed-off-by: Baolin Wang <baolin.wang@linaro.org>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Pull RCU updates from Paul E. McKenney:
* Documentation updates, including fixes to the design-level
requirements documentation and a fixed version of the design-level
data-structure documentation. These fixes include removing
cartoons and getting rid of the html/htmlx duplication.
* Further improvements to the new-age expedited grace periods.
* Miscellaneous fixes.
* Torture-test changes, including a new rcuperf module for measuring
RCU grace-period performance and scalability, which is useful for
the expedited-grace-period changes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
move trace_call_bpf() into helper function to minimize the size
of perf_trace_*() tracepoint handlers.
text data bss dec hex filename
10541679 5526646 2945024 19013349 1221ee5 vmlinux_before
10509422 5526646 2945024 18981092 121a0e4 vmlinux_after
It may seem that perf_fetch_caller_regs() can also be moved,
but that is incorrect, since ip/sp will be wrong.
bpf+tracepoint performance is not affected, since
perf_swevent_put_recursion_context() is now inlined.
export_symbol_gpl can also be dropped.
No measurable change in normal perf tracepoints.
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add new trace functions for ZBC_IN and ZBC_OUT.
Reviewed-by: Doug Gilbert <dgilbert@interlog.com>
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
scsi_opcode_name() is displaying the opcode, not the service
action.
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Pull btrfs fixes from Chris Mason:
"These are bug fixes, including a really old fsync bug, and a few trace
points to help us track down problems in the quota code"
* 'for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix file/data loss caused by fsync after rename and new inode
btrfs: Reset IO error counters before start of device replacing
btrfs: Add qgroup tracing
Btrfs: don't use src fd for printk
btrfs: fallback to vmalloc in btrfs_compare_tree
btrfs: handle non-fatal errors in btrfs_qgroup_inherit()
btrfs: Output more info for enospc_debug mount option
Btrfs: fix invalid reference in replace_path
Btrfs: Improve FL_KEEP_SIZE handling in fallocate
introduce BPF_PROG_TYPE_TRACEPOINT program type and allow it to be attached
to the perf tracepoint handler, which will copy the arguments into
the per-cpu buffer and pass it to the bpf program as its first argument.
The layout of the fields can be discovered by doing
'cat /sys/kernel/debug/tracing/events/sched/sched_switch/format'
prior to the compilation of the program with exception that first 8 bytes
are reserved and not accessible to the program. This area is used to store
the pointer to 'struct pt_regs' which some of the bpf helpers will use:
+---------+
| 8 bytes | hidden 'struct pt_regs *' (inaccessible to bpf program)
+---------+
| N bytes | static tracepoint fields defined in tracepoint/format (bpf readonly)
+---------+
| dynamic | __dynamic_array bytes of tracepoint (inaccessible to bpf yet)
+---------+
Not that all of the fields are already dumped to user space via perf ring buffer
and broken application access it directly without consulting tracepoint/format.
Same rule applies here: static tracepoint fields should only be accessed
in a format defined in tracepoint/format. The order of fields and
field sizes are not an ABI.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>