Problem similar to ones dealt with in "fold checks into iterate_and_advance()"
and followups, except that in this case we really want to do nothing when
asked for zero-length operation - unlike zero-length iterate_and_advance(),
zero-length iterate_all_kinds() has no side effects, and callers are simpler
that way.
That got exposed when copy_from_iter_full() had been used by tipc, which
builds an msghdr with zero payload and (now) feeds it to a primitive
based on iterate_all_kinds() instead of iterate_and_advance().
Reported-by: Jon Maloy <jon.maloy@ericsson.com>
Tested-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
On architectures like arm64, swiotlb is tied intimately to the core
architecture DMA support. In addition, ZONE_DMA cannot be disabled.
To aid debugging and catch devices not supporting DMA to memory outside
the 32-bit address space, add a kernel command line option
"swiotlb=noforce", which disables the use of bounce buffers.
If specified, trying to map memory that cannot be used with DMA will
fail, and a rate-limited warning will be printed.
Note that io_tlb_nslabs is set to 1, which is the minimal supported
value.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Convert the flag swiotlb_force from an int to an enum, to prepare for
the advent of more possible values.
Suggested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Pull vfs updates from Al Viro:
- more ->d_init() stuff (work.dcache)
- pathname resolution cleanups (work.namei)
- a few missing iov_iter primitives - copy_from_iter_full() and
friends. Either copy the full requested amount, advance the iterator
and return true, or fail, return false and do _not_ advance the
iterator. Quite a few open-coded callers converted (and became more
readable and harder to fuck up that way) (work.iov_iter)
- several assorted patches, the big one being logfs removal
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
logfs: remove from tree
vfs: fix put_compat_statfs64() does not handle errors
namei: fold should_follow_link() with the step into not-followed link
namei: pass both WALK_GET and WALK_MORE to should_follow_link()
namei: invert WALK_PUT logics
namei: shift interpretation of LOOKUP_FOLLOW inside should_follow_link()
namei: saner calling conventions for mountpoint_last()
namei.c: get rid of user_path_parent()
switch getfrag callbacks to ..._full() primitives
make skb_add_data,{_nocache}() and skb_copy_to_page_nocache() advance only on success
[iov_iter] new primitives - copy_from_iter_full() and friends
don't open-code file_inode()
ceph: switch to use of ->d_init()
ceph: unify dentry_operations instances
lustre: switch to use of ->d_init()
[ This resurrects commit 53855d10f4, which was reverted in
2b41226b39. It depended on commit d544abd5ff ("lib/radix-tree:
Convert to hotplug state machine") so now it is correct to apply ]
Patch "lib/radix-tree: Convert to hotplug state machine" breaks the test
suite as it adds a call to cpuhp_setup_state_nocalls() which is not
currently emulated in the test suite. Add it, and delete the emulation
of the old CPU hotplug mechanism.
Link: http://lkml.kernel.org/r/1480369871-5271-36-git-send-email-mawilcox@linuxonhyperv.com
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
radix_tree_join() was freeing nodes with a non-zero ->exceptional count,
and radix_tree_split() wasn't zeroing ->exceptional when it allocated
the new node. Fix this by making all callers of radix_tree_node_alloc()
pass in the new counts (and some other always-initialised fields), which
will prevent the problem recurring if in future we decide to do
something similar.
Link: http://lkml.kernel.org/r/1481667692-14500-3-git-send-email-mawilcox@linuxonhyperv.com
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When replacing an entry with NULL, we need to delete any sibling
entries. Also account deleting exceptional entries properly. Also fix
a bug with radix_tree_iter_replace() where we would fail to remove
entirely freed nodes. Also fix accounting bug when switching between
normal and exceptional entries with replace_slot. Also add testcases
for all these bugs.
Link: http://lkml.kernel.org/r/1480369871-5271-61-git-send-email-mawilcox@linuxonhyperv.com
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Calculate how many nodes we need to allocate to split an old_order entry
into multiple entries, each of size new_order. The test suite checks
that we allocated exactly the right number of nodes; neither too many
(checked by rtp->nr == 0), nor too few (checked by comparing
nr_allocated before and after the call to radix_tree_split()).
Link: http://lkml.kernel.org/r/1480369871-5271-60-git-send-email-mawilcox@linuxonhyperv.com
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This new function splits a larger multiorder entry into smaller entries
(potentially multi-order entries). These entries are initialised to
RADIX_TREE_RETRY to ensure that RCU walkers who see this state aren't
confused. The caller should then call radix_tree_for_each_slot() and
radix_tree_replace_slot() in order to turn these retry entries into the
intended new entries. Tags are replicated from the original multiorder
entry into each new entry.
Link: http://lkml.kernel.org/r/1480369871-5271-59-git-send-email-mawilcox@linuxonhyperv.com
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This new function allows for the replacement of many smaller entries in
the radix tree with one larger multiorder entry. From the point of view
of an RCU walker, they may see a mixture of the smaller entries and the
large entry during the same walk, but they will never see NULL for an
index which was populated before the join.
Link: http://lkml.kernel.org/r/1480369871-5271-58-git-send-email-mawilcox@linuxonhyperv.com
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is an exceptionally complicated function with just one caller
(tag_pages_for_writeback). We devote a large portion of the runtime of
the test suite to testing this one function which has one caller. By
introducing the new function radix_tree_iter_tag_set(), we can eliminate
all of the complexity while keeping the performance. The caller can now
use a fairly standard radix_tree_for_each() loop, and it doesn't need to
worry about tricksy things like 'start' wrapping.
The test suite continues to spend a large amount of time investigating
this function, but now it's testing the underlying primitives such as
radix_tree_iter_resume() and the radix_tree_for_each_tagged() iterator
which are also used by other parts of the kernel.
Link: http://lkml.kernel.org/r/1480369871-5271-57-git-send-email-mawilcox@linuxonhyperv.com
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This rather complicated function can be better implemented as an
iterator. It has only one caller, so move the functionality to the only
place that needs it. Update the test suite to follow the same pattern.
Link: http://lkml.kernel.org/r/1480369871-5271-56-git-send-email-mawilcox@linuxonhyperv.com
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Konstantin Khlebnikov <koct9i@gmail.com>
Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This fixes several interlinked problems with the iterators in the
presence of multiorder entries.
1. radix_tree_iter_next() would only advance by one slot, which would
result in the iterators returning the same entry more than once if
there were sibling entries.
2. radix_tree_next_slot() could return an internal pointer instead of
a user pointer if a tagged multiorder entry was immediately followed by
an entry of lower order.
3. radix_tree_next_slot() expanded to a lot more code than it used to
when multiorder support was compiled in. And I wasn't comfortable with
entry_to_node() being in a header file.
Fixing radix_tree_iter_next() for the presence of sibling entries
necessarily involves examining the contents of the radix tree, so we now
need to pass 'slot' to radix_tree_iter_next(), and we need to change the
calling convention so it is called *before* dropping the lock which
protects the tree. Also rename it to radix_tree_iter_resume(), as some
people thought it was necessary to call radix_tree_iter_next() each time
around the loop.
radix_tree_next_slot() becomes closer to how it looked before multiorder
support was introduced. It only checks to see if the next entry in the
chunk is a sibling entry or a pointer to a node; this should be rare
enough that handling this case out of line is not a performance impact
(and such impact is amortised by the fact that the entry we just
processed was a multiorder entry). Also, radix_tree_next_slot() used to
force a new chunk lookup for untagged entries, which is more expensive
than the out of line sibling entry skipping.
Link: http://lkml.kernel.org/r/1480369871-5271-55-git-send-email-mawilcox@linuxonhyperv.com
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Print the indices of the entries as unsigned (instead of signed)
integers and print the parent node of each entry to help navigate around
larger trees where the layout is not quite so obvious. Print the
indices covered by a node. Rearrange the order of fields printed so the
indices and parents line up for each type of entry.
Link: http://lkml.kernel.org/r/1480369871-5271-53-git-send-email-mawilcox@linuxonhyperv.com
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since this function is specialised to the radix tree, pass in the node
and tag to calculate the address of the bitmap in
radix_tree_find_next_bit() instead of the caller. Likewise, there is no
need to pass in the size of the bitmap.
Link: http://lkml.kernel.org/r/1480369871-5271-52-git-send-email-mawilcox@linuxonhyperv.com
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Similar to node_tag_clear(), factor node_tag_set() out of
radix_tree_range_tag_if_tagged().
Link: http://lkml.kernel.org/r/1480369871-5271-51-git-send-email-mawilcox@linuxonhyperv.com
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I want to be able to reference node->parent after freeing node.
Currently node->parent is in a union with rcu_head, so it is overwritten
when the node is put on the RCU list. We know that private_list is not
referenced after the node is freed, so it is safe for these two members
to share space.
Link: http://lkml.kernel.org/r/1480369871-5271-50-git-send-email-mawilcox@linuxonhyperv.com
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Documenation/ubsan.txt was moved to Documentation/dev-tools/ubsan.rst,
this fixes the reference.
Link: http://lkml.kernel.org/r/1476698152-29340-3-git-send-email-andreas.platschek@opentech.at
Signed-off-by: Andreas Platschek <andreas.platschek@opentech.at>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Documentation on development tools was moved to Documentation/devl-tools
and sphinxified (renamed from .txt to .rst).
References in lib/Kconfig.debug need to be updated to the new location.
Link: http://lkml.kernel.org/r/1476698152-29340-2-git-send-email-andreas.platschek@opentech.at
Signed-off-by: Andreas Platschek <andreas.platschek@opentech.at>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull MD updates from Shaohua Li:
- a raid5 writeback cache feature.
The goal is to aggregate writes to make full stripe write and reduce
read-modify-write. It's helpful for workload which does sequential
write and follows fsync for example. This feature is experimental and
off by default right now.
- FAILFAST support.
This fails IOs to broken raid disks quickly, so can improve latency.
It's mainly for DASD storage, but some patches help normal raid array
too.
- support bad block for raid array with external metadata
- AVX2 instruction support for raid6 parity calculation
- normalize MD info output
- add missing blktrace
- other bug fixes
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (66 commits)
md: separate flags for superblock changes
md: MD_RECOVERY_NEEDED is set for mddev->recovery
md: takeover should clear unrelated bits
md/r5cache: after recovery, increase journal seq by 10000
md/raid5-cache: fix crc in rewrite_data_only_stripes()
md/raid5-cache: no recovery is required when create super-block
md: fix refcount problem on mddev when stopping array.
md/r5cache: do r5c_update_log_state after log recovery
md/raid5-cache: adjust the write position of the empty block if no data blocks
md/r5cache: run_no_space_stripes() when R5C_LOG_CRITICAL == 0
md/raid5: limit request size according to implementation limits
md/raid5-cache: do not need to set STRIPE_PREREAD_ACTIVE repeatedly
md/raid5-cache: remove the unnecessary next_cp_seq field from the r5l_log
md/raid5-cache: release the stripe_head at the appropriate location
md/raid5-cache: use ring add to prevent overflow
md/raid5-cache: remove unnecessary function parameters
raid5-cache: don't set STRIPE_R5C_PARTIAL_STRIPE flag while load stripe into cache
raid5-cache: add another check conditon before replaying one stripe
md/r5cache: enable IRQs on error path
md/r5cache: handle alloc_page failure
...
Pull swiotlb updates from Konrad Rzeszutek Wilk:
- minor fixes (rate limiting), remove certain functions
- support for DMA_ATTR_SKIP_CPU_SYNC which is an optimization
in the DMA API
* 'stable/for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb:
swiotlb: Minor fix-ups for DMA_ATTR_SKIP_CPU_SYNC support
swiotlb: Add support for DMA_ATTR_SKIP_CPU_SYNC
swiotlb-xen: Enforce return of DMA_ERROR_CODE in mapping function
swiotlb: Drop unused functions swiotlb_map_sg and swiotlb_unmap_sg
swiotlb: Rate-limit printing when running out of SW-IOMMU space
Pull workqueue updates from Tejun Heo:
"Mostly patches to initialize workqueue subsystem earlier and get rid
of keventd_up().
The patches were headed for the last merge cycle but got delayed due
to a bug found late minute, which is fixed now.
Also, to help debugging, destroy_workqueue() is more chatty now on a
sanity check failure."
* 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: move wq_numa_init() to workqueue_init()
workqueue: remove keventd_up()
debugobj, workqueue: remove keventd_up() usage
slab, workqueue: remove keventd_up() usage
power, workqueue: remove keventd_up() usage
tty, workqueue: remove keventd_up() usage
mce, workqueue: remove keventd_up() usage
workqueue: make workqueue available early during boot
workqueue: dump workqueue state on sanity check failures in destroy_workqueue()
Here's the new driver core patches for 4.10-rc1.
Big thing here is the nice addition of "functional dependencies" to the
driver core. The idea has been talked about for a very long time, great
job to Rafael for stepping up and implementing it. It's been tested for
longer than the 4.9-rc1 date, we held off on merging it earlier in order
to feel more comfortable about it.
Other than that, it's just a handful of small other patches, some good
cleanups to the mess that is the firmware class code, and we have a test
driver for the deferred probe logic.
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWFAvPQ8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ym3NgCgmhFeWEkp9SDt17YGGavmnzQUlBQAoJlUipJp
PHeQkq15ZWw3wWC9FEvM
=91M1
-----END PGP SIGNATURE-----
Merge tag 'driver-core-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here's the new driver core patches for 4.10-rc1.
Big thing here is the nice addition of "functional dependencies" to
the driver core. The idea has been talked about for a very long time,
great job to Rafael for stepping up and implementing it. It's been
tested for longer than the 4.9-rc1 date, we held off on merging it
earlier in order to feel more comfortable about it.
Other than that, it's just a handful of small other patches, some good
cleanups to the mess that is the firmware class code, and we have a
test driver for the deferred probe logic.
All of these have been in linux-next for a while with no reported
issues"
* tag 'driver-core-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (30 commits)
firmware: Correct handling of fw_state_wait() return value
driver core: Silence device links sphinx warning
firmware: remove warning at documentation generation time
drivers: base: dma-mapping: Fix typo in dmam_alloc_non_coherent comments
driver core: test_async: fix up typo found by 0-day
firmware: move fw_state_is_done() into UHM section
firmware: do not use fw_lock for fw_state protection
firmware: drop bit ops in favor of simple state machine
firmware: refactor loading status
firmware: fix usermode helper fallback loading
driver core: firmware_class: convert to use class_groups
driver core: devcoredump: convert to use class_groups
driver core: class: add class_groups support
kernfs: Declare two local data structures static
driver-core: fix platform_no_drv_owner.cocci warnings
drivers/base/memory.c: Remove unused 'first_page' variable
driver core: add CLASS_ATTR_WO()
drivers: base: cacheinfo: support DT overrides for cache properties
drivers: base: cacheinfo: add pr_fmt logging
drivers: base: cacheinfo: fix boot error message when acpi is enabled
...
Pull block layer updates from Jens Axboe:
"This is the main block pull request this series. Contrary to previous
release, I've kept the core and driver changes in the same branch. We
always ended up having dependencies between the two for obvious
reasons, so makes more sense to keep them together. That said, I'll
probably try and keep more topical branches going forward, especially
for cycles that end up being as busy as this one.
The major parts of this pull request is:
- Improved support for O_DIRECT on block devices, with a small
private implementation instead of using the pig that is
fs/direct-io.c. From Christoph.
- Request completion tracking in a scalable fashion. This is utilized
by two components in this pull, the new hybrid polling and the
writeback queue throttling code.
- Improved support for polling with O_DIRECT, adding a hybrid mode
that combines pure polling with an initial sleep. From me.
- Support for automatic throttling of writeback queues on the block
side. This uses feedback from the device completion latencies to
scale the queue on the block side up or down. From me.
- Support from SMR drives in the block layer and for SD. From Hannes
and Shaun.
- Multi-connection support for nbd. From Josef.
- Cleanup of request and bio flags, so we have a clear split between
which are bio (or rq) private, and which ones are shared. From
Christoph.
- A set of patches from Bart, that improve how we handle queue
stopping and starting in blk-mq.
- Support for WRITE_ZEROES from Chaitanya.
- Lightnvm updates from Javier/Matias.
- Supoort for FC for the nvme-over-fabrics code. From James Smart.
- A bunch of fixes from a whole slew of people, too many to name
here"
* 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
blk-stat: fix a few cases of missing batch flushing
blk-flush: run the queue when inserting blk-mq flush
elevator: make the rqhash helpers exported
blk-mq: abstract out blk_mq_dispatch_rq_list() helper
blk-mq: add blk_mq_start_stopped_hw_queue()
block: improve handling of the magic discard payload
blk-wbt: don't throttle discard or write zeroes
nbd: use dev_err_ratelimited in io path
nbd: reset the setup task for NBD_CLEAR_SOCK
nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
nvme-fabrics: Add target support for FC transport
nvme-fabrics: Add host support for FC transport
nvme-fabrics: Add FC transport LLDD api definitions
nvme-fabrics: Add FC transport FC-NVME definitions
nvme-fabrics: Add FC transport error codes to nvme.h
Add type 0x28 NVME type code to scsi fc headers
nvme-fabrics: patch target code in prep for FC transport support
nvme-fabrics: set sqe.command_id in core not transports
parser: add u64 number parser
nvme-rdma: align to generic ib_event logging helper
...
It's another busy cycle for the docs tree, as the sphinx conversion
continues. Highlights include:
- Further work on PDF output, which remains a bit of a pain but should be
more solid now.
- Five more DocBook template files converted to Sphinx. Only 27 to go...
Lots of plain-text files have also been converted and integrated.
- Images in binary formats have been replaced with more source-friendly
versions.
- Various bits of organizational work, including the renaming of various
files discussed at the kernel summit.
- New documentation for the device_link mechanism.
...and, of course, lots of typo fixes and small updates.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYTbl7AAoJEI3ONVYwIuV63NIP/REwzThnGWFJMRSuq8Ieq2r9
sFSQsaGTGlhyKiDoEooo+SO/Za3uTonjK+e7WZg8mhdiEdamta5aociU/71C1Yy/
T9ur0FhcGblrvZ1NidSDvCLwuECZOMMei7mgLZ9a+KCpc4ANqqTVZSUm1blKcqhF
XelhVXxBa0ar35l/pVzyCxkdNXRWXv+MJZE8hp5XAdTdr11DS7UY9zrZdH31axtf
BZlbYJrvB8WPydU6myTjRpirA17Hu7uU64MsL3bNIEiRQ+nVghEzQC8uxeUCvfVx
r0H5AgGGQeir+e8GEv2T20SPZ+dumXs+y/HehKNb3jS3gV0mo+pKPeUhwLIxr+Zh
QY64gf+jYf5ISHwAJRnU0Ima72ehObzSbx9Dko10nhq2OvbR5f83gjz9t9jKYFU7
RDowICA8lwqyRbHRoVfyoW8CpVhWFpMFu3yNeJMckeTish3m7ANqzaWslbsqIP5G
zxgFMIrVVSbeae+sUeygtEJAnWI09aZ4tuaUXYtGWwu6ikC/3aV6DryP4bthG2LF
A19uV4nMrLuuh8g2wiTHHjMfjYRwvSn+f9yaolwJhwyNDXQzRPy+ZJ3W/6olOkXC
bAxTmVRCW5GA/fmSrfXmW1KbnxlWfP2C62hzZQ09UHxzTHdR97oFLDQdZhKo1uwf
pmSJR0hVeRUmA4uw6+Su
=A0EV
-----END PGP SIGNATURE-----
Merge tag 'docs-4.10' of git://git.lwn.net/linux into drm-misc-next
Backmerge the docs-next branch from Jon into drm-misc so that we can
apply the dma-buf documentation cleanup patches. Git found a conflict
where there was none because both drm-misc and docs had identical
patches to clean up file rename issues in the rst include directives.
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
It's another busy cycle for the docs tree, as the sphinx conversion
continues. Highlights include:
- Further work on PDF output, which remains a bit of a pain but should be
more solid now.
- Five more DocBook template files converted to Sphinx. Only 27 to go...
Lots of plain-text files have also been converted and integrated.
- Images in binary formats have been replaced with more source-friendly
versions.
- Various bits of organizational work, including the renaming of various
files discussed at the kernel summit.
- New documentation for the device_link mechanism.
...and, of course, lots of typo fixes and small updates.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYTbl7AAoJEI3ONVYwIuV63NIP/REwzThnGWFJMRSuq8Ieq2r9
sFSQsaGTGlhyKiDoEooo+SO/Za3uTonjK+e7WZg8mhdiEdamta5aociU/71C1Yy/
T9ur0FhcGblrvZ1NidSDvCLwuECZOMMei7mgLZ9a+KCpc4ANqqTVZSUm1blKcqhF
XelhVXxBa0ar35l/pVzyCxkdNXRWXv+MJZE8hp5XAdTdr11DS7UY9zrZdH31axtf
BZlbYJrvB8WPydU6myTjRpirA17Hu7uU64MsL3bNIEiRQ+nVghEzQC8uxeUCvfVx
r0H5AgGGQeir+e8GEv2T20SPZ+dumXs+y/HehKNb3jS3gV0mo+pKPeUhwLIxr+Zh
QY64gf+jYf5ISHwAJRnU0Ima72ehObzSbx9Dko10nhq2OvbR5f83gjz9t9jKYFU7
RDowICA8lwqyRbHRoVfyoW8CpVhWFpMFu3yNeJMckeTish3m7ANqzaWslbsqIP5G
zxgFMIrVVSbeae+sUeygtEJAnWI09aZ4tuaUXYtGWwu6ikC/3aV6DryP4bthG2LF
A19uV4nMrLuuh8g2wiTHHjMfjYRwvSn+f9yaolwJhwyNDXQzRPy+ZJ3W/6olOkXC
bAxTmVRCW5GA/fmSrfXmW1KbnxlWfP2C62hzZQ09UHxzTHdR97oFLDQdZhKo1uwf
pmSJR0hVeRUmA4uw6+Su
=A0EV
-----END PGP SIGNATURE-----
Merge tag 'docs-4.10' of git://git.lwn.net/linux
Pull documentation update from Jonathan Corbet:
"These are the documentation changes for 4.10.
It's another busy cycle for the docs tree, as the sphinx conversion
continues. Highlights include:
- Further work on PDF output, which remains a bit of a pain but
should be more solid now.
- Five more DocBook template files converted to Sphinx. Only 27 to
go... Lots of plain-text files have also been converted and
integrated.
- Images in binary formats have been replaced with more
source-friendly versions.
- Various bits of organizational work, including the renaming of
various files discussed at the kernel summit.
- New documentation for the device_link mechanism.
... and, of course, lots of typo fixes and small updates"
* tag 'docs-4.10' of git://git.lwn.net/linux: (193 commits)
dma-buf: Extract dma-buf.rst
Update Documentation/00-INDEX
docs: 00-INDEX: document directories/files with no docs
docs: 00-INDEX: remove non-existing entries
docs: 00-INDEX: add missing entries for documentation files/dirs
docs: 00-INDEX: consolidate process/ and admin-guide/ description
scripts: add a script to check if Documentation/00-INDEX is sane
Docs: change sh -> awk in REPORTING-BUGS
Documentation/core-api/device_link: Add initial documentation
core-api: remove an unexpected unident
ppc/idle: Add documentation for powersave=off
Doc: Correct typo, "Introdution" => "Introduction"
Documentation/atomic_ops.txt: convert to ReST markup
Documentation/local_ops.txt: convert to ReST markup
Documentation/assoc_array.txt: convert to ReST markup
docs-rst: parse-headers.pl: cleanup the documentation
docs-rst: fix media cleandocs target
docs-rst: media/Makefile: reorganize the rules
docs-rst: media: build SVG from graphviz files
docs-rst: replace bayer.png by a SVG image
...
Merge updates from Andrew Morton:
- various misc bits
- most of MM (quite a lot of MM material is awaiting the merge of
linux-next dependencies)
- kasan
- printk updates
- procfs updates
- MAINTAINERS
- /lib updates
- checkpatch updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (123 commits)
init: reduce rootwait polling interval time to 5ms
binfmt_elf: use vmalloc() for allocation of vma_filesz
checkpatch: don't emit unified-diff error for rename-only patches
checkpatch: don't check c99 types like uint8_t under tools
checkpatch: avoid multiple line dereferences
checkpatch: don't check .pl files, improve absolute path commit log test
scripts/checkpatch.pl: fix spelling
checkpatch: don't try to get maintained status when --no-tree is given
lib/ida: document locking requirements a bit better
lib/rbtree.c: fix typo in comment of ____rb_erase_color
lib/Kconfig.debug: make CONFIG_STRICT_DEVMEM depend on CONFIG_DEVMEM
MAINTAINERS: add drm and drm/i915 irc channels
MAINTAINERS: add "C:" for URI for chat where developers hang out
MAINTAINERS: add drm and drm/i915 bug filing info
MAINTAINERS: add "B:" for URI where to file bugs
get_maintainer: look for arbitrary letter prefixes in sections
printk: add Kconfig option to set default console loglevel
printk/sound: handle more message headers
printk/btrfs: handle more message headers
printk/kdb: handle more message headers
...
Pull smp hotplug updates from Thomas Gleixner:
"This is the final round of converting the notifier mess to the state
machine. The removal of the notifiers and the related infrastructure
will happen around rc1, as there are conversions outstanding in other
trees.
The whole exercise removed about 2000 lines of code in total and in
course of the conversion several dozen bugs got fixed. The new
mechanism allows to test almost every hotplug step standalone, so
usage sites can exercise all transitions extensively.
There is more room for improvement, like integrating all the
pointlessly different architecture mechanisms of synchronizing,
setting cpus online etc into the core code"
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
tracing/rb: Init the CPU mask on allocation
soc/fsl/qbman: Convert to hotplug state machine
soc/fsl/qbman: Convert to hotplug state machine
zram: Convert to hotplug state machine
KVM/PPC/Book3S HV: Convert to hotplug state machine
arm64/cpuinfo: Convert to hotplug state machine
arm64/cpuinfo: Make hotplug notifier symmetric
mm/compaction: Convert to hotplug state machine
iommu/vt-d: Convert to hotplug state machine
mm/zswap: Convert pool to hotplug state machine
mm/zswap: Convert dst-mem to hotplug state machine
mm/zsmalloc: Convert to hotplug state machine
mm/vmstat: Convert to hotplug state machine
mm/vmstat: Avoid on each online CPU loops
mm/vmstat: Drop get_online_cpus() from init_cpu_node_state/vmstat_cpu_dead()
tracing/rb: Convert to hotplug state machine
oprofile/nmi timer: Convert to hotplug state machine
net/iucv: Use explicit clean up labels in iucv_init()
x86/pci/amd-bus: Convert to hotplug state machine
x86/oprofile/nmi: Convert to hotplug state machine
...
I wanted to wrap a bunch of ida_simple_get calls into their own locking,
until I dug around and read the original commit message. Stuff like
this should imo be added to the kernel doc, let's do that.
Link: http://lkml.kernel.org/r/20161027072216.20411-1-daniel.vetter@ffwll.ch
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In Case 3 of `sibling == parent->rb_right':
Right rotation will not change color of sl and S in the diagram
(i.e. should not change "sl" to "Sl", "S" to "s")
In Case 3 of `sibling == parent->rb_left':
(p) (p)
/ \ / \
S N --> sr N
/ \ /
Sl sr S
/
Sl
This is actually left rotation at "S", not right rotation.
In Case 4 of `sibling == parent->rb_left':
(p) (s)
/ \ / \
S N --> Sl P
/ \ / \
sl (sr) (sr) N
This is actually right rotation at "(p)" + color flips, not left
rotation + color flips.
Link: http://lkml.kernel.org/r/1472391115-3702-1-git-send-email-fykcee1@gmail.com
Signed-off-by: Jie Chen <fykcee1@gmail.com>
Cc: Wei Yang <weiyang@linux.vnet.ibm.com>
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With CONFIG_DEVMEM not set, CONFIG_STRICT_DEVMEM will be useless even if
it is set =y, thus let's update the dependency in Kconfig.
Link: http://lkml.kernel.org/r/20161006051217.GA31027@dhcp-128-65.nay.redhat.com
Signed-off-by: Dave Young <dyoung@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a configuration option to set the default console loglevel. This
is, as before, still possible to override at runtime through bootargs
(loglevel=<x>), sysrq and /proc/printk.
There are cases where adding additional arguments on the commandline is
impractical, and changing the default for the kernel when being built
makes more sense. Provide such a method here, for those who choose to
do so.
Also, while touching this code, clarify the difference between
MESSAGE_LOGLEVEL_DEFAULT and CONSOLE_LOGLEVEL_DEFAULT.
Link: http://lkml.kernel.org/r/1479676829-30031-1-git-send-email-olof@lixom.net
Signed-off-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, we track the shadow entries in the page cache in the upper
bits of the radix_tree_node->count, behind the back of the radix tree
implementation. Because the radix tree code has no awareness of them,
we rely on random subtleties throughout the implementation (such as the
node->count != 1 check in the shrinking code, which is meant to exclude
multi-entry nodes but also happens to skip nodes with only one shadow
entry, as that's accounted in the upper bits). This is error prone and
has, in fact, caused the bug fixed in d3798ae8c6 ("mm: filemap: don't
plant shadow entries without radix tree node").
To remove these subtleties, this patch moves shadow entry tracking from
the upper bits of node->count to the existing counter for exceptional
entries. node->count goes back to being a simple counter of valid
entries in the tree node and can be shrunk to a single byte.
This vastly simplifies the page cache code. All accounting happens
natively inside the radix tree implementation, and maintaining the LRU
linkage of shadow nodes is consolidated into a single function in the
workingset code that is called for leaf nodes affected by a change in
the page cache tree.
This also removes the last user of the __radix_delete_node() return
value. Eliminate it.
Link: http://lkml.kernel.org/r/20161117193211.GE23430@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Support handing __radix_tree_replace() a callback that gets invoked for
all leaf nodes that change or get freed as a result of the slot
replacement, to assist users tracking nodes with node->private_list.
This prepares for putting page cache shadow entries into the radix tree
root again and drastically simplifying the shadow tracking.
Link: http://lkml.kernel.org/r/20161117193134.GD23430@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Page cache shadow entry handling will be a lot simpler when it can use a
single generic replacement function for pages, shadow entries, and
emptying slots.
Make __radix_tree_replace() properly account insertions and deletions in
node->count and garbage collect nodes as they become empty. Then
re-implement radix_tree_delete() on top of it.
Link: http://lkml.kernel.org/r/20161117193058.GC23430@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The bug in khugepaged fixed earlier in this series shows that radix tree
slot replacement is fragile; and it will become more so when not only
NULL<->!NULL transitions need to be caught but transitions from and to
exceptional entries as well. We need checks.
Re-implement radix_tree_replace_slot() on top of the sanity-checked
__radix_tree_replace(). This requires existing callers to also pass the
radix tree root, but it'll warn us when somebody replaces slots with
contents that need proper accounting (transitions between NULL entries,
real entries, exceptional entries) and where a replacement through the
slot pointer would corrupt the radix tree node counts.
Link: http://lkml.kernel.org/r/20161117193021.GB23430@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The way the page cache is sneaking shadow entries of evicted pages into
the radix tree past the node entry accounting and tracking them manually
in the upper bits of node->count is fraught with problems.
These shadow entries are marked in the tree as exceptional entries,
which are a native concept to the radix tree. Maintain an explicit
counter of exceptional entries in the radix tree node. Subsequent
patches will switch shadow entry tracking over to that counter.
DAX and shmem are the other users of exceptional entries. Since slot
replacements that change the entry type from regular to exceptional must
now be accounted, introduce a __radix_tree_replace() function that does
replacement and accounting, and switch DAX and shmem over.
The increase in radix tree node size is temporary. A followup patch
switches the shadow tracking to this new scheme and we'll no longer need
the upper bits in node->count and shrink that back to one byte.
Link: http://lkml.kernel.org/r/20161117192945.GA23430@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull locking updates from Ingo Molnar:
"The tree got pretty big in this development cycle, but the net effect
is pretty good:
115 files changed, 673 insertions(+), 1522 deletions(-)
The main changes were:
- Rework and generalize the mutex code to remove per arch mutex
primitives. (Peter Zijlstra)
- Add vCPU preemption support: add an interface to query the
preemption status of vCPUs and use it in locking primitives - this
optimizes paravirt performance. (Pan Xinhui, Juergen Gross,
Christian Borntraeger)
- Introduce cpu_relax_yield() and remov cpu_relax_lowlatency() to
clean up and improve the s390 lock yielding machinery and its core
kernel impact. (Christian Borntraeger)
- Micro-optimize mutexes some more. (Waiman Long)
- Reluctantly add the to-be-deprecated mutex_trylock_recursive()
interface on a temporary basis, to give the DRM code more time to
get rid of its locking hacks. Any other users will be NAK-ed on
sight. (We turned off the deprecation warning for the time being to
not pollute the build log.) (Peter Zijlstra)
- Improve the rtmutex code a bit, in light of recent long lived
bugs/races. (Thomas Gleixner)
- Misc fixes, cleanups"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
x86/paravirt: Fix bool return type for PVOP_CALL()
x86/paravirt: Fix native_patch()
locking/ww_mutex: Use relaxed atomics
locking/rtmutex: Explain locking rules for rt_mutex_proxy_unlock()/init_proxy_locked()
locking/rtmutex: Get rid of RT_MUTEX_OWNER_MASKALL
x86/paravirt: Optimize native pv_lock_ops.vcpu_is_preempted()
locking/mutex: Break out of expensive busy-loop on {mutex,rwsem}_spin_on_owner() when owner vCPU is preempted
locking/osq: Break out of spin-wait busy waiting loop for a preempted vCPU in osq_lock()
Documentation/virtual/kvm: Support the vCPU preemption check
x86/xen: Support the vCPU preemption check
x86/kvm: Support the vCPU preemption check
x86/kvm: Support the vCPU preemption check
kvm: Introduce kvm_write_guest_offset_cached()
locking/core, x86/paravirt: Implement vcpu_is_preempted(cpu) for KVM and Xen guests
locking/spinlocks, s390: Implement vcpu_is_preempted(cpu)
locking/core, powerpc: Implement vcpu_is_preempted(cpu)
sched/core: Introduce the vcpu_is_preempted(cpu) interface
sched/wake_q: Rename WAKE_Q to DEFINE_WAKE_Q
locking/core: Provide common cpu_relax_yield() definition
locking/mutex: Don't mark mutex_trylock_recursive() as deprecated, temporarily
...
Pull RCU updates from Ingo Molnar:
"The main RCU changes in this development cycle were:
- Miscellaneous fixes, including a change to call_rcu()'s rcu_head
alignment check.
- Security-motivated list consistency checks, which are disabled by
default behind DEBUG_LIST.
- Torture-test updates.
- Documentation updates, yet again just simple changes"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
torture: Prevent jitter from delaying build-only runs
torture: Remove obsolete files from rcutorture .gitignore
rcu: Don't kick unless grace period or request
rcu: Make expedited grace periods recheck dyntick idle state
torture: Trace long read-side delays
rcu: RCU_TRACE enables event tracing as well as debugfs
rcu: Remove obsolete comment from __call_rcu()
rcu: Remove obsolete rcu_check_callbacks() header comment
rcu: Tighten up __call_rcu() rcu_head alignment check
Documentation/RCU: Fix minor typo
documentation: Present updated RCU guarantee
bug: Avoid Kconfig warning for BUG_ON_DATA_CORRUPTION
lib/Kconfig.debug: Fix typo in select statement
lkdtm: Add tests for struct list corruption
bug: Provide toggle for BUG on data corruption
list: Split list_del() debug checking into separate function
rculist: Consolidate DEBUG_LIST for list_add_rcu()
list: Split list_add() debug checking into separate function
This reverts commit 53855d10f4.
It shouldn't have come in yet - it depends on the changes in linux-next
that will come in during the next merge window. As Matthew Wilcox says,
the test suite is broken with the current state without the revert.
Requested-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch "lib/radix-tree: Convert to hotplug state machine" breaks the test
suite as it adds a call to cpuhp_setup_state_nocalls() which is not
currently emulated in the test suite. Add it, and delete the emulation
of the old CPU hotplug mechanism.
Link: http://lkml.kernel.org/r/1480369871-5271-36-git-send-email-mawilcox@linuxonhyperv.com
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull locking fixes from Ingo Molnar:
"Two rtmutex race fixes (which miraculously never triggered, that we
know of), plus two lockdep printk formatting regression fixes"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
lockdep: Fix report formatting
locking/rtmutex: Use READ_ONCE() in rt_mutex_owner()
locking/rtmutex: Prevent dequeue vs. unlock race
locking/selftest: Fix output since KERN_CONT changes
Will be used by the nvme-fabrics FC transport in parsing options
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
copy_from_iter_full(), copy_from_iter_full_nocache() and
csum_and_copy_from_iter_full() - counterparts of copy_from_iter()
et.al., advancing iterator only in case of successful full copy
and returning whether it had been successful or not.
Convert some obvious users. *NOTE* - do not blindly assume that
something is a good candidate for those unless you are sure that
not advancing iov_iter in failure case is the right thing in
this case. Anything that does short read/short write kind of
stuff (or is in a loop, etc.) is unlikely to be a good one.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Couple conflicts resolved here:
1) In the MACB driver, a bug fix to properly initialize the
RX tail pointer properly overlapped with some changes
to support variable sized rings.
2) In XGBE we had a "CONFIG_PM" --> "CONFIG_PM_SLEEP" fix
overlapping with a reorganization of the driver to support
ACPI, OF, as well as PCI variants of the chip.
3) In 'net' we had several probe error path bug fixes to the
stmmac driver, meanwhile a lot of this code was cleaned up
and reorganized in 'net-next'.
4) The cls_flower classifier obtained a helper function in
'net-next' called __fl_delete() and this overlapped with
Daniel Borkamann's bug fix to use RCU for object destruction
in 'net'. It also overlapped with Jiri's change to guard
the rhashtable_remove_fast() call with a check against
tc_skip_sw().
5) In mlx4, a revert bug fix in 'net' overlapped with some
unrelated changes in 'net-next'.
6) In geneve, a stale header pointer after pskb_expand_head()
bug fix in 'net' overlapped with a large reorganization of
the same code in 'net-next'. Since the 'net-next' code no
longer had the bug in question, there was nothing to do
other than to simply take the 'net-next' hunks.
Signed-off-by: David S. Miller <davem@davemloft.net>
Gcc revision 241896 implements use-after-scope detection. Will be
available in gcc 7. Support it in KASAN.
Gcc emits 2 new callbacks to poison/unpoison large stack objects when
they go in/out of scope. Implement the callbacks and add a test.
[dvyukov@google.com: v3]
Link: http://lkml.kernel.org/r/1479998292-144502-1-git-send-email-dvyukov@google.com
Link: http://lkml.kernel.org/r/1479226045-145148-1-git-send-email-dvyukov@google.com
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: <stable@vger.kernel.org> [4.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Drivers, or other modules, that use a mixture of objects (especially
objects embedded within other objects) would like to take advantage of
the debugobjects facilities to help catch misuse. Currently, the
debugobjects interface is only available to builtin drivers and requires
a set of EXPORT_SYMBOL_GPL for use by modules.
I am using the debugobjects in i915.ko to try and catch some invalid
operations on embedded objects. The problem currently only presents
itself across module unload so forcing i915 to be builtin is not an
option.
Link: http://lkml.kernel.org/r/20161122143039.6433-1-chris@chris-wilson.co.uk
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: "Du, Changbin" <changbin.du@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
udplite conflict is resolved by taking what 'net-next' did
which removed the backlog receive method assignment, since
it is no longer necessary.
Two entries were added to the non-priv ethtool operations
switch statement, one in 'net' and one in 'net-next, so
simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the KERN_CONT changes the locking-selftest output is messed up, eg:
----------------------------------------------------------------------------
| spin |wlock |rlock |mutex | wsem | rsem |
--------------------------------------------------------------------------
A-A deadlock:
ok |
ok |
ok |
ok |
ok |
ok |
Use pr_cont() to get it looking normal again:
----------------------------------------------------------------------------
| spin |wlock |rlock |mutex | wsem | rsem |
--------------------------------------------------------------------------
A-A deadlock: ok | ok | ok | ok | ok | ok |
Reported-by: Christian Kujau <lists@nerdbynature.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linuxppc-dev@ozlabs.org
Link: http://lkml.kernel.org/r/1480027528-934-1-git-send-email-mpe@ellerman.id.au
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull RCU updates from Paul E. McKenney:
- Documentation updates, yet again just simple changes.
- Miscellaneous fixes, including a change to call_rcu()'s
rcu_head alignment check.
- Security-motivated list consistency checks, which are
disabled by default behind DEBUG_LIST.
- Torture-test updates.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
All conflicts were simple overlapping changes except perhaps
for the Thunder driver.
That driver has a change_mtu method explicitly for sending
a message to the hardware. If that fails it returns an
error.
Normally a driver doesn't need an ndo_change_mtu method becuase those
are usually just range changes, which are now handled generically.
But since this extra operation is needed in the Thunder driver, it has
to stay.
However, if the message send fails we have to restore the original
MTU before the change because the entire call chain expects that if
an error is thrown by ndo_change_mtu then the MTU did not change.
Therefore code is added to nicvf_change_mtu to remember the original
MTU, and to restore it upon nicvf_update_hw_max_frs() failue.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull sparc fixes from David Miller:
1) With modern networking cards we can run out of 32-bit DMA space, so
support 64-bit DMA addressing when possible on sparc64. From Dave
Tushar.
2) Some signal frame validation checks are inverted on sparc32, fix
from Andreas Larsson.
3) Lockdep tables can get too large in some circumstances on sparc64,
add a way to adjust the size a bit. From Babu Moger.
4) Fix NUMA node probing on some sun4v systems, from Thomas Tai.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
sparc: drop duplicate header scatterlist.h
lockdep: Limit static allocations if PROVE_LOCKING_SMALL is defined
config: Adding the new config parameter CONFIG_PROVE_LOCKING_SMALL for sparc
sunbmac: Fix compiler warning
sunqe: Fix compiler warnings
sparc64: Enable 64-bit DMA
sparc64: Enable sun4v dma ops to use IOMMU v2 APIs
sparc64: Bind PCIe devices to use IOMMU v2 service
sparc64: Initialize iommu_map_table and iommu_pool
sparc64: Add ATU (new IOMMU) support
sparc64: Add FORCE_MAX_ZONEORDER and default to 13
sparc64: fix compile warning section mismatch in find_node()
sparc32: Fix inverted invalid_frame_pointer checks on sigreturns
sparc64: Fix find_node warning if numa node cannot be found
Length of a netlink attribute may be u16 but lengths of basic attributes
are much smaller, so small we can save 16 bytes of .rodata and pocket
change inside .text.
16-bit is worse on x86-64 than 8-bit because of operand size override prefix.
add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-19 (-19)
function old new delta
validate_nla 418 417 -1
nla_policy_len 66 64 -2
nla_attr_minlen 32 16 -16
Total: Before=154865051, After=154865032, chg -0.00%
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJYHmoCAAoJEHm+PkMAQRiG7RMIAI2i7Y5hpL5yCxK5AFaL4u/G
KxXfp1B1UanUTgjOmd7zGqtDYcFX9t7GTTUFixQ7/9Opr4PD9qbnatoDGSc3xjbT
msDgA1B78F1/Q3kHWfeGq32MihQ4mj5NwUCo+igUcUvvWG7mHgzErj/Nh5RoobQX
p/izdpTbrw3GX6xXB8olbG7XWHaVye/+TT3q6+gmgm8I/QEujcLeGoycE0zlhPN8
FG/JX76At/+ZM2Py7Oxo3k+oKL9CHrtOQYDp/wN0uslV5eYvvkZz0/M1HMOGZt+c
gZU5jzM17K7C4Nzo06WAuBU9wUBGc25m+cPicLlOmljnzfU+f50SKaDjZq3p7QI=
=2KUF
-----END PGP SIGNATURE-----
Merge tag 'v4.9-rc4' into sound
Bring in -rc4 patches so I can successfully merge the sound doc changes.
This new config parameter limits the space used for "Lock debugging:
prove locking correctness" by about 4MB. The current sparc systems have
the limitation of 32MB size for kernel size including .text, .data and
.bss sections. With PROVE_LOCKING feature, the kernel size could grow
beyond this limit and causing system boot-up issues. With this option,
kernel limits the size of the entries of lock_chains, stack_trace etc.,
so that kernel fits in required size limit. This is not visible to user
and only used for sparc.
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Acked-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
iov_iter_advance() needs to decrement iter->count by the number of
bytes we'd moved beyond. Normal flavours do that, but ITER_PIPE
doesn't and ITER_PIPE generic_file_read_iter() for O_DIRECT files
ends up with a bogus fallback to page cache read, resulting in incorrect
values for file offset and bytes read.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
With the s390 special case of a yielding cpu_relax() implementation gone,
we can now remove all users of cpu_relax_lowlatency() and replace them
with cpu_relax().
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Noam Camus <noamc@ezchip.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: virtualization@lists.linux-foundation.org
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1477386195-32736-5-git-send-email-borntraeger@de.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The commit brings the RCU_TRACE Kconfig option's help text up to date
by noting that it enables additional event tracing as well as debugfs.
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
[ paulmck: Do some wordsmithing. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Some drivers would like to record stacktraces in order to aide leak
tracing. As stackdepot already provides a facility for only storing the
unique traces, thereby reducing the memory required, export that
functionality for use by drivers.
The code was originally created for KASAN and moved under lib in commit
cd11016e5f ("mm, kasan: stackdepot implementation. Enable stackdepot
for SLAB") so that it could be shared with mm/. In turn, we want to
share it now with drivers.
Link: http://lkml.kernel.org/r/20161108133209.22704-1-chris@chris-wilson.co.uk
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I am updating the paths so that instead of trying to pass
"attr | DMA_ATTR_SKIP_CPU_SYNC" we instead just OR the value into attr and
then pass it since attr will not be used after we make the unmap call.
I realized there was one spot I had missed when I was applying the DMA
attribute to the DMA mapping exception handling. This change corrects that.
Finally it looks like there is a stray blank line at the end of the
swiotlb_unmap_sg_attrs function that can be dropped.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad@kernel.org>
Install the callbacks via the state machine.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20161103145021.28528-6-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Install the callbacks via the state machine and let the core invoke
the callbacks on the already online CPUs.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20161103145021.28528-5-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Implement the AVX2 optimization of RAID6 xor_syndrome functions which is
simply based on sse2.c written by hpa.
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Yuanhan Liu <yuanhan.liu@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Gayatri Kammela <gayatri.kammela@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
As a first step to making DMA_ATTR_SKIP_CPU_SYNC apply to architectures
beyond just ARM I need to make it so that the swiotlb will respect the
flag. In order to do that I also need to update the swiotlb-xen since it
heavily makes use of the functionality.
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad@kernel.org>
There are no users for swiotlb_map_sg or swiotlb_unmap_sg so we might as
well just drop them.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Konrad Rzeszutek Wilk <konrad@kernel.org>
If the system runs out of SW-IOMMU space, changes are high successive
requests will fail, too, flooding the kernel log. This is true
especially for streaming DMA, which is typically used repeatedly outside
the driver's initialization routine. Add rate-limiting to fix this.
While at it, get rid of the open-coded dev_name() handling by using the
appropriate dev_err_*() variant.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Konrad Rzeszutek Wilk <konrad@kernel.org>
The CONFIG_DEBUG_LIST option is normally meant for kernel developers
rather than production machines and is guarded by CONFIG_DEBUG_KERNEL.
In contrast, the newly added CONFIG_BUG_ON_DATA_CORRUPTION is meant
for security hardening and may be used on systems that intentionally
do not enable CONFIG_DEBUG_KERNEL. In this configuration, we get
a warning from Kconfig about the mismatched dependencies:
warning: (BUG_ON_DATA_CORRUPTION) selects DEBUG_LIST which has unmet direct dependencies (DEBUG_KERNEL)
This annotates the DEBUG_LIST option to be selectable by
BUG_ON_DATA_CORRUPTION when DEBUG_KERNEL is disabled.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 40cd725cfc7f ("bug: Provide toggle for BUG on data corruption")
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Commit 484f29c7430b3 ("bug: Provide toggle for BUG on data corruption")
added a Kconfig select statement on CONFIG_DEBUG_LIST, but the CONFIG_
prefix is only used in Make and C(PP) syntax. Remove the CONFIG_ prefix
to correctly select the Kconfig option DEBUG_LIST.
Signed-off-by: Valentin Rothberg <valentinrothberg@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Kees Cook <keescook@chromium.org>
The kernel checks for cases of data structure corruption under some
CONFIGs (e.g. CONFIG_DEBUG_LIST). When corruption is detected, some
systems may want to BUG() immediately instead of letting the system run
with known corruption. Usually these kinds of manipulation primitives can
be used by security flaws to gain arbitrary memory write control. This
provides a new config CONFIG_BUG_ON_DATA_CORRUPTION and a corresponding
macro CHECK_DATA_CORRUPTION for handling these situations. Notably, even
if not BUGing, the kernel should not continue processing the corrupted
structure.
This is inspired by similar hardening by Syed Rameez Mustafa in MSM
kernels, and in PaX and Grsecurity, which is likely in response to earlier
removal of the BUG calls in commit 924d9addb9 ("list debugging: use
WARN() instead of BUG()").
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
Similar to the list_add() debug consolidation, this commit consolidates
the debug checking performed during CONFIG_DEBUG_LIST into a new
__list_del_entry_valid() function, and stops list updates when corruption
is found.
Refactored from same hardening in PaX and Grsecurity.
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
This commit consolidates the debug checking for list_add_rcu() into the
new single __list_add_valid() debug function. Notably, this commit fixes
the sanity check that was added in commit 17a801f4bf ("list_debug:
WARN for adding something already in the list"), which wasn't checking
RCU-protected lists.
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
Right now, __list_add() code is repeated either in list.h or in
list_debug.c, but the only differences between the two versions
are the debug checks. This commit therefore extracts these debug
checks into a separate __list_add_valid() function and consolidates
__list_add(). Additionally this new __list_add_valid() function will stop
list manipulations if a corruption is detected, instead of allowing for
further corruption that may lead to even worse conditions.
This is slight refactoring of the same hardening done in PaX and Grsecurity.
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
Pull networking fixes from David Miller:
"Lots of fixes, mostly drivers as is usually the case.
1) Don't treat zero DMA address as invalid in vmxnet3, from Alexey
Khoroshilov.
2) Fix element timeouts in netfilter's nft_dynset, from Anders K.
Pedersen.
3) Don't put aead_req crypto struct on the stack in mac80211, from
Ard Biesheuvel.
4) Several uninitialized variable warning fixes from Arnd Bergmann.
5) Fix memory leak in cxgb4, from Colin Ian King.
6) Fix bpf handling of VLAN header push/pop, from Daniel Borkmann.
7) Several VRF semantic fixes from David Ahern.
8) Set skb->protocol properly in ip6_tnl_xmit(), from Eli Cooper.
9) Socket needs to be locked in udp_disconnect(), from Eric Dumazet.
10) Div-by-zero on 32-bit fix in mlx4 driver, from Eugenia Emantayev.
11) Fix stale link state during failover in NCSCI driver, from Gavin
Shan.
12) Fix netdev lower adjacency list traversal, from Ido Schimmel.
13) Propvide proper handle when emitting notifications of filter
deletes, from Jamal Hadi Salim.
14) Memory leaks and big-endian issues in rtl8xxxu, from Jes Sorensen.
15) Fix DESYNC_FACTOR handling in ipv6, from Jiri Bohac.
16) Several routing offload fixes in mlxsw driver, from Jiri Pirko.
17) Fix broadcast sync problem in TIPC, from Jon Paul Maloy.
18) Validate chunk len before using it in SCTP, from Marcelo Ricardo
Leitner.
19) Revert a netns locking change that causes regressions, from Paul
Moore.
20) Add recursion limit to GRO handling, from Sabrina Dubroca.
21) GFP_KERNEL in irq context fix in ibmvnic, from Thomas Falcon.
22) Avoid accessing stale vxlan/geneve socket in data path, from
Pravin Shelar"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (189 commits)
geneve: avoid using stale geneve socket.
vxlan: avoid using stale vxlan socket.
qede: Fix out-of-bound fastpath memory access
net: phy: dp83848: add dp83822 PHY support
enic: fix rq disable
tipc: fix broadcast link synchronization problem
ibmvnic: Fix missing brackets in init_sub_crq_irqs
ibmvnic: Fix releasing of sub-CRQ IRQs in interrupt context
Revert "ibmvnic: Fix releasing of sub-CRQ IRQs in interrupt context"
arch/powerpc: Update parameters for csum_tcpudp_magic & csum_tcpudp_nofold
net/mlx4_en: Save slave ethtool stats command
net/mlx4_en: Fix potential deadlock in port statistics flow
net/mlx4: Fix firmware command timeout during interrupt test
net/mlx4_core: Do not access comm channel if it has not yet been initialized
net/mlx4_en: Fix panic during reboot
net/mlx4_en: Process all completions in RX rings after port goes up
net/mlx4_en: Resolve dividing by zero in 32-bit system
net/mlx4_core: Change the default value of enable_qos
net/mlx4_core: Avoid setting ports to auto when only one port type is supported
net/mlx4_core: Fix the resource-type enum in res tracker to conform to FW spec
...
In the first case, rename the second variable to correspond to the name
found in the function parameter list.
In the remaining cases, reorder the variables to correspond to their order
in the parameter list.
Issue detected using Coccinelle (http://coccinelle.lip6.fr/)
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
gen_pool_alloc_algo() iterates over the chunks of a pool trying to find
a contiguous block of memory that satisfies the allocation request.
The shortcut
if (size > atomic_read(&chunk->avail))
continue;
makes the loop skip over chunks that do not have enough bytes left to
fulfill the request. There are two situations, though, where an
allocation might still fail:
(1) The available memory is not contiguous, i.e. the request cannot
be fulfilled due to external fragmentation.
(2) A race condition. Another thread runs the same code concurrently
and is quicker to grab the available memory.
In those situations, the loop calls pool->algo() to search the entire
chunk, and pool->algo() returns some value that is >= end_bit to
indicate that the search failed. This return value is then assigned to
start_bit. The variables start_bit and end_bit describe the range that
should be searched, and this range should be reset for every chunk that
is searched. Today, the code fails to reset start_bit to 0. As a
result, prefixes of subsequent chunks are ignored. Memory allocations
might fail even though there is plenty of room left in these prefixes of
those other chunks.
Fixes: 7f184275aa ("lib, Make gen_pool memory allocator lockless")
Link: http://lkml.kernel.org/r/1477420604-28918-1-git-send-email-danielmentz@google.com
Signed-off-by: Daniel Mentz <danielmentz@google.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KASAN uses stackdepot to memorize stacks for all kmalloc/kfree calls.
Current stackdepot capacity is 16MB (1024 top level entries x 4 pages on
second level). Size of each stack is (num_frames + 3) * sizeof(long).
Which gives us ~84K stacks. This capacity was chosen empirically and it
is enough to run kernel normally.
However, when lots of configs are enabled and a fuzzer tries to maximize
code coverage, it easily hits the limit within tens of minutes. I've
tested for long a time with number of top level entries bumped 4x
(4096). And I think I've seen overflow only once. But I don't have all
configs enabled and code coverage has not reached maximum yet. So bump
it 8x to 8192.
Since we have two-level table, memory cost of this is very moderate --
currently the top-level table is 8KB, with this patch it is 64KB, which
is negligible under KASAN.
Here is some approx math.
128MB allows us to memorize ~670K stacks (assuming stack is ~200b).
I've grepped kernel for kmalloc|kfree|kmem_cache_alloc|kmem_cache_free|
kzalloc|kstrdup|kstrndup|kmemdup and it gives ~60K matches. Most of
alloc/free call sites are reachable with only one stack. But some
utility functions can have large fanout. Assuming average fanout is 5x,
total number of alloc/free stacks is ~300K.
Link: http://lkml.kernel.org/r/1476458416-122131-1-git-send-email-dvyukov@google.com
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When building with the latent_entropy plugin, set the default
CONFIG_FRAME_WARN to 2048, since some __init functions have many basic
blocks that, when instrumented by the latent_entropy plugin, grow beyond
1024 byte stack size on 32-bit builds.
Link: http://lkml.kernel.org/r/20161018211216.GA39687@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Cc: Emese Revfy <re.emese@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Michal Marek <mmarek@suse.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The previous patch renamed several files that are cross-referenced
along the Kernel documentation. Adjust the links to point to
the right places.
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
After commit 636c262808 ("net: skbuff: Remove errornous length
validation in skb_vlan_pop()") mentioned test case stopped working,
throwing a -12 (ENOMEM) return code. The issue however is not due to
636c262808, but rather due to a buggy test case that got uncovered
from the change in behaviour in 636c262808.
The data_size of that test case for the skb was set to 1. In the
bpf_fill_ld_abs_vlan_push_pop() handler bpf insns are generated that
loop with: reading skb data, pushing 68 tags, reading skb data,
popping 68 tags, reading skb data, etc, in order to force a skb
expansion and thus trigger that JITs recache skb->data. Problem is
that initial data_size is too small.
While before 636c262808, the test silently bailed out due to the
skb->len < VLAN_ETH_HLEN check with returning 0, and now throwing an
error from failing skb_ensure_writable(). Set at least minimum of
ETH_HLEN as an initial length so that on first push of data, equivalent
pop will succeed.
Fixes: 4d9c5c53ac ("test_bpf: add bpf_skb_vlan_push/pop() tests")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
extract as much possible uncertainty from a running system at boot time as
possible, hoping to capitalize on any possible variation in CPU operation
(due to runtime data differences, hardware differences, SMP ordering,
thermal timing variation, cache behavior, etc).
At the very least, this plugin is a much more comprehensive example for
how to manipulate kernel code using the gcc plugin internals.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: Kees Cook <kees@outflux.net>
iQIcBAABCgAGBQJX/BAFAAoJEIly9N/cbcAmzW8QALFbCs7EFFkML+M/M/9d8zEk
1QbUs/z8covJTTT1PjSdw7JUrAMulI3S00owpcQVd/PcWjRPU80QwfsXBgIB0tvC
Kub2qxn6Oaf+kTB646zwjFgjdCecw/USJP+90nfcu2+LCnE8ReclKd1aUee+Bnhm
iDEUyH2ONIoWq6ta2Z9sA7+E4y2ZgOlmW0iga3Mnf+OcPtLE70fWPoe5E4g9DpYk
B+kiPDrD9ql5zsHaEnKG1ldjiAZ1L6Grk8rGgLEXmbOWtTOFmnUhR+raK5NA/RCw
MXNuyPay5aYPpqDHFm+OuaWQAiPWfPNWM3Ett4k0d9ZWLixTcD1z68AciExwk7aW
SEA8b1Jwbg05ZNYM7NJB6t6suKC4dGPxWzKFOhmBicsh2Ni5f+Az0BQL6q8/V8/4
8UEqDLuFlPJBB50A3z5ngCVeYJKZe8Bg/Swb4zXl6mIzZ9darLzXDEV6ystfPXxJ
e1AdBb41WC+O2SAI4l64yyeswkGo3Iw2oMbXG5jmFl6wY/xGp7dWxw7gfnhC6oOh
afOT54p2OUDfSAbJaO0IHliWoIdmE5ZYdVYVU9Ek+uWyaIwcXhNmqRg+Uqmo32jf
cP5J9x2kF3RdOcbSHXmFp++fU+wkhBtEcjkNpvkjpi4xyA47IWS7lrVBBebrCq9R
pa/A7CNQwibIV6YD8+/p
=1dUK
-----END PGP SIGNATURE-----
Merge tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull gcc plugins update from Kees Cook:
"This adds a new gcc plugin named "latent_entropy". It is designed to
extract as much possible uncertainty from a running system at boot
time as possible, hoping to capitalize on any possible variation in
CPU operation (due to runtime data differences, hardware differences,
SMP ordering, thermal timing variation, cache behavior, etc).
At the very least, this plugin is a much more comprehensive example
for how to manipulate kernel code using the gcc plugin internals"
* tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
latent_entropy: Mark functions with __latent_entropy
gcc-plugins: Add latent_entropy plugin
Pull more misc uaccess and vfs updates from Al Viro:
"The rest of the stuff from -next (more uaccess work) + assorted fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
score: traps: Add missing include file to fix build error
fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths
fs/super.c: fix race between freeze_super() and thaw_super()
overlayfs: Fix setting IOP_XATTR flag
iov_iter: kernel-doc import_iovec() and rw_copy_check_uvector()
blackfin: no access_ok() for __copy_{to,from}_user()
arm64: don't zero in __copy_from_user{,_inatomic}
arm: don't zero in __copy_from_user_inatomic()/__copy_from_user()
arc: don't leak bits of kernel stack into coredump
alpha: get rid of tail-zeroing in __copy_user()
Both import_iovec() and rw_copy_check_uvector() take an array
(typically small and on-stack) which is used to hold an iovec array copy
from userspace. This is to avoid an expensive memory allocation in the
fast path (i.e. few iovec elements).
The caller may have to check whether these functions actually used
the provided buffer or allocated a new one -- but this differs between
the too. Let's just add a kernel doc to clarify what the semantics are
for each function.
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This update consists of:
- Fixes and improvements to existing tests
- Moving code from Documentation to selftests, samples, and tools.
Moves dnotify_test, prctl, ptp, vDSO, ia64, watchdog, and networking
tests from Documentation to selftests.
Moves mic/mpssd, misc-devices/mei, timers, watchdog, auxdisplay, and
blackfin examples from Documentation to samples.
Moves accounting, laptops/dslm, and pcmcia/crc32hash tools from
Documentation to tools.
Deletes BUILD_DOCSRC and its dependencies.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJX/6zUAAoJEAsCRMQNDUMczIEP/0kH+yjJ3El4GYIokspR1/UU
++sy4XMzrD1UPy90v+ftcg4ss5R80r0v7EZ59k1UjDJSZ6WATHHGoZKCS2Dy3xcq
i+0vm7Bawh7YWrXD3TunwaL97lwb2DdVTSxRXuU4Hfv+oVynUfh/+ZlCH6RCM2nm
ZJE5PDYiq4nTVSRqFB2FyRE6yay5dPvpQ2ArwnSEw+ku4C+ZdGTGCWzS+aZBwZM/
ykePkGLVRXz9FsWTCmipJzYu0Z/M4xEGlfXQZiiLG2HicbJNP6AqJImbQrANm+TW
RFigYpofdhr9XG5TKTLIudaRt9qB6BE0mYEApZXH8U7NrHElfO9BBMEwzajl0V/2
q/r5iej/CJult3zsfkhdHo7GLXpOaDLyoXiUI6UTgL0XOdWLAWTqDYx4JJz9sXxp
B9dwKJeP5HLipk6FMkAHgJM90JKQFd/nLDKxeWexbMu/b/yQ2C9AR7NpdQ+c1X7I
8W8UNEi/fnK75+r4t3NfeD2/5boq/jwujSKEMDQm/3R8L8EFYYb/TRoujFn89Na3
wbZLV3hBL+KQ5lRyIx7X8RKyVJv1nlo9Wh57ItJed6zvGp5EmsI8w+DER2RfbO2c
HR2JPDKSxmU8O2WBfDW5QoiPQH8Lssd147Ir0UFE7mwBXgWWsmxJxDpufizAXwyJ
qnELJ9X3UFIdydtoObLr
=60kH
-----END PGP SIGNATURE-----
Merge tag 'linux-kselftest-4.9-rc1-update' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kselftest updates from Shuah Khan:
"This update consists of:
- Fixes and improvements to existing tests
- Moving code from Documentation to selftests, samples, and tools:
* Moves dnotify_test, prctl, ptp, vDSO, ia64, watchdog, and
networking tests from Documentation to selftests.
* Moves mic/mpssd, misc-devices/mei, timers, watchdog, auxdisplay,
and blackfin examples from Documentation to samples.
* Moves accounting, laptops/dslm, and pcmcia/crc32hash tools from
Documentation to tools.
* Deletes BUILD_DOCSRC and its dependencies"
* tag 'linux-kselftest-4.9-rc1-update' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (21 commits)
selftests/futex: Check ANSI terminal color support
Doc: update 00-INDEX files to reflect the runnable code move
samples: move blackfin gptimers-example from Documentation
tools: move pcmcia crc32hash tool from Documentation
tools: move laptops dslm tool from Documentation
tools: move accounting tool from Documentation
samples: move auxdisplay example code from Documentation
samples: move watchdog example code from Documentation
samples: move timers example code from Documentation
samples: move misc-devices/mei example code from Documentation
samples: move mic/mpssd example code from Documentation
selftests: Move networking/timestamping from Documentation
selftests: move watchdog tests from Documentation/watchdog
selftests: move ia64 tests from Documentation/ia64
selftests: move vDSO tests from Documentation/vDSO
selftests: move ptp tests from Documentation/ptp
selftests: move prctl tests from Documentation/prctl
selftests: move dnotify_test from Documentation/filesystems
selftests/timers: Add missing error code assignment before test
selftests/zram: replace ZRAM_LZ4_COMPRESS
...
Pull percpu updates from Tejun Heo:
- Nick improved generic implementations of percpu operations which
modify the variable and return so that they calculate the physical
address only once.
- percpu_ref percpu <-> atomic mode switching improvements. The
patchset was originally posted about a year ago but fell through the
crack.
- misc non-critical fixes.
* 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
mm/percpu.c: fix potential memory leakage for pcpu_embed_first_chunk()
mm/percpu.c: correct max_distance calculation for pcpu_embed_first_chunk()
percpu: eliminate two sparse warnings
percpu: improve generic percpu modify-return implementation
percpu-refcount: init ->confirm_switch member properly
percpu_ref: allow operation mode switching operations to be called concurrently
percpu_ref: restructure operation mode switching
percpu_ref: unify staggered atomic switching wait behavior
percpu_ref: reorganize __percpu_ref_switch_to_atomic() and relocate percpu_ref_switch_to_atomic()
percpu_ref: remove unnecessary RCU grace period for staggered atomic switching confirmation
There's no point in collecting coverage from lib/stackdepot.c, as it is
not a function of syscall inputs. Disabling kcov instrumentation for that
file will reduce the coverage noise level.
Link: http://lkml.kernel.org/r/1474640972-104131-1-git-send-email-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Today there are platforms with many CPUs (up to 4K). Trying to boot only
part of the CPUs may result in too long string.
For example lets take NPS platform that is part of arch/arc. This
platform have SMP system with 256 cores each with 16 HW threads (SMT
machine) where HW thread appears as CPU to the kernel. In this example
there is total of 4K CPUs. When one tries to boot only part of the HW
threads from each core the string representing the map may be long... For
example if for sake of performance we decided to boot only first half of
HW threads of each core the map will look like:
0-7,16-23,32-39,...,4080-4087
This patch introduce new syntax to accommodate with such use case. I
added an optional postfix to a range of CPUs which will choose according
to given modulo the desired range of reminders i.e.:
<cpus range>:sed_size/group_size
For example, above map can be described in new syntax like this:
0-4095:8/16
Note that this patch is backward compatible with current syntax.
[akpm@linux-foundation.org: rework documentation]
Link: http://lkml.kernel.org/r/1473579629-4283-1-git-send-email-noamca@mellanox.com
Signed-off-by: Noam Camus <noamca@mellanox.com>
Cc: David Decotigny <decot@googlers.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: David S. Miller <davem@davemloft.net>
Cc: Pan Xinhui <xinhui@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Set "overflow" bit upon encountering it instead of postponing to the end
of the conversion. Somehow gcc unwedges itself and generates better code:
$ ./scripts/bloat-o-meter ../vmlinux-000 ../obj/vmlinux
_parse_integer 177 139 -38
Inspired by patch from Zhaoxiu Zeng.
Link: http://lkml.kernel.org/r/20160826221920.GA1909@p183.telecom.by
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The strncpy_from_user() accessor is effectively a copy_from_user()
specialised to copy strings, terminating early at a NUL byte if possible.
In other respects it is identical, and can be used to copy an arbitrarily
large buffer from userspace into the kernel. Conceptually, it exposes a
similar attack surface.
As with copy_from_user(), we check the destination range when the kernel
is built with KASAN, but unlike copy_from_user() we do not check the
destination buffer when using HARDENED_USERCOPY. As strncpy_from_user()
calls get_user() in a loop, we must call check_object_size() explicitly.
This patch adds this instrumentation to strncpy_from_user(), per the same
rationale as with the regular copy_from_user(). In the absence of
hardened usercopy this will have no impact as the instrumentation expands
to an empty static inline function.
Link: http://lkml.kernel.org/r/1472221903-31181-1-git-send-email-mark.rutland@arm.com
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
it actually worked only when requested area ended on the page boundary...
Reported-by: Marco Grassi <marco.gra@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The __latent_entropy gcc attribute can be used only on functions and
variables. If it is on a function then the plugin will instrument it for
gathering control-flow entropy. If the attribute is on a variable then
the plugin will initialize it with random contents. The variable must
be an integer, an integer array type or a structure with integer fields.
These specific functions have been selected because they are init
functions (to help gather boot-time entropy), are called at unpredictable
times, or they have variable loops, each of which provide some level of
latent entropy.
Signed-off-by: Emese Revfy <re.emese@gmail.com>
[kees: expanded commit message]
Signed-off-by: Kees Cook <keescook@chromium.org>
Pull misc vfs updates from Al Viro:
"Assorted misc bits and pieces.
There are several single-topic branches left after this (rename2
series from Miklos, current_time series from Deepa Dinamani, xattr
series from Andreas, uaccess stuff from from me) and I'd prefer to
send those separately"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
proc: switch auxv to use of __mem_open()
hpfs: support FIEMAP
cifs: get rid of unused arguments of CIFSSMBWrite()
posix_acl: uapi header split
posix_acl: xattr representation cleanups
fs/aio.c: eliminate redundant loads in put_aio_ring_file
fs/internal.h: add const to ns_dentry_operations declaration
compat: remove compat_printk()
fs/buffer.c: make __getblk_slow() static
proc: unsigned file descriptors
fs/file: more unsigned file descriptors
fs: compat: remove redundant check of nr_segs
cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
cifs: don't use memcpy() to copy struct iov_iter
get rid of separate multipage fault-in primitives
fs: Avoid premature clearing of capabilities
fs: Give dentry to inode_change_ok() instead of inode
fuse: Propagate dentry down to inode_change_ok()
ceph: Propagate dentry down to inode_change_ok()
xfs: Propagate dentry down to inode_change_ok()
...
Move blackfin gptimers-example to samples and remove it from Documentation
Makefile. Update samples Kconfig and Makefile to build gptimers-example.
blackfin is the last CONFIG_BUILD_DOCSRC target in Documentation/Makefile.
Hence this patch also includes changes to remove CONFIG_BUILD_DOCSRC from
Makefile and lib/Kconfig.debug and updates VIDEO_PCI_SKELETON dependency
on BUILD_DOCSRC.
Documentation/Makefile is not deleted to avoid braking make htmldocs and
make distclean.
Acked-by: Michal Marek <mmarek@suse.com>
Acked-by: Jonathan Corbet <corbet@lwn.net>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reported-by: Valentin Rothberg <valentinrothberg@gmail.com>
Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
Merge updates from Andrew Morton:
- fsnotify updates
- ocfs2 updates
- all of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (127 commits)
console: don't prefer first registered if DT specifies stdout-path
cred: simpler, 1D supplementary groups
CREDITS: update Pavel's information, add GPG key, remove snail mail address
mailmap: add Johan Hovold
.gitattributes: set git diff driver for C source code files
uprobes: remove function declarations from arch/{mips,s390}
spelling.txt: "modeled" is spelt correctly
nmi_backtrace: generate one-line reports for idle cpus
arch/tile: adopt the new nmi_backtrace framework
nmi_backtrace: do a local dump_stack() instead of a self-NMI
nmi_backtrace: add more trigger_*_cpu_backtrace() methods
min/max: remove sparse warnings when they're nested
Documentation/filesystems/proc.txt: add more description for maps/smaps
mm, proc: fix region lost in /proc/self/smaps
proc: fix timerslack_ns CAP_SYS_NICE check when adjusting self
proc: add LSM hook checks to /proc/<tid>/timerslack_ns
proc: relax /proc/<tid>/timerslack_ns capability requirements
meminfo: break apart a very long seq_printf with #ifdefs
seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char
proc: faster /proc/*/status
...
When doing an nmi backtrace of many cores, most of which are idle, the
output is a little overwhelming and very uninformative. Suppress
messages for cpus that are idling when they are interrupted and just
emit one line, "NMI backtrace for N skipped: idling at pc 0xNNN".
We do this by grouping all the cpuidle code together into a new
.cpuidle.text section, and then checking the address of the interrupted
PC to see if it lies within that section.
This commit suitably tags x86 and tile idle routines, and only adds in
the minimal framework for other architectures.
Link: http://lkml.kernel.org/r/1472487169-14923-5-git-send-email-cmetcalf@mellanox.com
Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org> [arm]
Tested-by: Petr Mladek <pmladek@suse.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently on arm there is code that checks whether it should call
dump_stack() explicitly, to avoid trying to raise an NMI when the
current context is not preemptible by the backtrace IPI. Similarly, the
forthcoming arch/tile support uses an IPI mechanism that does not
support generating an NMI to self.
Accordingly, move the code that guards this case into the generic
mechanism, and invoke it unconditionally whenever we want a backtrace of
the current cpu. It seems plausible that in all cases, dump_stack()
will generate better information than generating a stack from the NMI
handler. The register state will be missing, but that state is likely
not particularly helpful in any case.
Or, if we think it is helpful, we should be capturing and emitting the
current register state in all cases when regs == NULL is passed to
nmi_cpu_backtrace().
Link: http://lkml.kernel.org/r/1472487169-14923-3-git-send-email-cmetcalf@mellanox.com
Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org> [arm]
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Aaron Tomlin <atomlin@redhat.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "improvements to the nmi_backtrace code" v9.
This patch series modifies the trigger_xxx_backtrace() NMI-based remote
backtracing code to make it more flexible, and makes a few small
improvements along the way.
The motivation comes from the task isolation code, where there are
scenarios where we want to be able to diagnose a case where some cpu is
about to interrupt a task-isolated cpu. It can be helpful to see both
where the interrupting cpu is, and also an approximation of where the
cpu that is being interrupted is. The nmi_backtrace framework allows us
to discover the stack of the interrupted cpu.
I've tested that the change works as desired on tile, and build-tested
x86, arm, mips, and sparc64. For x86 I confirmed that the generic
cpuidle stuff as well as the architecture-specific routines are in the
new cpuidle section. For arm, mips, and sparc I just build-tested it
and made sure the generic cpuidle routines were in the new cpuidle
section, but I didn't attempt to figure out which the platform-specific
idle routines might be. That might be more usefully done by someone
with platform experience in follow-up patches.
This patch (of 4):
Currently you can only request a backtrace of either all cpus, or all
cpus but yourself. It can also be helpful to request a remote backtrace
of a single cpu, and since we want that, the logical extension is to
support a cpumask as the underlying primitive.
This change modifies the existing lib/nmi_backtrace.c code to take a
cpumask as its basic primitive, and modifies the linux/nmi.h code to use
the new "cpumask" method instead.
The existing clients of nmi_backtrace (arm and x86) are converted to
using the new cpumask approach in this change.
The other users of the backtracing API (sparc64 and mips) are converted
to use the cpumask approach rather than the all/allbutself approach.
The mips code ignored the "include_self" boolean but with this change it
will now also dump a local backtrace if requested.
Link: http://lkml.kernel.org/r/1472487169-14923-2-git-send-email-cmetcalf@mellanox.com
Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org> [arm]
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This came to light when implementing native 64-bit atomics for ARCv2.
The atomic64 self-test code uses CONFIG_ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE
to check whether atomic64_dec_if_positive() is available. It seems it
was needed when not every arch defined it. However as of current code
the Kconfig option seems needless
- for CONFIG_GENERIC_ATOMIC64 it is auto-enabled in lib/Kconfig and a
generic definition of API is present lib/atomic64.c
- arches with native 64-bit atomics select it in arch/*/Kconfig and
define the API in their headers
So I see no point in keeping the Kconfig option
Compile tested for:
- blackfin (CONFIG_GENERIC_ATOMIC64)
- x86 (!CONFIG_GENERIC_ATOMIC64)
- ia64
Link: http://lkml.kernel.org/r/1473703083-8625-3-git-send-email-vgupta@synopsys.com
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Zhaoxiu Zeng <zhaoxiu.zeng@gmail.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ming Lin <ming.l@ssi.samsung.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull VFS splice updates from Al Viro:
"There's a bunch of branches this cycle, both mine and from other folks
and I'd rather send pull requests separately.
This one is the conversion of ->splice_read() to ITER_PIPE iov_iter
(and introduction of such). Gets rid of a lot of code in fs/splice.c
and elsewhere; there will be followups, but these are for the next
cycle... Some pipe/splice-related cleanups from Miklos in the same
branch as well"
* 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
pipe: fix comment in pipe_buf_operations
pipe: add pipe_buf_steal() helper
pipe: add pipe_buf_confirm() helper
pipe: add pipe_buf_release() helper
pipe: add pipe_buf_get() helper
relay: simplify relay_file_read()
switch default_file_splice_read() to use of pipe-backed iov_iter
switch generic_file_splice_read() to use of ->read_iter()
new iov_iter flavour: pipe-backed
fuse_dev_splice_read(): switch to add_to_pipe()
skb_splice_bits(): get rid of callback
new helper: add_to_pipe()
splice: lift pipe_lock out of splice_to_pipe()
splice: switch get_iovec_page_array() to iov_iter
splice_to_pipe(): don't open-code wakeup_pipe_readers()
consistent treatment of EFAULT on O_DIRECT read/write
Pull block layer updates from Jens Axboe:
"This is the main pull request for block layer changes in 4.9.
As mentioned at the last merge window, I've changed things up and now
do just one branch for core block layer changes, and driver changes.
This avoids dependencies between the two branches. Outside of this
main pull request, there are two topical branches coming as well.
This pull request contains:
- A set of fixes, and a conversion to blk-mq, of nbd. From Josef.
- Set of fixes and updates for lightnvm from Matias, Simon, and Arnd.
Followup dependency fix from Geert.
- General fixes from Bart, Baoyou, Guoqing, and Linus W.
- CFQ async write starvation fix from Glauber.
- Add supprot for delayed kick of the requeue list, from Mike.
- Pull out the scalable bitmap code from blk-mq-tag.c and make it
generally available under the name of sbitmap. Only blk-mq-tag uses
it for now, but the blk-mq scheduling bits will use it as well.
From Omar.
- bdev thaw error progagation from Pierre.
- Improve the blk polling statistics, and allow the user to clear
them. From Stephen.
- Set of minor cleanups from Christoph in block/blk-mq.
- Set of cleanups and optimizations from me for block/blk-mq.
- Various nvme/nvmet/nvmeof fixes from the various folks"
* 'for-4.9/block' of git://git.kernel.dk/linux-block: (54 commits)
fs/block_dev.c: return the right error in thaw_bdev()
nvme: Pass pointers, not dma addresses, to nvme_get/set_features()
nvme/scsi: Remove power management support
nvmet: Make dsm number of ranges zero based
nvmet: Use direct IO for writes
admin-cmd: Added smart-log command support.
nvme-fabrics: Add host_traddr options field to host infrastructure
nvme-fabrics: revise host transport option descriptions
nvme-fabrics: rework nvmf_get_address() for variable options
nbd: use BLK_MQ_F_BLOCKING
blkcg: Annotate blkg_hint correctly
cfq: fix starvation of asynchronous writes
blk-mq: add flag for drivers wanting blocking ->queue_rq()
blk-mq: remove non-blocking pass in blk_mq_map_request
blk-mq: get rid of manual run of queue with __blk_mq_run_hw_queue()
block: export bio_free_pages to other modules
lightnvm: propagate device_add() error code
lightnvm: expose device geometry through sysfs
lightnvm: control life of nvm_dev in driver
blk-mq: register device instead of disk
...
Pull MD updates from Shaohua Li:
"This update includes:
- new AVX512 instruction based raid6 gen/recovery algorithm
- a couple of md-cluster related bug fixes
- fix a potential deadlock
- set nonrotational bit for raid array with SSD
- set correct max_hw_sectors for raid5/6, which hopefuly can improve
performance a little bit
- other minor fixes"
* tag 'md/4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
md: set rotational bit
raid6/test/test.c: bug fix: Specify aligned(alignment) attributes to the char arrays
raid5: handle register_shrinker failure
raid5: fix to detect failure of register_shrinker
md: fix a potential deadlock
md/bitmap: fix wrong cleanup
raid5: allow arbitrary max_hw_sectors
lib/raid6: Add AVX512 optimized xor_syndrome functions
lib/raid6/test/Makefile: Add avx512 gen_syndrome and recovery functions
lib/raid6: Add AVX512 optimized recovery functions
lib/raid6: Add AVX512 optimized gen_syndrome functions
md-cluster: make resync lock also could be interruptted
md-cluster: introduce dlm_lock_sync_interruptible to fix tasks hang
md-cluster: convert the completion to wait queue
md-cluster: protect md_find_rdev_nr_rcu with rcu lock
md-cluster: clean related infos of cluster
md: changes for MD_STILL_CLOSED flag
md-cluster: remove some unnecessary dlm_unlock_sync
md-cluster: use FORCEUNLOCK in lockres_free
md-cluster: call md_kick_rdev_from_array once ack failed
This is bit large pile of code which bring in some nice additions:
- Error reporting: we have added a new mechanism for users of dmaenegine to
register a callback_result which tells them the result of the dma
transaction. Right now only one user ntb is using it.
- As we discussed on KS mailing list and pointed out NO_IRQ has no place in
kernel, this also remove NO_IRQ from dmaengine subsystem (both arm and
ppc users)
- Support for IOMMU slave transfers and it implementation for arm.
- To get better build coverage, enable COMPILE_TEST for bunch of driver,
and fix the warning and sparse complaints on these.
- Apart from above, usual updates spread across drivers.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJX9cGYAAoJEHwUBw8lI4NHXh0P/3OsctPYnwcangOz268hHDap
7ZHwau96K7DRi8cFCc0XmG083Ivqih/fWMFJBUOEsuwS3zPHkfgfhsvm7MqrK3vv
psJIwnubwTVVQ3lePYJlnna6mijcRNXVAooRLiqylA3QPIYRxECDFVDRNwf39D+I
bYp5tmlFcobugOUUoMqq1D/gH8EHUWxrnrsS6UBBpYm+cusc6u9/JXlOb4pcJGSL
V340zQ0S9FNuEM3b+1kMAeq3DG2wLXv9oJzz/6EN59sx5AdjlYUPHd/PvTYOeG0T
crdtDfL+7xcqP0Ms4SGTOD4kXSe6nErr3bIBHQXI6ZmJn0j//+3yU21kTMl95kM+
RM7nE4vItuQR0jPxVlhuLCcf3q7zMi+noOPZ1DVRTE1Yf9AizAgbPXyOE+jzGUUi
6E+0Mj6CLpFH/Mffxphs7L6GKwfWqaLjAupbjR6EWZud37KAwvpcB1CkJEgT9C4s
OiZ4INTPxXmw9dX/T9CPOyh8oZ8mB9LTUzHoJDvDGuwYm7HE0U9pzHG4bP0mjIIt
y3RboP78t1HC9oZUrxCoGhvekJtok0k3RLGJTSx9ujklY9MJGG/F1KEC6APp5tXu
0UToMXpgXSUkKEZesmsJFj/lbh1+h/yo5zTG5Hek8lh1K0sczaoWu3xTTSY9SSZQ
ihlqyvdzSBweKo8ktU8A
=9iA3
-----END PGP SIGNATURE-----
Merge tag 'dmaengine-4.9-rc1' of git://git.infradead.org/users/vkoul/slave-dma
Pull dmaengine updates from Vinod Koul:
"This is bit large pile of code which bring in some nice additions:
- Error reporting: we have added a new mechanism for users of
dmaenegine to register a callback_result which tells them the
result of the dma transaction. Right now only one user (ntb) is
using it.
- As we discussed on KS mailing list and pointed out NO_IRQ has no
place in kernel, this also remove NO_IRQ from dmaengine subsystem
(both arm and ppc users)
- Support for IOMMU slave transfers and its implementation for arm.
- To get better build coverage, enable COMPILE_TEST for bunch of
driver, and fix the warning and sparse complaints on these.
- Apart from above, usual updates spread across drivers"
* tag 'dmaengine-4.9-rc1' of git://git.infradead.org/users/vkoul/slave-dma: (169 commits)
async_pq_val: fix DMA memory leak
dmaengine: virt-dma: move function declarations
dmaengine: omap-dma: Enable burst and data pack for SG
DT: dmaengine: rcar-dmac: document R8A7743/5 support
dmaengine: fsldma: Unmap region obtained by of_iomap
dmaengine: jz4780: fix resource leaks on error exit return
dma-debug: fix ia64 build, use PHYS_PFN
dmaengine: coh901318: fix integer overflow when shifting more than 32 places
dmaengine: edma: avoid uninitialized variable use
dma-mapping: fix m32r build warning
dma-mapping: fix ia64 build, use PHYS_PFN
dmaengine: ti-dma-crossbar: enable COMPILE_TEST
dmaengine: omap-dma: enable COMPILE_TEST
dmaengine: edma: enable COMPILE_TEST
dmaengine: ti-dma-crossbar: Fix of_device_id data parameter usage
dmaengine: ti-dma-crossbar: Correct type for of_find_property() third parameter
dmaengine/ARM: omap-dma: Fix the DMAengine compile test on non OMAP configs
dmaengine: edma: Rename set_bits and remove unused clear_bits helper
dmaengine: edma: Use correct type for of_find_property() third parameter
dmaengine: edma: Fix of_device_id data parameter usage (legacy vs TPCC)
...
iov_iter variant for passing data into pipe. copy_to_iter()
copies data into page(s) it has allocated and stuffs them into
the pipe; copy_page_to_iter() stuffs there a reference to the
page given to it. Both will try to coalesce if possible.
iov_iter_zero() is similar to copy_to_iter(); iov_iter_get_pages()
and friends will do as copy_to_iter() would have and return the
pages where the data would've been copied. iov_iter_advance()
will truncate everything past the spot it has advanced to.
New primitive: iov_iter_pipe(), used for initializing those.
pipe should be locked all along.
Running out of space acts as fault would for iovec-backed ones;
in other words, giving it to ->read_iter() may result in short
read if the pipe overflows, or -EFAULT if it happens with nothing
copied there.
In other words, ->read_iter() on those acts pretty much like
->splice_read(). Moreover, all generic_file_splice_read() users,
as well as many other ->splice_read() instances can be switched
to that scheme - that'll happen in the next commit.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull networking updates from David Miller:
1) BBR TCP congestion control, from Neal Cardwell, Yuchung Cheng and
co. at Google. https://lwn.net/Articles/701165/
2) Do TCP Small Queues for retransmits, from Eric Dumazet.
3) Support collect_md mode for all IPV4 and IPV6 tunnels, from Alexei
Starovoitov.
4) Allow cls_flower to classify packets in ip tunnels, from Amir Vadai.
5) Support DSA tagging in older mv88e6xxx switches, from Andrew Lunn.
6) Support GMAC protocol in iwlwifi mwm, from Ayala Beker.
7) Support ndo_poll_controller in mlx5, from Calvin Owens.
8) Move VRF processing to an output hook and allow l3mdev to be
loopback, from David Ahern.
9) Support SOCK_DESTROY for UDP sockets. Also from David Ahern.
10) Congestion control in RXRPC, from David Howells.
11) Support geneve RX offload in ixgbe, from Emil Tantilov.
12) When hitting pressure for new incoming TCP data SKBs, perform a
partial rathern than a full purge of the OFO queue (which could be
huge). From Eric Dumazet.
13) Convert XFRM state and policy lookups to RCU, from Florian Westphal.
14) Support RX network flow classification to igb, from Gangfeng Huang.
15) Hardware offloading of eBPF in nfp driver, from Jakub Kicinski.
16) New skbmod packet action, from Jamal Hadi Salim.
17) Remove some inefficiencies in snmp proc output, from Jia He.
18) Add FIB notifications to properly propagate route changes to
hardware which is doing forwarding offloading. From Jiri Pirko.
19) New dsa driver for qca8xxx chips, from John Crispin.
20) Implement RFC7559 ipv6 router solicitation backoff, from Maciej
Żenczykowski.
21) Add L3 mode to ipvlan, from Mahesh Bandewar.
22) Support 802.1ad in mlx4, from Moshe Shemesh.
23) Support hardware LRO in mediatek driver, from Nelson Chang.
24) Add TC offloading to mlx5, from Or Gerlitz.
25) Convert various drivers to ethtool ksettings interfaces, from
Philippe Reynes.
26) TX max rate limiting for cxgb4, from Rahul Lakkireddy.
27) NAPI support for ath10k, from Rajkumar Manoharan.
28) Support XDP in mlx5, from Rana Shahout and Saeed Mahameed.
29) UDP replicast support in TIPC, from Richard Alpe.
30) Per-queue statistics for qed driver, from Sudarsana Reddy Kalluru.
31) Support BQL in thunderx driver, from Sunil Goutham.
32) TSO support in alx driver, from Tobias Regnery.
33) Add stream parser engine and use it in kcm.
34) Support async DHCP replies in ipconfig module, from Uwe
Kleine-König.
35) DSA port fast aging for mv88e6xxx driver, from Vivien Didelot.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1715 commits)
mlxsw: switchx2: Fix misuse of hard_header_len
mlxsw: spectrum: Fix misuse of hard_header_len
net/faraday: Stop NCSI device on shutdown
net/ncsi: Introduce ncsi_stop_dev()
net/ncsi: Rework the channel monitoring
net/ncsi: Allow to extend NCSI request properties
net/ncsi: Rework request index allocation
net/ncsi: Don't probe on the reserved channel ID (0x1f)
net/ncsi: Introduce NCSI_RESERVED_CHANNEL
net/ncsi: Avoid unused-value build warning from ia64-linux-gcc
net: Add netdev all_adj_list refcnt propagation to fix panic
net: phy: Add Edge-rate driver for Microsemi PHYs.
vmxnet3: Wake queue from reset work
i40e: avoid NULL pointer dereference and recursive errors on early PCI error
qed: Add RoCE ll2 & GSI support
qed: Add support for memory registeration verbs
qed: Add support for QP verbs
qed: PD,PKEY and CQ verb support
qed: Add support for RoCE hw init
qede: Add qedr framework
...
When the underflow checks were added to workingset_node_shadow_dec(),
they triggered immediately:
kernel BUG at ./include/linux/swap.h:276!
invalid opcode: 0000 [#1] SMP
Modules linked in: isofs usb_storage fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6
soundcore wmi acpi_als pinctrl_sunrisepoint kfifo_buf tpm_tis industrialio acpi_pad pinctrl_intel tpm_tis_core tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt
CPU: 0 PID: 20929 Comm: blkid Not tainted 4.8.0-rc8-00087-gbe67d60ba944 #1
Hardware name: System manufacturer System Product Name/Z170-K, BIOS 1803 05/06/2016
task: ffff8faa93ecd940 task.stack: ffff8faa7f478000
RIP: page_cache_tree_insert+0xf1/0x100
Call Trace:
__add_to_page_cache_locked+0x12e/0x270
add_to_page_cache_lru+0x4e/0xe0
mpage_readpages+0x112/0x1d0
blkdev_readpages+0x1d/0x20
__do_page_cache_readahead+0x1ad/0x290
force_page_cache_readahead+0xaa/0x100
page_cache_sync_readahead+0x3f/0x50
generic_file_read_iter+0x5af/0x740
blkdev_read_iter+0x35/0x40
__vfs_read+0xe1/0x130
vfs_read+0x96/0x130
SyS_read+0x55/0xc0
entry_SYSCALL_64_fastpath+0x13/0x8f
Code: 03 00 48 8b 5d d8 65 48 33 1c 25 28 00 00 00 44 89 e8 75 19 48 83 c4 18 5b 41 5c 41 5d 41 5e 5d c3 0f 0b 41 bd ef ff ff ff eb d7 <0f> 0b e8 88 68 ef ff 0f 1f 84 00
RIP page_cache_tree_insert+0xf1/0x100
This is a long-standing bug in the way shadow entries are accounted in
the radix tree nodes. The shrinker needs to know when radix tree nodes
contain only shadow entries, no pages, so node->count is split in half
to count shadows in the upper bits and pages in the lower bits.
Unfortunately, the radix tree implementation doesn't know of this and
assumes all entries are in node->count. When there is a shadow entry
directly in root->rnode and the tree is later extended, the radix tree
implementation will copy that entry into the new node and and bump its
node->count, i.e. increases the page count bits. Once the shadow gets
removed and we subtract from the upper counter, node->count underflows
and triggers the warning. Afterwards, without node->count reaching 0
again, the radix tree node is leaked.
Limit shadow entries to when we have actual radix tree nodes and can
count them properly. That means we lose the ability to detect refaults
from files that had only the first page faulted in at eviction time.
Fixes: 449dd6984d ("mm: keep page cache radix tree nodes in check")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-and-tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull s390 updates from Martin Schwidefsky:
"The new features and main improvements in this merge for v4.9
- Support for the UBSAN sanitizer
- Set HAVE_EFFICIENT_UNALIGNED_ACCESS, it improves the code in some
places
- Improvements for the in-kernel fpu code, in particular the overhead
for multiple consecutive in kernel fpu users is recuded
- Add a SIMD implementation for the RAID6 gen and xor operations
- Add RAID6 recovery based on the XC instruction
- The PCI DMA flush logic has been improved to increase the speed of
the map / unmap operations
- The time synchronization code has seen some updates
And bug fixes all over the place"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (48 commits)
s390/con3270: fix insufficient space padding
s390/con3270: fix use of uninitialised data
MAINTAINERS: update DASD maintainer
s390/cio: fix accidental interrupt enabling during resume
s390/dasd: add missing \n to end of dev_err messages
s390/config: Enable config options for Docker
s390/dasd: make query host access interruptible
s390/dasd: fix panic during offline processing
s390/dasd: fix hanging offline processing
s390/pci_dma: improve lazy flush for unmap
s390/pci_dma: split dma_update_trans
s390/pci_dma: improve map_sg
s390/pci_dma: simplify dma address calculation
s390/pci_dma: remove dma address range check
iommu/s390: simplify registration of I/O address translation parameters
s390: migrate exception table users off module.h and onto extable.h
s390: export header for CLP ioctl
s390/vmur: fix irq pointer dereference in int handler
s390/dasd: add missing KOBJ_CHANGE event for unformatted devices
s390: enable UBSAN
...
Pull CPU hotplug updates from Thomas Gleixner:
"Yet another batch of cpu hotplug core updates and conversions:
- Provide core infrastructure for multi instance drivers so the
drivers do not have to keep custom lists.
- Convert custom lists to the new infrastructure. The block-mq custom
list conversion comes through the block tree and makes the diffstat
tip over to more lines removed than added.
- Handle unbalanced hotplug enable/disable calls more gracefully.
- Remove the obsolete CPU_STARTING/DYING notifier support.
- Convert another batch of notifier users.
The relayfs changes which conflicted with the conversion have been
shipped to me by Andrew.
The remaining lot is targeted for 4.10 so that we finally can remove
the rest of the notifiers"
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
cpufreq: Fix up conversion to hotplug state machine
blk/mq: Reserve hotplug states for block multiqueue
x86/apic/uv: Convert to hotplug state machine
s390/mm/pfault: Convert to hotplug state machine
mips/loongson/smp: Convert to hotplug state machine
mips/octeon/smp: Convert to hotplug state machine
fault-injection/cpu: Convert to hotplug state machine
padata: Convert to hotplug state machine
cpufreq: Convert to hotplug state machine
ACPI/processor: Convert to hotplug state machine
virtio scsi: Convert to hotplug state machine
oprofile/timer: Convert to hotplug state machine
block/softirq: Convert to hotplug state machine
lib/irq_poll: Convert to hotplug state machine
x86/microcode: Convert to hotplug state machine
sh/SH-X3 SMP: Convert to hotplug state machine
ia64/mca: Convert to hotplug state machine
ARM/OMAP/wakeupgen: Convert to hotplug state machine
ARM/shmobile: Convert to hotplug state machine
arm64/FP/SIMD: Convert to hotplug state machine
...
Pull low-level x86 updates from Ingo Molnar:
"In this cycle this topic tree has become one of those 'super topics'
that accumulated a lot of changes:
- Add CONFIG_VMAP_STACK=y support to the core kernel and enable it on
x86 - preceded by an array of changes. v4.8 saw preparatory changes
in this area already - this is the rest of the work. Includes the
thread stack caching performance optimization. (Andy Lutomirski)
- switch_to() cleanups and all around enhancements. (Brian Gerst)
- A large number of dumpstack infrastructure enhancements and an
unwinder abstraction. The secret long term plan is safe(r) live
patching plus maybe another attempt at debuginfo based unwinding -
but all these current bits are standalone enhancements in a frame
pointer based debug environment as well. (Josh Poimboeuf)
- More __ro_after_init and const annotations. (Kees Cook)
- Enable KASLR for the vmemmap memory region. (Thomas Garnier)"
[ The virtually mapped stack changes are pretty fundamental, and not
x86-specific per se, even if they are only used on x86 right now. ]
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
x86/asm: Get rid of __read_cr4_safe()
thread_info: Use unsigned long for flags
x86/alternatives: Add stack frame dependency to alternative_call_2()
x86/dumpstack: Fix show_stack() task pointer regression
x86/dumpstack: Remove dump_trace() and related callbacks
x86/dumpstack: Convert show_trace_log_lvl() to use the new unwinder
oprofile/x86: Convert x86_backtrace() to use the new unwinder
x86/stacktrace: Convert save_stack_trace_*() to use the new unwinder
perf/x86: Convert perf_callchain_kernel() to use the new unwinder
x86/unwind: Add new unwind interface and implementations
x86/dumpstack: Remove NULL task pointer convention
fork: Optimize task creation by caching two thread stacks per CPU if CONFIG_VMAP_STACK=y
sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK
lib/syscall: Pin the task stack in collect_syscall()
x86/process: Pin the target stack in get_wchan()
x86/dumpstack: Pin the target stack when dumping it
kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function
sched/core: Add try_get_task_stack() and put_task_stack()
x86/entry/64: Fix a minor comment rebase error
iommu/amd: Don't put completion-wait semaphore on stack
...
Pull EFI updates from Ingo Molnar:
"Main changes in this cycle were:
- Refactor the EFI memory map code into architecture neutral files
and allow drivers to permanently reserve EFI boot services regions
on x86, as well as ARM/arm64. (Matt Fleming)
- Add ARM support for the EFI ESRT driver. (Ard Biesheuvel)
- Make the EFI runtime services and efivar API interruptible by
swapping spinlocks for semaphores. (Sylvain Chouleur)
- Provide the EFI identity mapping for kexec which allows kexec to
work on SGI/UV platforms with requiring the "noefi" kernel command
line parameter. (Alex Thorlton)
- Add debugfs node to dump EFI page tables on arm64. (Ard Biesheuvel)
- Merge the EFI test driver being carried out of tree until now in
the FWTS project. (Ivan Hu)
- Expand the list of flags for classifying EFI regions as "RAM" on
arm64 so we align with the UEFI spec. (Ard Biesheuvel)
- Optimise out the EFI mixed mode if it's unsupported (CONFIG_X86_32)
or disabled (CONFIG_EFI_MIXED=n) and switch the early EFI boot
services function table for direct calls, alleviating us from
having to maintain the custom function table. (Lukas Wunner)
- Miscellaneous cleanups and fixes"
* 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
x86/efi: Round EFI memmap reservations to EFI_PAGE_SIZE
x86/efi: Allow invocation of arbitrary boot services
x86/efi: Optimize away setup_gop32/64 if unused
x86/efi: Use kmalloc_array() in efi_call_phys_prolog()
efi/arm64: Treat regions with WT/WC set but WB cleared as memory
efi: Add efi_test driver for exporting UEFI runtime service interfaces
x86/efi: Defer efi_esrt_init until after memblock_x86_fill
efi/arm64: Add debugfs node to dump UEFI runtime page tables
x86/efi: Remove unused find_bits() function
fs/efivarfs: Fix double kfree() in error path
x86/efi: Map in physical addresses in efi_map_region_fixed
lib/ucs2_string: Speed up ucs2_utf8size()
firmware-gsmi: Delete an unnecessary check before the function call "dma_pool_destroy"
x86/efi: Initialize status to ensure garbage is not returned on small size
efi: Replace runtime services spinlock with semaphore
efi: Don't use spinlocks for efi vars
efi: Use a file local lock for efivars
efi/arm*: esrt: Add missing call to efi_esrt_init()
efi/esrt: Use memremap not ioremap to access ESRT table in memory
x86/efi-bgrt: Use efi_mem_reserve() to avoid copying image data
...
kbuild test robot reports:
lib/dma-debug.c: In function 'debug_dma_map_resource':
>> lib/dma-debug.c:1541:16: error: implicit declaration of function '__phys_to_pfn' [-Werror=implicit-function-declaration]
entry->pfn = __phys_to_pfn(addr);
^~~~~~~~~~~~~
ia64 does not provide __phys_to_pfn(), use the PHYS_PFN() alias.
Fixes: 0e74b34dfc ("dma-debug: add support for resource mappings")
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Signed-off-by: Vinod Koul <vinod.koul@intel.com>
put_cpu_var takes the percpu data, not the data returned from
get_cpu_var.
This doesn't change the behavior.
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
* the only remaining callers of "short" fault-ins are just as happy with generic
variants (both in lib/iov_iter.c); switch them to multipage variants, kill the
"short" ones
* rename the multipage variants to now available plain ones.
* get rid of compat macro defining iov_iter_fault_in_multipage_readable by
expanding it in its only user.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Specifying the aligned attributes to the char data[NDISKS][PAGE_SIZE],
char recovi[PAGE_SIZE] and char recovi[PAGE_SIZE] arrays, so that all
malloc memory is page boundary aligned.
Without these alignment attributes, the test causes a segfault in
userspace when the NDISKS are changed to 4 from 16.
The RAID stripes will be page aligned anyway, so we want to test what
the kernel actually will execute.
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Gayatri Kammela <gayatri.kammela@intel.com>
Reviewed-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
A MMIO mapped resource can not be represented by a struct page so a new
debug type is needed to handle this. This patch add such type and
functionality to add/remove entries and how to translate them to a
physical address.
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Vinod Koul <vinod.koul@intel.com>
The fixes to the radix tree test suite show that the multi-order case is
broken. The basic reason is that the radix tree code uses tagged
pointers with the "internal" bit in the low bits, and calculating the
pointer indices was supposed to mask off those bits. But gcc will
notice that we then use the index to re-create the pointer, and will
avoid doing the arithmetic and use the tagged pointer directly.
This cleans the code up, using the existing is_sibling_entry() helper to
validate the sibling pointer range (instead of open-coding it), and
using entry_to_node() to mask off the low tag bit from the pointer. And
once you do that, you might as well just use the now cleaned-up pointer
directly.
[ Side note: the multi-order code isn't actually ever used in the kernel
right now, and the only reason I didn't just delete all that code is
that Kirill Shutemov piped up and said:
"Well, my ext4-with-huge-pages patchset[1] uses multi-order entries.
It also converts shmem-with-huge-pages and hugetlb to them.
I'm okay with converting it to other mechanism, but I need
something. (I looked into Konstantin's RFC patchset[2]. It looks
okay, but I don't feel myself qualified to review it as I don't
know much about radix-tree internals.)"
[1] http://lkml.kernel.org/r/20160915115523.29737-1-kirill.shutemov@linux.intel.com
[2] http://lkml.kernel.org/r/147230727479.9957.1087787722571077339.stgit@zurg ]
Reported-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Cedric Blancher <cedric.blancher@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the indefinitiley -> indefinitely typo in Kconfig.debug.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160922205513.17821-1-vivien.didelot@savoirfairelinux.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Optimize RAID6 xor_syndrome functions to take advantage of the 512-bit
ZMM integer instructions introduced in AVX512.
AVX512 optimized xor_syndrome functions, which is simply based on sse2.c
written by hpa.
The patch was tested and benchmarked before submission on
a hardware that has AVX512 flags to support such instructions
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jim Kukunas <james.t.kukunas@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Megha Dey <megha.dey@linux.intel.com>
Signed-off-by: Gayatri Kammela <gayatri.kammela@intel.com>
Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Adding avx512 gen_syndrome and recovery functions so as to allow code to
be compiled and tested successfully in userspace.
This patch is tested in userspace and improvement in performace is
observed.
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jim Kukunas <james.t.kukunas@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
Signed-off-by: Gayatri Kammela <gayatri.kammela@intel.com>
Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Optimize RAID6 recovery functions to take advantage of
the 512-bit ZMM integer instructions introduced in AVX512.
AVX512 optimized recovery functions, which is simply based
on recov_avx2.c written by Jim Kukunas
This patch was tested and benchmarked before submission on
a hardware that has AVX512 flags to support such instructions
Cc: Jim Kukunas <james.t.kukunas@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
Signed-off-by: Gayatri Kammela <gayatri.kammela@intel.com>
Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Optimize RAID6 gen_syndrom functions to take advantage of
the 512-bit ZMM integer instructions introduced in AVX512.
AVX512 optimized gen_syndrom functions, which is simply based
on avx2.c written by Yuanhan Liu and sse2.c written by hpa.
The patch was tested and benchmarked before submission on
a hardware that has AVX512 flags to support such instructions
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jim Kukunas <james.t.kukunas@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
Signed-off-by: Gayatri Kammela <gayatri.kammela@intel.com>
Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
This commit introduces a generic library to estimate either the min or
max value of a time-varying variable over a recent time window. This
is code originally from Kathleen Nichols. The current form of the code
is from Van Jacobson.
A single struct minmax_sample will track the estimated windowed-max
value of the series if you call minmax_running_max() or the estimated
windowed-min value of the series if you call minmax_running_min().
Nearly equivalent code is already in place for minimum RTT estimation
in the TCP stack. This commit extracts that code and generalizes it to
handle both min and max. Moving the code here reduces the footprint
and complexity of the TCP code base and makes the filter generally
available for other parts of the codebase, including an upcoming TCP
congestion control module.
This library works well for time series where the measurements are
smoothly increasing or decreasing.
Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some architectures use a hardware defined structure at address zero.
Checking for a null pointer will result in many ubsan reports.
Allow users to disable the null sanitizer.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The insecure_elasticity setting is an ugly wart brought out by
users who need to insert duplicate objects (that is, distinct
objects with identical keys) into the same table.
In fact, those users have a much bigger problem. Once those
duplicate objects are inserted, they don't have an interface to
find them (unless you count the walker interface which walks
over the entire table).
Some users have resorted to doing a manual walk over the hash
table which is of course broken because they don't handle the
potential existence of multiple hash tables. The result is that
they will break sporadically when they encounter a hash table
resize/rehash.
This patch provides a way out for those users, at the expense
of an extra pointer per object. Essentially each object is now
a list of objects carrying the same key. The hash table will
only see the lists so nothing changes as far as rhashtable is
concerned.
To use this new interface, you need to insert a struct rhlist_head
into your objects instead of struct rhash_head. While the hash
table is unchanged, for type-safety you'll need to use struct
rhltable instead of struct rhashtable. All the existing interfaces
have been duplicated for rhlist, including the hash table walker.
One missing feature is nulls marking because AFAIK the only potential
user of it does not need duplicate objects. Should anyone need
this it shouldn't be too hard to add.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Install the callbacks via the state machine.
This is just a temporary vehicle to keep the interface working for now,
It'll be replaced by the sysfs interface which allows to step through the
hotplug state machine step by step.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160906170457.32393-15-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Variable weight is not being initialized to zero before it is
used to compute the weight sum. Ensure it is initialized to zero.
Found with static analysis with cppcheck:
[lib/sbitmap.c:177]: (error) Uninitialized variable: weight
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
... by turning it into what used to be multipages counterpart
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we have a bunch of high-numbered bits allocated and then we resize
the struct sbitmap_queue, when those bits get cleared, we'll update the
hint and then have to re-randomize it repeatedly. Avoid that by checking
that the cleared bit is still a valid hint. No measurable performance
difference in the common case.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Now that workqueue can handle work item queueing from very early
during boot, there is no need to gate schedule_work() while
!keventd_up(). Remove it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
After a struct sbitmap_queue is resized smaller, the allocation hints
may still be set to bits beyond the new depth of the bitmap. This means
that, for example, if the number of blk-mq tags is reduced through
sysfs, more requests than the nominal queue depth may be in flight.
It's tempting to fix this at resize time by doing a one-time
reinitialization of the hints, but this can race with
__sbitmap_queue_get() updating the hint. Instead, check the hint before
we use it. This caused no measurable performance difference in my
synthetic benchmarks.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
In order to get good cache behavior from a sbitmap, we want each CPU to
stick to its own cacheline(s) as much as possible. This might happen
naturally as the bitmap gets filled up and the alloc_hint values spread
out, but we really want this behavior from the start. blk-mq apparently
intended to do this, but the code to do this was never wired up. Get rid
of the dead code and make it part of the sbitmap library.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Again, there's no point in passing this in every time. Make it part of
struct sbitmap_queue and clean up the API.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Allocating your own per-cpu allocation hint separately makes for an
awkward API. Instead, allocate the per-cpu hint as part of the struct
sbitmap_queue. There's no point for a struct sbitmap_queue without the
cache, but you can still use a bare struct sbitmap.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The original bt_alloc() we converted from was using kzalloc(), not
kzalloc_node(), to allocate the wait queues. This was probably an
oversight, so fix it for sbitmap_queue_init_node().
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This is a generally useful data structure, so make it available to
anyone else who might want to use it. It's also a nice cleanup
separating the allocation logic from the rest of the tag handling logic.
The code is behind a new Kconfig option, CONFIG_SBITMAP, which is only
selected by CONFIG_BLOCK for now.
This should be a complete noop functionality-wise.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This will avoid a potential read-after-free if collect_syscall()
(e.g. /proc/PID/syscall) is called on an exiting task.
Reported-by: Jann Horn <jann@thejh.net>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/0bfd8e6d4729c97745d3781a29610a33d0a8091d.1474003868.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit d5709f7ab7 ("flow_dissector: For stripped vlan, get vlan
info from skb->vlan_tci") made flow dissector look at vlan_proto
when vlan is present. Since test_bpf sets skb->vlan_tci to ~0
(including VLAN_TAG_PRESENT) we have to populate skb->vlan_proto.
Fixes false negative on test #24:
test_bpf: #24 LD_PAYLOAD_OFF jited:0 175 ret 0 != 42 FAIL (1 times)
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/mediatek/mtk_eth_soc.c
drivers/net/ethernet/qlogic/qed/qed_dcbx.c
drivers/net/phy/Kconfig
All conflicts were cases of overlapping commits.
Signed-off-by: David S. Miller <davem@davemloft.net>
No need to calculate the string length on every loop iteration.
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: Peter Jones <pjones@redhat.com>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next
tree. Most relevant updates are the removal of per-conntrack timers to
use a workqueue/garbage collection approach instead from Florian
Westphal, the hash and numgen expression for nf_tables from Laura
Garcia, updates on nf_tables hash set to honor the NLM_F_EXCL flag,
removal of ip_conntrack sysctl and many other incremental updates on our
Netfilter codebase.
More specifically, they are:
1) Retrieve only 4 bytes to fetch ports in case of non-linear skb
transport area in dccp, sctp, tcp, udp and udplite protocol
conntrackers, from Gao Feng.
2) Missing whitespace on error message in physdev match, from Hangbin Liu.
3) Skip redundant IPv4 checksum calculation in nf_dup_ipv4, from Liping Zhang.
4) Add nf_ct_expires() helper function and use it, from Florian Westphal.
5) Replace opencoded nf_ct_kill() call in IPVS conntrack support, also
from Florian.
6) Rename nf_tables set implementation to nft_set_{name}.c
7) Introduce the hash expression to allow arbitrary hashing of selector
concatenations, from Laura Garcia Liebana.
8) Remove ip_conntrack sysctl backward compatibility code, this code has
been around for long time already, and we have two interfaces to do
this already: nf_conntrack sysctl and ctnetlink.
9) Use nf_conntrack_get_ht() helper function whenever possible, instead
of opencoding fetch of hashtable pointer and size, patch from Liping Zhang.
10) Add quota expression for nf_tables.
11) Add number generator expression for nf_tables, this supports
incremental and random generators that can be combined with maps,
very useful for load balancing purpose, again from Laura Garcia Liebana.
12) Fix a typo in a debug message in FTP conntrack helper, from Colin Ian King.
13) Introduce a nft_chain_parse_hook() helper function to parse chain hook
configuration, this is used by a follow up patch to perform better chain
update validation.
14) Add rhashtable_lookup_get_insert_key() to rhashtable and use it from the
nft_set_hash implementation to honor the NLM_F_EXCL flag.
15) Missing nulls check in nf_conntrack from nf_conntrack_tuple_taken(),
patch from Florian Westphal.
16) Don't use the DYING bit to know if the conntrack event has been already
delivered, instead a state variable to track event re-delivery
states, also from Florian.
17) Remove the per-conntrack timer, use the workqueue approach that was
discussed during the NFWS, from Florian Westphal.
18) Use the netlink conntrack table dump path to kill stale entries,
again from Florian.
19) Add a garbage collector to get rid of stale conntracks, from
Florian.
20) Reschedule garbage collector if eviction rate is high.
21) Get rid of the __nf_ct_kill_acct() helper.
22) Use ARPHRD_ETHER instead of hardcoded 1 from ARP logger.
23) Make nf_log_set() interface assertive on unsupported families.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Some versions of gcc don't like tests for the value of an undefined
preprocessor symbol, even in the #else branch of an #ifndef:
lib/test_hash.c:224:7: warning: "HAVE_ARCH__HASH_32" is not defined [-Wundef]
#elif HAVE_ARCH__HASH_32 != 1
^
lib/test_hash.c:229:7: warning: "HAVE_ARCH_HASH_32" is not defined [-Wundef]
#elif HAVE_ARCH_HASH_32 != 1
^
lib/test_hash.c:234:7: warning: "HAVE_ARCH_HASH_64" is not defined [-Wundef]
#elif HAVE_ARCH_HASH_64 != 1
^
Seen with gcc 4.9, not seen with 4.1.2.
Change the logic to only check the value inside an #ifdef to fix this.
Fixes: 468a942852 ("<linux/hash.h>: Add support for architecture-specific functions")
Link: http://lkml.kernel.org/r/20160829214952.1334674-4-arnd@arndb.de
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: George Spelvin <linux@sciencehorizons.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The XC instruction can be used to improve the speed of the raid6
recovery. The loops now operate on blocks of 256 bytes.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
There are three usercopy warnings which are currently being silenced for
gcc 4.6 and newer:
1) "copy_from_user() buffer size is too small" compile warning/error
This is a static warning which happens when object size and copy size
are both const, and copy size > object size. I didn't see any false
positives for this one. So the function warning attribute seems to
be working fine here.
Note this scenario is always a bug and so I think it should be
changed to *always* be an error, regardless of
CONFIG_DEBUG_STRICT_USER_COPY_CHECKS.
2) "copy_from_user() buffer size is not provably correct" compile warning
This is another static warning which happens when I enable
__compiletime_object_size() for new compilers (and
CONFIG_DEBUG_STRICT_USER_COPY_CHECKS). It happens when object size
is const, but copy size is *not*. In this case there's no way to
compare the two at build time, so it gives the warning. (Note the
warning is a byproduct of the fact that gcc has no way of knowing
whether the overflow function will be called, so the call isn't dead
code and the warning attribute is activated.)
So this warning seems to only indicate "this is an unusual pattern,
maybe you should check it out" rather than "this is a bug".
I get 102(!) of these warnings with allyesconfig and the
__compiletime_object_size() gcc check removed. I don't know if there
are any real bugs hiding in there, but from looking at a small
sample, I didn't see any. According to Kees, it does sometimes find
real bugs. But the false positive rate seems high.
3) "Buffer overflow detected" runtime warning
This is a runtime warning where object size is const, and copy size >
object size.
All three warnings (both static and runtime) were completely disabled
for gcc 4.6 with the following commit:
2fb0815c9e ("gcc4: disable __compiletime_object_size for GCC 4.6+")
That commit mistakenly assumed that the false positives were caused by a
gcc bug in __compiletime_object_size(). But in fact,
__compiletime_object_size() seems to be working fine. The false
positives were instead triggered by #2 above. (Though I don't have an
explanation for why the warnings supposedly only started showing up in
gcc 4.6.)
So remove warning #2 to get rid of all the false positives, and re-enable
warnings #1 and #3 by reverting the above commit.
Furthermore, since #1 is a real bug which is detected at compile time,
upgrade it to always be an error.
Having done all that, CONFIG_DEBUG_STRICT_USER_COPY_CHECKS is no longer
needed.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If vmalloc() was successful, do not attempt a kmalloc_array()
Fixes: 4cf0b354d9 ("rhashtable: avoid large lock-array allocations")
Reported-by: CAI Qian <caiqian@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: CAI Qian <caiqian@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch modifies __rhashtable_insert_fast() so it returns the
existing object that clashes with the one that you want to insert.
In case the object is successfully inserted, NULL is returned.
Otherwise, you get an error via ERR_PTR().
This patch adapts the existing callers of __rhashtable_insert_fast()
so they handle this new logic, and it adds a new
rhashtable_lookup_get_insert_key() interface to fetch this existing
object.
nf_tables needs this change to improve handling of EEXIST cases via
honoring the NLM_F_EXCL flag and by checking if the data part of the
mapping matches what we have.
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
If we're using CONFIG_VMAP_STACK=y and we manage to point an sg entry
at the stack, then either the sg page will be in highmem or sg_virt()
will return the direct-map alias. In neither case will the existing
check_for_stack() implementation realize that it's a stack page.
Fix it by explicitly checking for stack pages.
This has no effect by itself. It's broken out for ease of review.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/448460622731312298bf19dcbacb1606e75de7a9.1470907718.git.luto@kernel.org
[ Minor edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The commit 8f6fd83c6c ("rhashtable:
accept GFP flags in rhashtable_walk_init") added a GFP flag argument
to rhashtable_walk_init because some users wish to use the walker
in an unsleepable context.
In fact we don't need to allocate memory in rhashtable_walk_init
at all. The walker is always paired with an iterator so we could
just stash ourselves there.
This patch does that by introducing a new enter function to replace
the existing init function. This way we don't have to churn all
the existing users again.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Buffers powersave frame test is reversed in cfg80211, fix from Felix
Fietkau.
2) Remove bogus WARN_ON in openvswitch, from Jarno Rajahalme.
3) Fix some tg3 ethtool logic bugs, and one that would cause no
interrupts to be generated when rx-coalescing is set to 0. From
Satish Baddipadige and Siva Reddy Kallam.
4) QLCNIC mailbox corruption and napi budget handling fix from Manish
Chopra.
5) Fix fib_trie logic when walking the trie during /proc/net/route
output than can access a stale node pointer. From David Forster.
6) Several sctp_diag fixes from Phil Sutter.
7) PAUSE frame handling fixes in mlxsw driver from Ido Schimmel.
8) Checksum fixup fixes in bpf from Daniel Borkmann.
9) Memork leaks in nfnetlink, from Liping Zhang.
10) Use after free in rxrpc, from David Howells.
11) Use after free in new skb_array code of macvtap driver, from Jason
Wang.
12) Calipso resource leak, from Colin Ian King.
13) mediatek bug fixes (missing stats sync init, etc.) from Sean Wang.
14) Fix bpf non-linear packet write helpers, from Daniel Borkmann.
15) Fix lockdep splats in macsec, from Sabrina Dubroca.
16) hv_netvsc bug fixes from Vitaly Kuznetsov, mostly to do with VF
handling.
17) Various tc-action bug fixes, from CONG Wang.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (116 commits)
net_sched: allow flushing tc police actions
net_sched: unify the init logic for act_police
net_sched: convert tcf_exts from list to pointer array
net_sched: move tc offload macros to pkt_cls.h
net_sched: fix a typo in tc_for_each_action()
net_sched: remove an unnecessary list_del()
net_sched: remove the leftover cleanup_a()
mlxsw: spectrum: Allow packets to be trapped from any PG
mlxsw: spectrum: Unmap 802.1Q FID before destroying it
mlxsw: spectrum: Add missing rollbacks in error path
mlxsw: reg: Fix missing op field fill-up
mlxsw: spectrum: Trap loop-backed packets
mlxsw: spectrum: Add missing packet traps
mlxsw: spectrum: Mark port as active before registering it
mlxsw: spectrum: Create PVID vPort before registering netdevice
mlxsw: spectrum: Remove redundant errors from the code
mlxsw: spectrum: Don't return upon error in removal path
i40e: check for and deal with non-contiguous TCs
ixgbe: Re-enable ability to toggle VLAN filtering
ixgbe: Force VLNCTRL.VFE to be set in all VMDq paths
...
Sander reports following splat after netfilter nat bysrc table got
converted to rhashtable:
swapper/0: page allocation failure: order:3, mode:0x2084020(GFP_ATOMIC|__GFP_COMP)
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.8.0-rc1 [..]
[<ffffffff811633ed>] warn_alloc_failed+0xdd/0x140
[<ffffffff811638b1>] __alloc_pages_nodemask+0x3e1/0xcf0
[<ffffffff811a72ed>] alloc_pages_current+0x8d/0x110
[<ffffffff8117cb7f>] kmalloc_order+0x1f/0x70
[<ffffffff811aec19>] __kmalloc+0x129/0x140
[<ffffffff8146d561>] bucket_table_alloc+0xc1/0x1d0
[<ffffffff8146da1d>] rhashtable_insert_rehash+0x5d/0xe0
[<ffffffff819fcfff>] nf_nat_setup_info+0x2ef/0x400
The failure happens when allocating the spinlock array.
Even with GFP_KERNEL its unlikely for such a large allocation
to succeed.
Thomas Graf pointed me at inet_ehash_locks_alloc(), so in addition
to adding NOWARN for atomic allocations this also makes the bucket-array
sizing more conservative.
In commit 095dc8e0c3 ("tcp: fix/cleanup inet_ehash_locks_alloc()"),
Eric Dumazet says: "Budget 2 cache lines per cpu worth of 'spinlocks'".
IOW, consider size needed by a single spinlock when determining
number of locks per cpu. So with 64 byte per cacheline and 4 byte per
spinlock this gives 32 locks per cpu.
Resulting size of the lock-array (sizeof(spinlock) == 4):
cpus: 1 2 4 8 16 32 64
old: 1k 1k 4k 8k 16k 16k 16k
new: 128 256 512 1k 2k 4k 8k
8k allocation should have decent chance of success even
with GFP_ATOMIC, and should not fail with GFP_KERNEL.
With 72-byte spinlock (LOCKDEP):
cpus : 1 2
old: 9k 18k
new: ~2k ~4k
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Suggested-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch targets two things which are related to ->confirm_switch:
1. Init ->confirm_switch pointer with NULL on percpu_ref_init() or
kernel frightfully complains with WARN_ON_ONCE(ref->confirm_switch)
at __percpu_ref_switch_to_atomic if memory chunk was not properly
zeroed.
2. Warn if RCU callback is still in progress on percpu_ref_exit().
The race still exists, because percpu_ref_call_confirm_rcu()
drops ->confirm_switch to NULL early, but that is only a warning
and still the caller is responsible that ref is no longer in
active use. Hopefully that can help to catch incorrect usage
of percpu-refcount.
Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
percpu_ref initially didn't have explicit mode switching operations.
It started out in percpu mode and switched to atomic mode on kill and
then released. Ensuring that kill operation is initiated only after
init completes was naturally the caller's responsibility.
percpu_ref_reinit() was introduced later but it didn't shift the
synchronization responsibility. Reinit can't be performed until kill
is confirmed, so there was nothing to worry about
synchronization-wise. Also, as both reinit and kill manipulate the
base reference, invocations of the same function couldn't be allowed
to race each other.
The latest additions of percpu_ref_switch_to_atomic/percpu() changed
the situation. These two functions can be called any time as long as
the percpu_ref is between init and exit and thus there are valid valid
usage scenarios where these new functions race with each other or
against reinit/kill. Mostly from inertia, f47ad45784 ("percpu_ref:
decouple switching to percpu mode and reinit") still left
synchronization among percpu mode switching operations to its users.
That the new switch functions can be freely mixed with kill/reinit but
the operations themselves should be synchronized is too subtle a
requirement and led to a very subtle race condition in blk-mq freezing
path.
This patch fixes the situation by introducing percpu_ref_switch_lock
to protect mode switching operations. This ensures that percpu-ref
users don't have to worry about mode changing operations racing
against each other, e.g. switch_to_percpu against kill, as long as the
sequence of operations is valid.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Akinobu Mita <akinobu.mita@gmail.com>
Link: http://lkml.kernel.org/g/1443287365-4244-7-git-send-email-akinobu.mita@gmail.com
Fixes: f47ad45784 ("percpu_ref: decouple switching to percpu mode and reinit")
Restructure atomic/percpu mode switching.
* The users of __percpu_ref_switch_to_atomic/percpu() now call a new
function __percpu_ref_switch_mode() which calls either of the
original switching functions depending on the current state of
ref->force_atomic and the __PERCPU_REF_DEAD flag. The callers no
longer check whether switching is necessary but always invoke
__percpu_ref_switch_mode().
* !ref->confirm_switch waiting is collected into
__percpu_ref_switch_mode().
This patch doesn't cause any behavior differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
When an atomic or percpu switching starts before the previous atomic
switching finishes, the taken behaviors are
* If the new atomic switching has confirmation callback, it waits
for the previous atomic switching to complete.
* If the new percpu switching is the first percpu switching following
the previous atomic switching, it waits the previous atomic
switching to complete.
No percpu_ref user depends on these subtleties. The only meaningful
part is that, if the caller ensures that atomic switching isn't in
progress, mode switching operations can be issued from any context.
This patch pulls the wait logic to the top of both switching functions
so that they always wait for the previous atomic switching to
complete. This makes the behavior simpler and consistent for both
directions and will help allowing concurrent invocations of mode
switching functions.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reorganize __percpu_ref_switch_to_atomic() so that it looks
structurally similar to __percpu_ref_switch_to_percpu() and relocate
percpu_ref_switch_to_atomic so that the two internal functions are
co-located.
This patch doesn't introduce any functional differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
At the beginning, percpu_ref guaranteed a RCU grace period between a
call to percpu_ref_kill_and_confirm() and the invocation of the
confirmation callback. This guarantee exposed internal implementation
details and got rescinded while switching over to sched RCU; however,
__percpu_ref_switch_to_atomic() still inserts a full sched RCU grace
period even when it can simply wait for the previous attempt.
Remove the unnecessary grace period and perform the confirmation
synchronously for staggered atomic switching attempts. Update
comments accordingly.
Signed-off-by: Tejun Heo <tj@kernel.org>
When I initially added the unsafe_[get|put]_user() helpers in commit
5b24a7a2aa ("Add 'unsafe' user access functions for batched
accesses"), I made the mistake of modeling the interface on our
traditional __[get|put]_user() functions, which return zero on success,
or -EFAULT on failure.
That interface is fairly easy to use, but it's actually fairly nasty for
good code generation, since it essentially forces the caller to check
the error value for each access.
In particular, since the error handling is already internally
implemented with an exception handler, and we already use "asm goto" for
various other things, we could fairly easily make the error cases just
jump directly to an error label instead, and avoid the need for explicit
checking after each operation.
So switch the interface to pass in an error label, rather than checking
the error value in the caller. Best do it now before we start growing
more users (the signal handling code in particular would be a good place
to use the new interface).
So rather than
if (unsafe_get_user(x, ptr))
... handle error ..
the interface is now
unsafe_get_user(x, ptr, label);
where an error during the user mode fetch will now just cause a jump to
'label' in the caller.
Right now the actual _implementation_ of this all still ends up being a
"if (err) goto label", and does not take advantage of any exception
label tricks, but for "unsafe_put_user()" in particular it should be
fairly straightforward to convert to using the exception table model.
Note that "unsafe_get_user()" is much harder to convert to a clever
exception table model, because current versions of gcc do not allow the
use of "asm goto" (for the exception) with output values (for the actual
value to be fetched). But that is hopefully not a limitation in the
long term.
[ Also note that it might be a good idea to switch unsafe_get_user() to
actually _return_ the value it fetches from user space, but this
commit only changes the error handling semantics ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Looks like a simple copy'n'paste error.
Fixes: 1aa661f5c3 ("rhashtable-test: Measure time to insert, remove & traverse entries")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
Although dynamic debug is often only used for debug builds, sometimes
its enabled for production builds as well. Minimize its impact by using
jump labels. This reduces the text section by 7000+ bytes in the kernel
image below. It does increase data, but this should only be referenced
when changing the direction of the branches, and hence usually not in
cache.
text data bss dec hex filename
8194852 4879776 925696 14000324 d5a0c4 vmlinux.pre
8187337 4960224 925696 14073257 d6bda9 vmlinux.post
Link: http://lkml.kernel.org/r/d165b465e8c89bc582d973758d40be44c33f018b.1467837322.git.jbaron@akamai.com
Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Joe Perches <joe@perches.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The dma-mapping core and the implementations do not change the DMA
attributes passed by pointer. Thus the pointer can point to const data.
However the attributes do not have to be a bitfield. Instead unsigned
long will do fine:
1. This is just simpler. Both in terms of reading the code and setting
attributes. Instead of initializing local attributes on the stack
and passing pointer to it to dma_set_attr(), just set the bits.
2. It brings safeness and checking for const correctness because the
attributes are passed by value.
Semantic patches for this change (at least most of them):
virtual patch
virtual context
@r@
identifier f, attrs;
@@
f(...,
- struct dma_attrs *attrs
+ unsigned long attrs
, ...)
{
...
}
@@
identifier r.f;
@@
f(...,
- NULL
+ 0
)
and
// Options: --all-includes
virtual patch
virtual context
@r@
identifier f, attrs;
type t;
@@
t f(..., struct dma_attrs *attrs);
@@
identifier r.f;
@@
f(...,
- NULL
+ 0
)
Link: http://lkml.kernel.org/r/1468399300-5399-2-git-send-email-k.kozlowski@samsung.com
Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Acked-by: Vineet Gupta <vgupta@synopsys.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no>
Acked-by: Mark Salter <msalter@redhat.com> [c6x]
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com> [cris]
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> [drm]
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Acked-by: Joerg Roedel <jroedel@suse.de> [iommu]
Acked-by: Fabien Dessenne <fabien.dessenne@st.com> [bdisp]
Reviewed-by: Marek Szyprowski <m.szyprowski@samsung.com> [vb2-core]
Acked-by: David Vrabel <david.vrabel@citrix.com> [xen]
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> [xen swiotlb]
Acked-by: Joerg Roedel <jroedel@suse.de> [iommu]
Acked-by: Richard Kuo <rkuo@codeaurora.org> [hexagon]
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [s390]
Acked-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Acked-by: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no> [avr32]
Acked-by: Vineet Gupta <vgupta@synopsys.com> [arc]
Acked-by: Robin Murphy <robin.murphy@arm.com> [arm64 and dma-iommu]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge yet more updates from Andrew Morton:
- the rest of ocfs2
- various hotfixes, mainly MM
- quite a bit of misc stuff - drivers, fork, exec, signals, etc.
- printk updates
- firmware
- checkpatch
- nilfs2
- more kexec stuff than usual
- rapidio updates
- w1 things
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (111 commits)
ipc: delete "nr_ipc_ns"
kcov: allow more fine-grained coverage instrumentation
init/Kconfig: add clarification for out-of-tree modules
config: add android config fragments
init/Kconfig: ban CONFIG_LOCALVERSION_AUTO with allmodconfig
relay: add global mode support for buffer-only channels
init: allow blacklisting of module_init functions
w1:omap_hdq: fix regression
w1: add helper macro module_w1_family
w1: remove need for ida and use PLATFORM_DEVID_AUTO
rapidio/switches: add driver for IDT gen3 switches
powerpc/fsl_rio: apply changes for RIO spec rev 3
rapidio: modify for rev.3 specification changes
rapidio: change inbound window size type to u64
rapidio/idt_gen2: fix locking warning
rapidio: fix error handling in mbox request/release functions
rapidio/tsi721_dma: advance queue processing from transfer submit call
rapidio/tsi721: add messaging mbox selector parameter
rapidio/tsi721: add PCIe MRRS override parameter
rapidio/tsi721_dma: add channel mask and queue size parameters
...
For more targeted fuzzing, it's better to disable kernel-wide
instrumentation and instead enable it on a per-subsystem basis. This
follows the pattern of UBSAN and allows you to compile in the kcov
driver without instrumenting the whole kernel.
To instrument a part of the kernel, you can use either
# for a single file in the current directory
KCOV_INSTRUMENT_filename.o := y
or
# for all the files in the current directory (excluding subdirectories)
KCOV_INSTRUMENT := y
or
# (same as above)
ccflags-y += $(CFLAGS_KCOV)
or
# for all the files in the current directory (including subdirectories)
subdir-ccflags-y += $(CFLAGS_KCOV)
Link: http://lkml.kernel.org/r/1464008380-11405-1-git-send-email-vegard.nossum@oracle.com
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The crc32 test function measures the elapsed time in nanoseconds, but
uses 'struct timespec' for that. We want to remove timespec from the
kernel for y2038 compatibility, and ktime_get_ns() also helps make the
code simpler here.
It is also slightly better to use monontonic time, as we are only
interested in the time difference.
Link: http://lkml.kernel.org/r/20160617143932.3289626-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: "David S . Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a large enough area in the iommu bitmap is found but would span a
boundary we continue the search starting from the next bit position.
For large allocations this can lead to several useless invocations of
bitmap_find_next_zero_area() and iommu_is_span_boundary().
Continue the search from the start of the next segment (which is the
next bit position such that we'll not cross the same segment boundary
again).
Link: http://lkml.kernel.org/r/alpine.LFD.2.20.1606081910070.3211@schleppi
Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Extend the ratelimiting facility to print the amount of suppressed lines
when it is being released.
This use case is aimed at short-termed, burst-like users for which we
want to output the suppressed lines stats only once, after it has been
disposed of. For an example, see /dev/kmsg usage in a follow-on patch.
Also, change the printk() line we issue on release to not use
"callbacks" as it is misleading: we're not suppressing callbacks but
printk() calls.
This has been separated from a previous patch by Linus.
Link: http://lkml.kernel.org/r/20160716061745.15795-2-bp@alien8.de
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Dave Young <dyoung@redhat.com>
Cc: Franck Bui <fbui@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
handle_object_size_mismatch() used %pk to format a kernel pointer with
pr_err(). This seemed to be a misspelling for %pK, but using this to
format a kernel pointer does not make much sence here.
Therefore use %p instead, like in handle_missaligned_access().
Link: http://lkml.kernel.org/r/20160730083010.11569-1-nicolas.iooss_linux@m4x.org
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Radix trees may be used not only for storing page cache pages, so
unconditionally accounting radix tree nodes to the current memory cgroup
is bad: if a radix tree node is used for storing data shared among
different cgroups we risk pinning dead memory cgroups forever.
So let's only account radix tree nodes if it was explicitly requested by
passing __GFP_ACCOUNT to INIT_RADIX_TREE. Currently, we only want to
account page cache entries, so mark mapping->page_tree so.
Fixes: 58e698af4c ("radix-tree: account radix_tree_node to memory cgroup")
Link: http://lkml.kernel.org/r/1470057188-7864-1-git-send-email-vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org> [4.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull kbuild updates from Michal Marek:
- GCC plugin support by Emese Revfy from grsecurity, with a fixup from
Kees Cook. The plugins are meant to be used for static analysis of
the kernel code. Two plugins are provided already.
- reduction of the gcc commandline by Arnd Bergmann.
- IS_ENABLED / IS_REACHABLE macro enhancements by Masahiro Yamada
- bin2c fix by Michael Tautschnig
- setlocalversion fix by Wolfram Sang
* 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
gcc-plugins: disable under COMPILE_TEST
kbuild: Abort build on bad stack protector flag
scripts: Fix size mismatch of kexec_purgatory_size
kbuild: make samples depend on headers_install
Kbuild: don't add obj tree in additional includes
Kbuild: arch: look for generated headers in obtree
Kbuild: always prefix objtree in LINUXINCLUDE
Kbuild: avoid duplicate include path
Kbuild: don't add ../../ to include path
vmlinux.lds.h: replace config_enabled() with IS_ENABLED()
kconfig.h: allow to use IS_{ENABLE,REACHABLE} in macro expansion
kconfig.h: use already defined macros for IS_REACHABLE() define
export.h: use __is_defined() to check if __KSYM_* is defined
kconfig.h: use __is_defined() to check if MODULE is defined
kbuild: setlocalversion: print error to STDERR
Add sancov plugin
Add Cyclomatic complexity GCC plugin
GCC plugin infrastructure
Shared library support
Pull crypto fixes from Herbert Xu:
"This fixes a number of regressions in the marvell cesa driver caused
by the chaining work, and a regression in lib/mpi that leads to a
GFP_KERNEL allocation with preemption disabled"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: marvell - Don't copy IV vectors from the _process op for ciphers
lib/mpi: Fix SG miter leak
crypto: marvell - Update cache with input sg only when it is unmapped
crypto: marvell - Don't chain at DMA level when backlog is disabled
crypto: marvell - Fix memory leaks in TDMA chain for cipher requests
Pull x86 microcode updates from Thomas Gleixner:
- more work to make the microcode loader robust
- a fix for the micro code load precedence
- fixes for initrd loading with randomized memory
- less printk noise on SMP machines
* 'x86-microcode-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/asm, x86/microcode: Add __PAGE_OFFSET_BASE define on 32-bit
x86/microcode/intel: Fix initrd loading with CONFIG_RANDOMIZE_MEMORY=y
x86/microcode: Remove unused symbol exports
x86/microcode/intel: Do not issue microcode updates messages on each CPU
Documentation/microcode: Document some aspects for more clarity
x86/microcode/AMD: Make amd_ucode_patch[] static
x86/microcode/intel: Unexport save_mc_for_early()
x86/microcode/intel: Rename load_microcode_early() to find_microcode_patch()
x86/microcode: Propagate save_microcode_in_initrd() retval
x86/microcode: Get rid of find_cpio_data()'s dummy offset arg
lib/cpio: Make find_cpio_data()'s offset arg optional
x86/microcode: Fix suspend to RAM with builtin microcode
x86/microcode: Fix loading precedence
In mpi_read_raw_from_sgl we may leak the SG miter resouces after
reading the leading zeroes. This patch fixes this by stopping the
iteration once the leading zeroes have been read.
Fixes: 127827b9c2 ("lib/mpi: Do not do sg_virt")
Reported-by: Nicolai Stange <nicstange@gmail.com>
Tested-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Merge more updates from Andrew Morton:
"The rest of MM"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (101 commits)
mm, compaction: simplify contended compaction handling
mm, compaction: introduce direct compaction priority
mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations
mm, page_alloc: make THP-specific decisions more generic
mm, page_alloc: restructure direct compaction handling in slowpath
mm, page_alloc: don't retry initial attempt in slowpath
mm, page_alloc: set alloc_flags only once in slowpath
lib/stackdepot.c: use __GFP_NOWARN for stack allocations
mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUB
mm, kasan: account for object redzone in SLUB's nearest_obj()
mm: fix use-after-free if memory allocation failed in vma_adjust()
zsmalloc: Delete an unnecessary check before the function call "iput"
mm/memblock.c: fix index adjustment error in __next_mem_range_rev()
mem-hotplug: alloc new page from a nearest neighbor node when mem-offline
mm: optimize copy_page_to/from_iter_iovec
mm: add cond_resched() to generic_swapfile_activate()
Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are free elements"
mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode
mm: hwpoison: remove incorrect comments
make __section_nr() more efficient
...
This (large, atomic) allocation attempt can fail. We expect and handle
that, so avoid the scary warning.
Link: http://lkml.kernel.org/r/20160720151905.GB19146@node.shutemov.name
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For KASAN builds:
- switch SLUB allocator to using stackdepot instead of storing the
allocation/deallocation stacks in the objects;
- change the freelist hook so that parts of the freelist can be put
into the quarantine.
[aryabinin@virtuozzo.com: fixes]
Link: http://lkml.kernel.org/r/1468601423-28676-1-git-send-email-aryabinin@virtuozzo.com
Link: http://lkml.kernel.org/r/1468347165-41906-3-git-send-email-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Kuthonuzo Luruo <kuthonuzo.luruo@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
copy_page_to_iter_iovec() and copy_page_from_iter_iovec() copy some data
to userspace or from userspace. These functions have a fast path where
they map a page using kmap_atomic and a slow path where they use kmap.
kmap is slower than kmap_atomic, so the fast path is preferred.
However, on kernels without highmem support, kmap just calls
page_address, so there is no need to avoid kmap. On kernels without
highmem support, the fast path just increases code size (and cache
footprint) and it doesn't improve copy performance in any way.
This patch enables the fast path only if CONFIG_HIGHMEM is defined.
Code size reduced by this patch:
x86 (without highmem) 928
x86-64 960
sparc64 848
alpha 1136
pa-risc 1200
[akpm@linux-foundation.org: use IS_ENABLED(), per Andi]
Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1607221711410.4818@file01.intranet.prod.int.rdu2.redhat.com
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This changes the vfs dentry hashing to mix in the parent pointer at the
_beginning_ of the hash, rather than at the end.
That actually improves both the hash and the code generation, because we
can move more of the computation to the "static" part of the dcache
setup, and do less at lookup runtime.
It turns out that a lot of other hash users also really wanted to mix in
a base pointer as a 'salt' for the hash, and so the slightly extended
interface ends up working well for other cases too.
Users that want a string hash that is purely about the string pass in a
'salt' pointer of NULL.
* merge branch 'salted-string-hash':
fs/dcache.c: Save one 32-bit multiply in dcache lookup
vfs: make the string hashes salt the hash
important is the use of a ChaCha20-based CRNG for /dev/urandom, which
is faster, more efficient, and easier to make scalable for
silly/abusive userspace programs that want to read from /dev/urandom
in a tight loop on NUMA systems.
This set of patches also improves entropy gathering on VM's running on
Microsoft Azure, and will take advantage of a hw random number
generator (if present) to initialize the /dev/urandom pool.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJXla/8AAoJEPL5WVaVDYGj4xgH/3qDPHpFvynUMbMgg/MRQNt2
K1zhKKQjYuJ465aIiPME21tGC1C5JxO3a9mgDy0pfJmLVvCWRUx9dDtdI59Xkaev
KuVTbqdl8D75lftX3/jF3lXKd5dVLrW2V9hOMcESQqBFuoc1B8sJztR0upQsdGvA
I8+jjxRffSzxDZY4it6p/lqsTdEfwhA+mEc6ztZ+Ccsdlqd+GNCvB/YE4tk+U6x/
tGKaKfuiHSXpR4P9ks/L5gwaIcFQ6Y1rdVqd8YP3iC9YXGFZdJkRXklRRj1mCgWs
YoCDMATTos+6iVONft94fF5pW6HND1F3bxPNo/RpMZca3MlqbiCSij3ky+gqUc4=
=mSBt
-----END PGP SIGNATURE-----
Merge tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random
Pull random driver updates from Ted Ts'o:
"A number of improvements for the /dev/random driver; the most
important is the use of a ChaCha20-based CRNG for /dev/urandom, which
is faster, more efficient, and easier to make scalable for
silly/abusive userspace programs that want to read from /dev/urandom
in a tight loop on NUMA systems.
This set of patches also improves entropy gathering on VM's running on
Microsoft Azure, and will take advantage of a hw random number
generator (if present) to initialize the /dev/urandom pool"
(It turns out that the random tree hadn't been in linux-next this time
around, because it had been dropped earlier as being too quiet. Oh
well).
* tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random:
random: strengthen input validation for RNDADDTOENTCNT
random: add backtracking protection to the CRNG
random: make /dev/urandom scalable for silly userspace programs
random: replace non-blocking pool with a Chacha20-based CRNG
random: properly align get_random_int_hash
random: add interrupt callback to VMBus IRQ handler
random: print a warning for the first ten uninitialized random users
random: initialize the non-blocking pool via add_hwgenerator_randomness()