didn't factor in expected 'drop_writes' behavior for read IO).
- A dm-log bio operation flags fix for the broader block changes that
were merged during the 4.8 merge window.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJXwHX2AAoJEMUj8QotnQNaMdQIAJuCHedIKQxlsCH4BG20thwM
7+kPh68ZWOB5VYpVlm2sn0aJG0t2c2IsM2+AcQrwwcVsTjVkqu4s5XeqhBhkhvBE
xrRHdJU21K6ho3IFiMhscZYfhMGvptwddevOxnRLfCgBALTjWpCWCEeQWLe17QCt
klR0bvGckLp7dJavYmb/8MO7VqIQQufYCDjYqEdq4IQT+lKVf940X1bNx5+RpzAD
OCgFwmWFb1OWYsVKWnVqxL+QzQcIA84YpBMV+FKQSTDNTLYgDM1mPTxMOxVMCNLO
neCUh2WNetvoE9s69T/NmPkjzB3hNAmVhbuFT2SBJ7Bnf/lfxT4Zc6WYOeqqWKY=
=XAfD
-----END PGP SIGNATURE-----
Merge tag 'dm-4.8-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- another stable fix for DM flakey (that tweaks the previous fix that
didn't factor in expected 'drop_writes' behavior for read IO).
- a dm-log bio operation flags fix for the broader block changes that
were merged during the 4.8 merge window.
* tag 'dm-4.8-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm log: fix unitialized bio operation flags
dm flakey: fix reads to be issued if drop_writes configured
From Will Deacon:
* Fix a couple of thinkos in the CMDQ error handling and
short-descriptor page table code that have been there since
day one
* Disable stalling faults, since they may result in hardware
deadlock
* Fix an accidental BUG() when passing disable_bypass=1 on
the cmdline
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABAgAGBQJXwF/CAAoJECvwRC2XARrjEuAP/1cCyOuG7WuD9TBapWzDp1hf
rkmruS4SO6Jn+YnQ4ob81cE2kV11KdbwaO6pLcyqeH7d9dUzKMpnShEElC+jSeyO
kwD2eQnmZeNtcJNQe8cdjHl0YWN3W3Y+9QRI/cZy+PHK4u/SmGWURtp5EDcsHsbF
RMzF7HzKEA2Uu3CSerCXWY9AHSRFvPTWVJVxbTeeoH8B3NwZihTUj5fJiudG6hqp
YcSB1Y6kMMinydrn9hEtjgw6MHRp8wUwOZxJlQyMuULp22WDiwU+1sOkFAbR/4MA
scCptDHZJA7xm7WYc7UShk/feQNY5lbDXEdEuTv+mJR/nVdM2zUyvcGtItPTshG4
e507pRE4kpXcPvhUatY8pLuCJNGK9rsnRyJLRWeMAGnLodIi2pw2xYsqgPZE8lz/
DPJ6C66Q7M7O+KsU9AI6/IxfwW/FpmF0AMx6fLsnHYQlIGva+0Bdk9tjksRSkxcI
ZbmuUdoUmzPvxe+9XQemOqiskycEbYXQgBc8QxogX13ohYiH1h8axzL8hVKsoKy9
ZsTOf/IuTeO/OXyQZOGfK1m8miFUn9CNe0Ip2D4kpSNnH3Zi5tjXRofaSUwNiPIh
DsXkC6b+H8ibr5/43QfIAggLhQxMFiYSfQKQmpN2qAvMG5b6aYZac25Ep9P+cOn7
W93mu7eYz5ROMmPRUusF
=XLo2
-----END PGP SIGNATURE-----
Merge tag 'iommu-fixes-v4.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull IOMMU fixes from Joerg Roedel:
"Fixes from Will Deacon:
- fix a couple of thinkos in the CMDQ error handling and
short-descriptor page table code that have been there since day one
- disable stalling faults, since they may result in hardware deadlock
- fix an accidental BUG() when passing disable_bypass=1 on the
cmdline"
* tag 'iommu-fixes-v4.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/arm-smmu: Don't BUG() if we find aborting STEs with disable_bypass
iommu/arm-smmu: Disable stalling faults for all endpoints
iommu/arm-smmu: Fix CMDQ error handling
iommu/io-pgtable-arm-v7s: Fix attributes when splitting blocks
Pull block fixes from Jens Axboe:
"Here's a set of block fixes for the current 4.8-rc release. This
contains:
- a fix for a secure erase regression, from Adrian.
- a fix for an mmc use-after-free bug regression, also from Adrian.
- potential zero pointer deference in bdev freezing, from Andrey.
- a race fix for blk_set_queue_dying() from Bart.
- a set of xen blkfront fixes from Bob Liu.
- three small fixes for bcache, from Eric and Kent.
- a fix for a potential invalid NVMe state transition, from Gabriel.
- blk-mq CPU offline fix, preventing us from issuing and completing a
request on the wrong queue. From me.
- revert two previous floppy changes, since they caused a user
visibile regression. A better fix is in the works.
- ensure that we don't send down bios that have more than 256
elements in them. Fixes a crash with bcache, for example. From
Ming.
- a fix for deferencing an error pointer with cgroup writeback.
Fixes a regression. From Vegard"
* 'for-linus' of git://git.kernel.dk/linux-block:
mmc: fix use-after-free of struct request
Revert "floppy: refactor open() flags handling"
Revert "floppy: fix open(O_ACCMODE) for ioctl-only open"
fs/block_dev: fix potential NULL ptr deref in freeze_bdev()
blk-mq: improve warning for running a queue on the wrong CPU
blk-mq: don't overwrite rq->mq_ctx
block: make sure a big bio is split into at most 256 bvecs
nvme: Fix nvme_get/set_features() with a NULL result pointer
bdev: fix NULL pointer dereference
xen-blkfront: free resources if xlvbd_alloc_gendisk fails
xen-blkfront: introduce blkif_set_queue_limits()
xen-blkfront: fix places not updated after introducing 64KB page granularity
bcache: pr_err: more meaningful error message when nr_stripes is invalid
bcache: RESERVE_PRIO is too small by one when prio_buckets() is a power of two.
bcache: register_bcache(): call blkdev_put() when cache_alloc() fails
block: Fix race triggered by blk_set_queue_dying()
block: Fix secure erase
nvme: Prevent controller state invalid transition
For DAX inodes we need to be careful to never have page cache pages in
the mapping->page_tree. This radix tree should be composed only of DAX
exceptional entries and zero pages.
ltp's readahead02 test was triggering a warning because we were trying
to insert a DAX exceptional entry but found that a page cache page had
already been inserted into the tree. This page was being inserted into
the radix tree in response to a readahead(2) call.
Readahead doesn't make sense for DAX inodes, but we don't want it to
report a failure either. Instead, we just return success and don't do
any work.
Link: http://lkml.kernel.org/r/20160824221429.21158-1-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jan Kara <jack@suse.com>
Cc: <stable@vger.kernel.org> [4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The data offset for a dax region needs to account for a reservation in
the resource range. Otherwise, device-dax is allowing mappings directly
into the memmap or device-info-block area with crash signatures like the
following:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: get_zone_device_page+0x11/0x30
Call Trace:
follow_devmap_pmd+0x298/0x2c0
follow_page_mask+0x275/0x530
__get_user_pages+0xe3/0x750
__gfn_to_pfn_memslot+0x1b2/0x450 [kvm]
tdp_page_fault+0x130/0x280 [kvm]
kvm_mmu_page_fault+0x5f/0xf0 [kvm]
handle_ept_violation+0x94/0x180 [kvm_intel]
vmx_handle_exit+0x1d3/0x1440 [kvm_intel]
kvm_arch_vcpu_ioctl_run+0x81d/0x16a0 [kvm]
kvm_vcpu_ioctl+0x33c/0x620 [kvm]
do_vfs_ioctl+0xa2/0x5d0
SyS_ioctl+0x79/0x90
entry_SYSCALL_64_fastpath+0x1a/0xa4
Fixes: ab68f26221 ("/dev/dax, pmem: direct access to persistent memory")
Link: http://lkml.kernel.org/r/147205536732.1606.8994275381938837346.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Abhilash Kumar Mulumudi <m.abhilash-kumar@hpe.com>
Reported-by: Toshi Kani <toshi.kani@hpe.com>
Tested-by: Toshi Kani <toshi.kani@hpe.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
seq_read() is a nasty piece of work, not to mention buggy.
It has (I think) an old bug which allows unprivileged userspace to read
beyond the end of m->buf.
I was getting these:
BUG: KASAN: slab-out-of-bounds in seq_read+0xcd2/0x1480 at addr ffff880116889880
Read of size 2713 by task trinity-c2/1329
CPU: 2 PID: 1329 Comm: trinity-c2 Not tainted 4.8.0-rc1+ #96
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
Call Trace:
kasan_object_err+0x1c/0x80
kasan_report_error+0x2cb/0x7e0
kasan_report+0x4e/0x80
check_memory_region+0x13e/0x1a0
kasan_check_read+0x11/0x20
seq_read+0xcd2/0x1480
proc_reg_read+0x10b/0x260
do_loop_readv_writev.part.5+0x140/0x2c0
do_readv_writev+0x589/0x860
vfs_readv+0x7b/0xd0
do_readv+0xd8/0x2c0
SyS_readv+0xb/0x10
do_syscall_64+0x1b3/0x4b0
entry_SYSCALL64_slow_path+0x25/0x25
Object at ffff880116889100, in cache kmalloc-4096 size: 4096
Allocated:
PID = 1329
save_stack_trace+0x26/0x80
save_stack+0x46/0xd0
kasan_kmalloc+0xad/0xe0
__kmalloc+0x1aa/0x4a0
seq_buf_alloc+0x35/0x40
seq_read+0x7d8/0x1480
proc_reg_read+0x10b/0x260
do_loop_readv_writev.part.5+0x140/0x2c0
do_readv_writev+0x589/0x860
vfs_readv+0x7b/0xd0
do_readv+0xd8/0x2c0
SyS_readv+0xb/0x10
do_syscall_64+0x1b3/0x4b0
return_from_SYSCALL_64+0x0/0x6a
Freed:
PID = 0
(stack is not available)
Memory state around the buggy address:
ffff88011688a000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff88011688a080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff88011688a100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
^
ffff88011688a180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff88011688a200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
Disabling lock debugging due to kernel taint
This seems to be the same thing that Dave Jones was seeing here:
https://lkml.org/lkml/2016/8/12/334
There are multiple issues here:
1) If we enter the function with a non-empty buffer, there is an attempt
to flush it. But it was not clearing m->from after doing so, which
means that if we try to do this flush twice in a row without any call
to traverse() in between, we are going to be reading from the wrong
place -- the splat above, fixed by this patch.
2) If there's a short write to userspace because of page faults, the
buffer may already contain multiple lines (i.e. pos has advanced by
more than 1), but we don't save the progress that was made so the
next call will output what we've already returned previously. Since
that is a much less serious issue (and I have a headache after
staring at seq_read() for the past 8 hours), I'll leave that for now.
Link: http://lkml.kernel.org/r/1471447270-32093-1-git-send-email-vegard.nossum@oracle.com
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A bugfix in v4.8-rc2 introduced a harmless warning when
CONFIG_MEMCG_SWAP is disabled but CONFIG_MEMCG is enabled:
mm/memcontrol.c:4085:27: error: 'mem_cgroup_id_get_online' defined but not used [-Werror=unused-function]
static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)
This moves the function inside of the #ifdef block that hides the
calling function, to avoid the warning.
Fixes: 1f47b61fb4 ("mm: memcontrol: fix swap counter leak on swapout from offline cgroup")
Link: http://lkml.kernel.org/r/20160824113733.2776701-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current wording of the COMPACTION Kconfig help text doesn't
emphasise that disabling COMPACTION might cripple the page allocator
which relies on the compaction quite heavily for high order requests and
an unexpected OOM can happen with the lack of compaction. Make sure we
are vocal about that.
Link: http://lkml.kernel.org/r/20160823091726.GK23577@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 97f2645f35 ("tree-wide: replace config_enabled() with
IS_ENABLED()") mostly killed config_enabled(), but some new users have
appeared for v4.8-rc1. They are all used for a boolean option, so can
be replaced with IS_ENABLED() safely.
Link: http://lkml.kernel.org/r/1471970749-24867-1-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit bbeddf52ad ("printk: move braille console support into separate
braille.[ch] files") moved the parsing of braille-related options into
_braille_console_setup(), changing the type of variable str from char*
to char**. In this commit, memcmp(str, "brl,", 4) was correctly updated
to memcmp(*str, "brl,", 4) but not memcmp(str, "brl=", 4).
Update the code to make "brl=" option work again and replace memcmp()
with strncmp() to make the compiler able to detect such an issue.
Fixes: bbeddf52ad ("printk: move braille console support into separate braille.[ch] files")
Link: http://lkml.kernel.org/r/20160823165700.28952-1-nicolas.iooss_linux@m4x.org
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While adding proper userfaultfd_wp support with bits in pagetable and
swap entry to avoid false positives WP userfaults through swap/fork/
KSM/etc, I've been adding a framework that mostly mirrors soft dirty.
So I noticed in one place I had to add uffd_wp support to the pagetables
that wasn't covered by soft_dirty and I think it should have.
Example: in the THP migration code migrate_misplaced_transhuge_page()
pmd_mkdirty is called unconditionally after mk_huge_pmd.
entry = mk_huge_pmd(new_page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
That sets soft dirty too (it's a false positive for soft dirty, the soft
dirty bit could be more finegrained and transfer the bit like uffd_wp
will do.. pmd/pte_uffd_wp() enforces the invariant that when it's set
pmd/pte_write is not set).
However in the THP split there's no unconditional pmd_mkdirty after
mk_huge_pmd and pte_swp_mksoft_dirty isn't called after the migration
entry is created. The code sets the dirty bit in the struct page
instead of setting it in the pagetable (which is fully equivalent as far
as the real dirty bit is concerned, as the whole point of pagetable bits
is to be eventually flushed out of to the page, but that is not
equivalent for the soft-dirty bit that gets lost in translation).
This was found by code review only and totally untested as I'm working
to actually replace soft dirty and I don't have time to test potential
soft dirty bugfixes as well :).
Transfer the soft_dirty from pmd to pte during THP splits.
This fix avoids losing the soft_dirty bit and avoids userland memory
corruption in the checkpoint.
Fixes: eef1b3ba05 ("thp: implement split_huge_pmd()")
Link: http://lkml.kernel.org/r/1471610515-30229-2-git-send-email-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have scripts which write to certain fields on 3.18 kernels but this
seems to be failing on 4.4 kernels. An entry which we write to here is
xfrm_aevent_rseqth which is u32.
echo 4294967295 > /proc/sys/net/core/xfrm_aevent_rseqth
Commit 230633d109 ("kernel/sysctl.c: detect overflows when converting
to int") prevented writing to sysctl entries when integer overflow
occurs. However, this does not apply to unsigned integers.
Heinrich suggested that we introduce a new option to handle 64 bit
limits and set min as 0 and max as UINT_MAX. This might not work as it
leads to issues similar to __do_proc_doulongvec_minmax. Alternatively,
we would need to change the datatype of the entry to 64 bit.
static int __do_proc_doulongvec_minmax(void *data, struct ctl_table
{
i = (unsigned long *) data; //This cast is causing to read beyond the size of data (u32)
vleft = table->maxlen / sizeof(unsigned long); //vleft is 0 because maxlen is sizeof(u32) which is lesser than sizeof(unsigned long) on x86_64.
Introduce a new proc handler proc_douintvec. Individual proc entries
will need to be updated to use the new handler.
[akpm@linux-foundation.org: coding-style fixes]
Fixes: 230633d109 ("kernel/sysctl.c:detect overflows when converting to int")
Link: http://lkml.kernel.org/r/1471479806-5252-1-git-send-email-subashab@codeaurora.org
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Checking command line filenames that are outside the git tree can emit a
noisy and confusing message.
Quiet that message by redirecting stderr.
Verify that the command was executed successfully.
Fixes: 4cad35a7ca ("get_maintainer.pl: reduce need for command-line option -f")
Link: http://lkml.kernel.org/r/1970a1d2fecb258e384e2e4fdaacdc9ccf3e30a4.1470955439.git.joe@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Reported-by: Wolfram Sang <wsa@the-dreams.de>
Tested-by: Wolfram Sang <wsa@the-dreams.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Although sparse declares __builtin_bswap*(), it can't actually do
constant folding inside them (yet). As such, things like
switch (protocol) {
case htons(ETH_P_IP):
break;
}
which we do all over the place cause sparse to warn that it expects a
constant instead of a function call.
Disable __HAVE_BUILTIN_BSWAP*__ if __CHECKER__ is defined to avoid this.
Fixes: 7322dd755e ("byteswap: try to avoid __builtin_constant_p gcc bug")
Link: http://lkml.kernel.org/r/1470914102-26389-1-git-send-email-johannes@sipsolutions.net
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Florian Fainelli says:
====================
net: dsa: Make bcm_sf2 utilize b53_common
This patch series makes the bcm_sf2 driver utilize a large number of the core
functions offered by the b53_common driver since the SWITCH_CORE registers are
mostly register compatible with the switches driven by b53_common.
In order to accomplish that, we just override the dsa_driver_ops callbacks that
we need to. There are still integration specific logic from the bcm_sf2 that we
cannot absorb into b53_common because it is just not there, mostly in the area
of link management and power management, but most of the features are within
b53_common now: VLAN, FDB, bridge
Along the process, we also improve support for the BCM58xx SoCs, since those
also have the same version of the switching IP that 7445 has (for which bcm_sf2
was developed).
Changes in v3:
- rebase against 145dd5f9c8 ("net: flush the
softnet backlog in process context")
Changes in v2:
- rebased against "net: dsa: rename switch operations structure"
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we are using b53_common for most VLAN, FDB and bridge
operations, delete all the redundant code that we had in bcm_sf2.c to
keep only the integration specific logic that we have to deal with:
power management, link management and the external interfaces (RGMII,
MDIO).
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Broadcom Starfighter2 is almost entirely register compatible with
B53, yet for historical reasons came up first in the tree and is now
being updated to utilize b53_common.c to the fullest extent possible. A
few things need to be adjusted to allow that:
- the switch "core" registers currently operate on a 32-bit address,
whereas b53 passes a page + reg pair to offset from, so we need to
convert that, thankfully there is a generic formula to do that
- the link managemenent is not self contained with the B53/CORE register
set, but instead is in the SWITCH_REG block which is part of the
integration glue logic, so we keep that entirely custom here because
this really is part of the existing bcm_sf2 implementation
- there are additional power management constraints on the port's
memories that make us keep the port_enable/disable callbacks custom
for now, also, we support tagging whereas b53_common does not support
that yet
All the VLAN and bridge code is entirely identical though so, avoid
duplicating it. Other things will be migrated in the future like EEE and
possibly Wake-on-LAN.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to migrate the bcm_sf2 driver over to the b53 driver for most
VLAN/FDB/bridge operations, we need to add support for the "join all
VLANs" register and behavior which allows us to make a given port join
all VLANs and avoid setting specific VLAN entries when it is leaving the
bridge.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 58xx and 7445 chips use the Starfighter2 code, define its MIB layout
and introduce a helper function: is58xx() which checks for both of these
IDs for now.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allocate a device entry for the Broadcom BCM7445 integrated switch
currently backed by bcm_sf2.c. Since this is the latest generation, it
has 4 ARL entries, 4K VLANs and uses Port 8 for the CPU/IMP port.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to allow drivers to override specific dsa_switch_driver
callbacks, initialize ds->ops to b53_switch_ops earlier, which avoids
having to expose this structure to glue drivers.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko says:
====================
mlxsw: Introduce support for offload forward mark
Ido says:
This patchset enables the forwarding of certain control packets by the
device instead of relying on the CPU to do the forwarding.
The first two patches simplify the current switchdev offload forward
infrastructure and make it usable for stacked devices. This is done by
moving the packet and port marking to the bridge driver instead of the
switch driver.
Patches 3-5 add the mlxsw specific bits to support the forward mark.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of trapping certain packets to the CPU and then relying on it to
flood them we can instead make the device mirror them.
The following packet types are mirrored:
* DHCP: Broadcast packets that should be flooded by the device, but also
trapped in case CPU is running the DHCP server.
* IGMP query: Multicast packets that need to be forwarded to other
bridge ports, but also trapped so that receiving netdev will be marked
as a router port by the bridge driver.
* ARP request: Broadcast packets that should be forwarded to other
bridge ports, but also trapped in case requested IP is of the local
machine.
* ARP response: Unicast packets that should be forwarded by the bridge
but also trapped in case response is directed at us.
Set the trap action of such packets to mirror and mark them using
'offload_fwd_mark' to prevent the bridge driver from forwarding them
itself.
Note that OSPF packets are also marked despite their action being trap.
The reason for this is that the device traps such packets in the
pipeline after they were already flooded.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Up until now we only trapped packets to CPU, but we are going to allow
some packets to be mirrored (trap & forward) to CPU.
Extend the Rx listener with 'action' member.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of copying & pasting the same struct initialization for every
Rx listener, just use a macro.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
switchdev_port_fwd_mark_set() is used to set the 'offload_fwd_mark' of
port netdevs so that packets being flooded by the device won't be
flooded twice.
It works by assigning a unique identifier (the ifindex of the first
bridge port) to bridge ports sharing the same parent ID. This prevents
packets from being flooded twice by the same switch, but will flood
packets through bridge ports belonging to a different switch.
This method is problematic when stacked devices are taken into account,
such as VLANs. In such cases, a physical port netdev can have upper
devices being members in two different bridges, thus requiring two
different 'offload_fwd_mark's to be configured on the port netdev, which
is impossible.
The main problem is that packet and netdev marking is performed at the
physical netdev level, whereas flooding occurs between bridge ports,
which are not necessarily port netdevs.
Instead, packet and netdev marking should really be done in the bridge
driver with the switch driver only telling it which packets it already
forwarded. The bridge driver will mark such packets using the mark
assigned to the ingress bridge port and will prevent the packet from
being forwarded through any bridge port sharing the same mark (i.e.
having the same parent ID).
Remove the current switchdev 'offload_fwd_mark' implementation and
instead implement the proposed method. In addition, make rocker - the
sole user of the mark - use the proposed method.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
switchdev_port_same_parent_id() currently expects port netdevs, but we
need it to support stacked devices in the next patch, so drop the
NO_RECURSE flag.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When team is in bridge and LACP is utilized, LACPDU packets are pushed
to userspace using raw socket and there they are processed. However,
since 8626c56c82, LACPDU skbs are dropped by bridge rx_handler so
they never reach packet handlers in rx path. Fix this by explicity treat
LACPDUs to be pushed to exact delivery in team rx_handler.
Reported-by: Ido Schimmel <idosch@mellanox.com>
Fixes: 8626c56c82 ("bridge: fix potential use-after-free when hook returns QUEUE or STOLEN verdict")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove unused and useless priv_size member from struct devlink_ops.
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently in process_backlog(), the process_queue dequeuing is
performed with local IRQ disabled, to protect against
flush_backlog(), which runs in hard IRQ context.
This patch moves the flush operation to a work queue and runs the
callback with bottom half disabled to protect the process_queue
against dequeuing.
Since process_queue is now always manipulated in bottom half context,
the irq disable/enable pair around the dequeue operation are removed.
To keep the flush time as low as possible, the flush
works are scheduled on all online cpu simultaneously, using the
high priority work-queue and statically allocated, per cpu,
work structs.
Overall this change increases the time required to destroy a device
to improve slightly the packets reinjection performances.
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When I added support to export the vlan entry flags via xstats I forgot to
add support for the pvid since it is manually matched, so check if the
entry matches the vlan_group's pvid and set the flag appropriately.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ppe_cb->ppe_common_cb is being dereferenced before a null check is
being made on it. If ppe_cb->ppe_common_cb is null then we end up
with a null pointer dereference when assigning dsaf_dev. Fix this
by moving the initialisation of dsaf_dev once we know
ppe_cb->ppe_common_cb is OK to dereference.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Yisen Zhuang <yisen.zhuang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit b17c706987 ("loopback: sctp: add NETIF_F_SCTP_CSUM to device
features") added NETIF_F_SCTP_CRC to device features for lo device to
improve the performance of sctp over lo.
This patch is to add NETIF_F_SCTP_CRC to device features for veth to
improve the performance of sctp over veth.
Before this patch:
ip netns exec cs_client netperf -H 10.167.12.2 -t SCTP_STREAM -- -m 10K
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
212992 212992 10240 10.00 1117.16
After this patch:
ip netns exec cs_client netperf -H 10.167.12.2 -t SCTP_STREAM -- -m 10K
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
212992 212992 10240 10.20 1415.22
Tested-by: Li Shuang <tjlishuang@yeah.net>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the current kernel, `dlm_tool lockdebug` fails as below:
"dlm_tool lockdebug ED0BD86DCE724393918A1AE8FDBF1EE3
can't open /sys/kernel/debug/dlm/ED0BD86DCE724393918A1AE8FDBF1EE3:
Operation not permitted"
This is because table_open() depends on file->f_op to tell which
seq_file ops should be passed down. But, the original file ops in
file->f_op is replaced by "debugfs_full_proxy_file_operations" with
commit 49d200deaa ("debugfs: prevent access to removed files'
private data").
Currently, I can think up 2 solutions: 1st, replace
debugfs_create_file() with debugfs_create_file_unsafe();
2nd, make different table_open#() accordingly. The 1st one
is neat, but I don't thoroughly understand its risk. Maybe
someone has a better one.
Signed-off-by: Eric Ren <zren@suse.com>
Signed-off-by: David Teigland <teigland@redhat.com>
The bootloader (U-boot) sometimes uses this timer for various delays.
It uses it as a ongoing counter, and does comparisons on the current
counter value. The timer counter is never stopped.
In some cases when the user interacts with the bootloader, or lets
it idle for some time before loading Linux, the timer may expire,
and an interrupt will be pending. This results in an unexpected
interrupt when the timer interrupt is enabled by the kernel, at
which point the event_handler isn't set yet. This results in a NULL
pointer dereference exception, panic, and no way to reboot.
Clear any pending interrupts after we stop the timer in the probe
function to avoid this.
Cc: stable@vger.kernel.org
Signed-off-by: Chen-Yu Tsai <wens@csie.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Maxime Ripard <maxime.ripard@free-electrons.com>
Driver init code incorrectly uses the block base address and as a result
clears clocksource structure's fields instead of the hardware registers.
Commit 09a9982016 ("timekeeping: Lift clocksource cacheline
restriction") has changed the offsets within pistachio_clocksource
structure and what has previously gone unnoticed now leads to a kernel
panic during boot.
Signed-off-by: Marcin Nowakowski <marcin.nowakowski@imgtec.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
mck is needed to get the PIT working. Explicitly prepare_enable it instead
of assuming it is enabled.
This solves an issue where the system is freezing when the ETM/ETB drivers
are enabled.
Reported-by: Olivier Schonken <olivier.schonken@gmail.com>
Reviewed-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
When cp_rx_poll does not get enough packet, it will check the rx
interrupt status again. If so, it will jumpt to rx_status_loop again.
But the goto jump resets the rx variable as zero too.
As a result, it causes one possible deadloop. Assume this case,
rx_status_loop only gets the packet count which is less than budget,
and (cpr16(IntrStatus) & cp_rx_intr_mask) condition is always true.
It causes the deadloop happens and system is blocked.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change makes a common flow for Client instance open during init
and reset path. The Client subtask can handle both the cases instead of
making a separate notify_client_of_open call.
Also it may fix a bug during reset where the service task was leaking
some memory and causing issues.
Change-Id: I7232a32fd52b82e863abb54266fa83122f80a0cd
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts:
commit 33c133cc75 ("phy: IRQ cannot be shared")
On hardware with multiple PHY devices hooked up to the same IRQ line, allow
them to share it.
Sergei Shtylyov says:
"I'm not sure now what was the reason I concluded that the IRQ sharing
was impossible... most probably I thought that the kernel IRQ handling
code exited the loop over the IRQ actions once IRQ_HANDLED was returned
-- which is obviously not so in reality..."
Signed-off-by: Xander Huff <xander.huff@ni.com>
Signed-off-by: Nathan Sullivan <nathan.sullivan@ni.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we keep shadow copies of which interrupt sources are enabled
through the intrl2_*_mask_{set,clear} macros, make sure that the
ordering in which we do these two operations: update the copy, then
unmask the register is correct.
This is not currently a problem because we actually do not use them, but
we will in a subsequent patch optimizing register accesses, so better be
safe here.
Fixes: 80105befdb ("net: systemport: add Broadcom SYSTEMPORT Ethernet MAC driver")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We kept shadow copies of which interrupt sources we have enabled and
disabled, but due to an order bug in how intrl2_mask_clear was defined,
we could run into the following scenario:
CPU0 CPU1
intrl2_1_mask_clear(..)
sets INTRL2_CPU_MASK_CLEAR
bcm_sf2_switch_1_isr
read INTRL2_CPU_STATUS and masks with stale
irq1_mask value
updates irq1_mask value
Which would make us loop again and again trying to process and interrupt
we are not clearing since our copy of whether it was enabled before
still indicates it was not. Fix this by updating the shadow copy first,
and then unasking at the HW level.
Fixes: 246d7f773c ("net: dsa: add Broadcom SF2 switch driver")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
per_cpu_inc() is faster (at least on x86) than per_cpu_ptr(xxx)++;
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Should qdisc_alloc() fail, we must release the module refcount
we got right before.
Fixes: 6da7c8fcbc ("qdisc: allow setting default queuing discipline")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds SNMP counter for drops caused by MD5 mismatches.
The current syslog might help, but a counter is more precise and helps
monitoring.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP MD5 mismatches do increment sk_drops counter in all states but
SYN_RECV.
This is very unlikely to happen in the real world, but worth adding
to help diagnostics.
We increase the parent (listener) sk_drops.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes the following sparse warning:
drivers/net/vmxnet3/vmxnet3_drv.c:1645:1: warning:
symbol 'vmxnet3_rq_destroy_all_rxdataring' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Shrikrishna Khare <skhare@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix to return error code -ENOMEM from the dma_map_single error
handling case instead of 0, as done elsewhere in this function.
Fixes: 032c5e8284 ("Driver for IBM System i/p VNIC protocol")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>