Commit Graph

901182 Commits

Author SHA1 Message Date
Stefano Garzarella 7143b5ac57 io_uring: prevent sq_thread from spinning when it should stop
This patch drops 'cur_mm' before calling cond_resched(), to prevent
the sq_thread from spinning even when the user process is finished.

Before this patch, if the user process ended without closing the
io_uring fd, the sq_thread continues to spin until the
'sq_thread_idle' timeout ends.

In the worst case where the 'sq_thread_idle' parameter is bigger than
INT_MAX, the sq_thread will spin forever.

Fixes: 6c271ce2f1 ("io_uring: add submission polling")
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-21 09:16:10 -07:00
Filipe Manana a5ae50dea9 Btrfs: fix deadlock during fast fsync when logging prealloc extents beyond eof
While logging the prealloc extents of an inode during a fast fsync we call
btrfs_truncate_inode_items(), through btrfs_log_prealloc_extents(), while
holding a read lock on a leaf of the inode's root (not the log root, the
fs/subvol root), and then that function locks the file range in the inode's
iotree. This can lead to a deadlock when:

* the fsync is ranged

* the file has prealloc extents beyond eof

* writeback for a range different from the fsync range starts
  during the fsync

* the size of the file is not sector size aligned

Because when finishing an ordered extent we lock first a file range and
then try to COW the fs/subvol tree to insert an extent item.

The following diagram shows how the deadlock can happen.

           CPU 1                                        CPU 2

  btrfs_sync_file()
    --> for range [0, 1MiB)

    --> inode has a size of
        1MiB and has 1 prealloc
        extent beyond the
        i_size, starting at offset
        4MiB

    flushes all delalloc for the
    range [0MiB, 1MiB) and waits
    for the respective ordered
    extents to complete

                                              --> before task at CPU 1 locks the
                                                  inode, a write into file range
                                                  [1MiB, 2MiB + 1KiB) is made

                                              --> i_size is updated to 2MiB + 1KiB

                                              --> writeback is started for that
                                                  range, [1MiB, 2MiB + 4KiB)
                                                  --> end offset rounded up to
                                                      be sector size aligned

    btrfs_log_dentry_safe()
      btrfs_log_inode_parent()
        btrfs_log_inode()

          btrfs_log_changed_extents()
            btrfs_log_prealloc_extents()
              --> does a search on the
                  inode's root
              --> holds a read lock on
                  leaf X

                                              btrfs_finish_ordered_io()
                                                --> locks range [1MiB, 2MiB + 4KiB)
                                                    --> end offset rounded up
                                                        to be sector size aligned

                                                --> tries to cow leaf X, through
                                                    insert_reserved_file_extent()
                                                    --> already locked by the
                                                        task at CPU 1

              btrfs_truncate_inode_items()

                --> gets an i_size of
                    2MiB + 1KiB, which is
                    not sector size
                    aligned

                --> tries to lock file
                    range [2MiB, (u64)-1)
                    --> the start range
                        is rounded down
                        from 2MiB + 1K
                        to 2MiB to be sector
                        size aligned

                    --> but the subrange
                        [2MiB, 2MiB + 4KiB) is
                        already locked by
                        task at CPU 2 which
                        is waiting to get a
                        write lock on leaf X
                        for which we are
                        holding a read lock

                                *** deadlock ***

This results in a stack trace like the following, triggered by test case
generic/561 from fstests:

  [ 2779.973608] INFO: task kworker/u8:6:247 blocked for more than 120 seconds.
  [ 2779.979536]       Not tainted 5.6.0-rc2-btrfs-next-53 #1
  [ 2779.984503] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 2779.990136] kworker/u8:6    D    0   247      2 0x80004000
  [ 2779.990457] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
  [ 2779.990466] Call Trace:
  [ 2779.990491]  ? __schedule+0x384/0xa30
  [ 2779.990521]  schedule+0x33/0xe0
  [ 2779.990616]  btrfs_tree_read_lock+0x19e/0x2e0 [btrfs]
  [ 2779.990632]  ? remove_wait_queue+0x60/0x60
  [ 2779.990730]  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
  [ 2779.990782]  btrfs_search_slot+0x510/0x1000 [btrfs]
  [ 2779.990869]  btrfs_lookup_file_extent+0x4a/0x70 [btrfs]
  [ 2779.990944]  __btrfs_drop_extents+0x161/0x1060 [btrfs]
  [ 2779.990987]  ? mark_held_locks+0x6d/0xc0
  [ 2779.990994]  ? __slab_alloc.isra.49+0x99/0x100
  [ 2779.991060]  ? insert_reserved_file_extent.constprop.19+0x64/0x300 [btrfs]
  [ 2779.991145]  insert_reserved_file_extent.constprop.19+0x97/0x300 [btrfs]
  [ 2779.991222]  ? start_transaction+0xdd/0x5c0 [btrfs]
  [ 2779.991291]  btrfs_finish_ordered_io+0x4f4/0x840 [btrfs]
  [ 2779.991405]  btrfs_work_helper+0xaa/0x720 [btrfs]
  [ 2779.991432]  process_one_work+0x26d/0x6a0
  [ 2779.991460]  worker_thread+0x4f/0x3e0
  [ 2779.991481]  ? process_one_work+0x6a0/0x6a0
  [ 2779.991489]  kthread+0x103/0x140
  [ 2779.991499]  ? kthread_create_worker_on_cpu+0x70/0x70
  [ 2779.991515]  ret_from_fork+0x3a/0x50
  (...)
  [ 2780.026211] INFO: task fsstress:17375 blocked for more than 120 seconds.
  [ 2780.027480]       Not tainted 5.6.0-rc2-btrfs-next-53 #1
  [ 2780.028482] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 2780.030035] fsstress        D    0 17375  17373 0x00004000
  [ 2780.030038] Call Trace:
  [ 2780.030044]  ? __schedule+0x384/0xa30
  [ 2780.030052]  schedule+0x33/0xe0
  [ 2780.030075]  lock_extent_bits+0x20c/0x320 [btrfs]
  [ 2780.030094]  ? btrfs_truncate_inode_items+0xf4/0x1150 [btrfs]
  [ 2780.030098]  ? rcu_read_lock_sched_held+0x59/0xa0
  [ 2780.030102]  ? remove_wait_queue+0x60/0x60
  [ 2780.030122]  btrfs_truncate_inode_items+0x133/0x1150 [btrfs]
  [ 2780.030151]  ? btrfs_set_path_blocking+0xb2/0x160 [btrfs]
  [ 2780.030165]  ? btrfs_search_slot+0x379/0x1000 [btrfs]
  [ 2780.030195]  btrfs_log_changed_extents.isra.8+0x841/0x93e [btrfs]
  [ 2780.030202]  ? do_raw_spin_unlock+0x49/0xc0
  [ 2780.030215]  ? btrfs_get_num_csums+0x10/0x10 [btrfs]
  [ 2780.030239]  btrfs_log_inode+0xf83/0x1124 [btrfs]
  [ 2780.030251]  ? __mutex_unlock_slowpath+0x45/0x2a0
  [ 2780.030275]  btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
  [ 2780.030282]  ? dget_parent+0xa1/0x370
  [ 2780.030309]  btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
  [ 2780.030329]  btrfs_sync_file+0x3f3/0x490 [btrfs]
  [ 2780.030339]  do_fsync+0x38/0x60
  [ 2780.030343]  __x64_sys_fdatasync+0x13/0x20
  [ 2780.030345]  do_syscall_64+0x5c/0x280
  [ 2780.030348]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
  [ 2780.030356] RIP: 0033:0x7f2d80f6d5f0
  [ 2780.030361] Code: Bad RIP value.
  [ 2780.030362] RSP: 002b:00007ffdba3c8548 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
  [ 2780.030364] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2d80f6d5f0
  [ 2780.030365] RDX: 00007ffdba3c84b0 RSI: 00007ffdba3c84b0 RDI: 0000000000000003
  [ 2780.030367] RBP: 000000000000004a R08: 0000000000000001 R09: 00007ffdba3c855c
  [ 2780.030368] R10: 0000000000000078 R11: 0000000000000246 R12: 00000000000001f4
  [ 2780.030369] R13: 0000000051eb851f R14: 00007ffdba3c85f0 R15: 0000557a49220d90

So fix this by making btrfs_truncate_inode_items() not lock the range in
the inode's iotree when the target root is a log root, since it's not
needed to lock the range for log roots as the protection from the inode's
lock and log_mutex are all that's needed.

Fixes: 28553fa992 ("Btrfs: fix race between shrinking truncate and fiemap")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-02-21 16:21:19 +01:00
Logan Gunthorpe 3b7830904e nvme-multipath: Fix memory leak with ana_log_buf
kmemleak reports a memory leak with the ana_log_buf allocated by
nvme_mpath_init():

unreferenced object 0xffff888120e94000 (size 8208):
  comm "nvme", pid 6884, jiffies 4295020435 (age 78786.312s)
    hex dump (first 32 bytes):
      00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  ................
      01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  ................
    backtrace:
      [<00000000e2360188>] kmalloc_order+0x97/0xc0
      [<0000000079b18dd4>] kmalloc_order_trace+0x24/0x100
      [<00000000f50c0406>] __kmalloc+0x24c/0x2d0
      [<00000000f31a10b9>] nvme_mpath_init+0x23c/0x2b0
      [<000000005802589e>] nvme_init_identify+0x75f/0x1600
      [<0000000058ef911b>] nvme_loop_configure_admin_queue+0x26d/0x280
      [<00000000673774b9>] nvme_loop_create_ctrl+0x2a7/0x710
      [<00000000f1c7a233>] nvmf_dev_write+0xc66/0x10b9
      [<000000004199f8d0>] __vfs_write+0x50/0xa0
      [<0000000065466fef>] vfs_write+0xf3/0x280
      [<00000000b0db9a8b>] ksys_write+0xc6/0x160
      [<0000000082156b91>] __x64_sys_write+0x43/0x50
      [<00000000c34fbb6d>] do_syscall_64+0x77/0x2f0
      [<00000000bbc574c9>] entry_SYSCALL_64_after_hwframe+0x49/0xbe

nvme_mpath_init() is called by nvme_init_identify() which is called in
multiple places (nvme_reset_work(), nvme_passthru_end(), etc). This
means nvme_mpath_init() may be called multiple times before
nvme_mpath_uninit() (which is only called on nvme_free_ctrl()).

When nvme_mpath_init() is called multiple times, it overwrites the
ana_log_buf pointer with a new allocation, thus leaking the previous
allocation.

To fix this, free ana_log_buf before allocating a new one.

Fixes: 0d0b660f21 ("nvme: add ANA support")
Cc: <stable@vger.kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-02-21 23:52:25 +09:00
Zenghui Yu 2546287c5f genirq/irqdomain: Make sure all irq domain flags are distinct
This was noticed when printing debugfs for MSIs on my ARM64 server.  The
new dstate IRQD_MSI_NOMASK_QUIRK came out surprisingly while it should only
be the x86 stuff for the time being...

The new MSI quirk flag uses the same bit as IRQ_DOMAIN_NAME_ALLOCATED which
is oddly defined as bit 6 for no good reason.

Switch it to the non used bit 1.

Fixes: 6f1a4891a5 ("x86/apic/msi: Plug non-maskable MSI affinity race")
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200221020725.2038-1-yuzenghui@huawei.com
2020-02-21 11:29:15 +01:00
Randy Dunlap 4c5fd3b791 zonefs: fix documentation typos etc.
Fix typos, spellos, etc. in zonefs.txt.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Damien Le Moal <Damien.LeMoal@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2020-02-21 18:09:26 +09:00
Guo Ren 0b9f386c4b csky: Implement copy_thread_tls
This is required for clone3 which passes the TLS value through a
struct rather than a register.

Cc: Amanieu d'Antras <amanieu@gmail.com>
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
2020-02-21 15:43:25 +08:00
MaJun 5b49c82dad csky: Add PCI support
Add the pci related code for csky arch to support basic pci virtual
function, such as qemu virt-pci-9pfs.

Signed-off-by: MaJun <majun258@linux.alibaba.com>
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
2020-02-21 15:43:25 +08:00
Ma Jun dc2efc0028 csky: Minimize defconfig to support buildroot config.fragment
Some bsp (eg: buildroot) has defconfig.fragment design to add more
configs into the defconfig in linux source code tree. For example,
we could put different cpu configs into different defconfig.fragments,
but they all use the same defconfig in Linux.

Signed-off-by: Ma Jun <majun258@linux.alibaba.com>
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
2020-02-21 15:43:25 +08:00
Guo Ren d46869aaab csky: Add setup_initrd check code
We should give some necessary check for initrd just like other
architectures and it seems that setup_initrd() could be a common
code for all architectures.

Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
2020-02-21 15:43:25 +08:00
Krzysztof Kozlowski 4ec575b785 csky: Cleanup old Kconfig options
CONFIG_CLKSRC_OF is gone since commit bb0eb050a5
("clocksource/drivers: Rename CLKSRC_OF to TIMER_OF").  The platform
already selects TIMER_OF.

CONFIG_HAVE_DMA_API_DEBUG is gone since commit 6e88628d03 ("dma-debug:
remove CONFIG_HAVE_DMA_API_DEBUG").

CONFIG_DEFAULT_DEADLINE is gone since commit f382fb0bce ("block:
remove legacy IO schedulers").

Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
2020-02-21 15:43:25 +08:00
Randy Dunlap bebd26ab62 arch/csky: fix some Kconfig typos
Fix wording in help text for the CPU_HAS_LDSTEX symbol.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Guo Ren <guoren@kernel.org>
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
2020-02-21 15:43:25 +08:00
Guo Ren 2305f60b76 csky: Fixup compile warning for three unimplemented syscalls
Implement fstat64, fstatat64, clone3 syscalls to fixup
checksyscalls.sh compile warnings.

Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
2020-02-21 15:43:25 +08:00
Guo Ren 9025fd48a8 csky: Remove unused cache implementation
Only for coding convention, these codes are unnecessary for abiv2.

Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
2020-02-21 15:43:25 +08:00
Guo Ren 359ae00d12 csky: Fixup ftrace modify panic
During ftrace init, linux will replace all function prologues
(call_mcout) with nops, but it need flush_dcache and
invalidate_icache to make it work. So flush_cache functions
couldn't be nested called by ftrace framework.

Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
2020-02-21 15:43:24 +08:00
Guo Ren 997153b9a7 csky: Add flush_icache_mm to defer flush icache all
Some CPUs don't support icache.va instruction to maintain the whole
smp cores' icache. Using icache.all + IPI casue a lot on performace
and using defer mechanism could reduce the number of calling icache
_flush_all functions.

Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
2020-02-21 15:43:24 +08:00
Guo Ren cc1f6563a9 csky: Optimize abiv2 copy_to_user_page with VM_EXEC
Only when vma is for VM_EXEC, we need sync dcache & icache. eg:
 - gdb ptrace modify user space instruction code area.

Add VM_EXEC condition to reduce unnecessary cache flush.

The abiv1 cpus' cache are all VIPT, so we still need to deal with
dcache aliasing problem. But there is optimized way to use cache
color, just like what's done in arch/csky/abiv1/inc/abi/page.h.

Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
2020-02-21 15:43:24 +08:00
Guo Ren d936a7e708 csky: Enable defer flush_dcache_page for abiv2 cpus (807/810/860)
Instead of flushing cache per update_mmu_cache() called, we use
flush_dcache_page to reduce the frequency of flashing the cache.

As abiv2 cpus are all PIPT for icache & dcache, we needn't handle
dcache aliasing problem. But their icache can't snoop dcache, so
we still need sync_icache_dcache in update_mmu_cache().

Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
2020-02-21 15:43:24 +08:00
Guo Ren a117673413 csky: Remove unnecessary flush_icache_* implementation
The abiv2 CPUs are all PIPT cache, so there is no need to implement
flush_icache_page function.

The function flush_icache_user_range hasn't been used, so just
remove it.

The function flush_cache_range is not necessary for PIPT cache when
tlb mapping changed.

Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
2020-02-21 15:43:24 +08:00
Guo Ren 761b4f694c csky: Support icache flush without specific instructions
Some CPUs don't support icache specific instructions to flush icache
lines in broadcast way. We use cpu control registers to flush local
icache and use IPI to notify other cores.

Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
2020-02-21 15:43:24 +08:00
Guo Ren a736fa1ed7 csky/Kconfig: Add Kconfig.platforms to support some drivers
Such as snps,dw-apb-ictl

Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
2020-02-21 15:43:24 +08:00
Guo Ren c9492737b2 csky/smp: Fixup boot failed when CONFIG_SMP
If we use a non-ipi-support interrupt controller, it will cause panic here.
We should let cpu up and work with CONFIG_SMP, when we use a non-ipi intc.

Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
2020-02-21 15:43:24 +08:00
Guo Ren f8e17c17b8 csky: Set regs->usp to kernel sp, when the exception is from kernel
In the past, we didn't care about kernel sp when saving pt_reg. But in some
cases, we still need pt_reg->usp to represent the kernel stack before enter
exception.

For cmpxhg in atomic.S, we need save and restore usp for above.

Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
2020-02-21 15:43:24 +08:00
Guo Ren 7f4a567332 csky/mm: Fixup export invalid_pte_table symbol
There is no present bit in csky pmd hardware, so we need to prepare invalid_pte_table
for empty pmd entry and the functions (pmd_none & pmd_present) in pgtable.h need
invalid_pte_talbe to get result. If a module use these functions, we need export the
symbol for it.

Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Cc: Mo Qihui <qihui.mo@verisilicon.com>
Cc: Zhange Jian <zhang_jian5@dahuatech.com>
2020-02-21 15:43:24 +08:00
Guo Ren f136008f31 csky: Separate fixaddr_init from highmem
After fixaddr_init is separated from highmem, we could use tcm
without highmem selected. (610 (abiv1) don't support highmem,
but it could use tcm now.)

Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
2020-02-21 15:43:24 +08:00
Guo Ren f525bb2c9e csky: Tightly-Coupled Memory or Sram support
The implementation are not only used by TCM but also used by sram on
SOC bus. It follow existed linux tcm software interface, so that old
tcm application codes could be re-used directly.

Software interface list in asm/tcm.h:
 - Variables/Const: 	__tcmdata, __tcmconst
 - Functions:		__tcmfunc, __tcmlocalfunc
 - Malloc/Free:		tcm_alloc, tcm_free

In linux menuconfig:
 - Choose a TCM contain instrctions + data or separated in ITCM/DTCM.
 - Determine TCM_BASE (DTCM_BASE) in phyiscal address.
 - Determine size of TCM or ITCM(DTCM) in page counts.

Here is hello tcm example from Documentation/arm/tcm.rst which could
be directly used:

/* Uninitialized data */
static u32 __tcmdata tcmvar;
/* Initialized data */
static u32 __tcmdata tcmassigned = 0x2BADBABEU;
/* Constant */
static const u32 __tcmconst tcmconst = 0xCAFEBABEU;

static void __tcmlocalfunc tcm_to_tcm(void)
{
	int i;
	for (i = 0; i < 100; i++)
		tcmvar ++;
}

static void __tcmfunc hello_tcm(void)
{
	/* Some abstract code that runs in ITCM */
	int i;
	for (i = 0; i < 100; i++) {
		tcmvar ++;
	}
	tcm_to_tcm();
}

static void __init test_tcm(void)
{
	u32 *tcmem;
	int i;

	hello_tcm();
	printk("Hello TCM executed from ITCM RAM\n");

	printk("TCM variable from testrun: %u @ %p\n", tcmvar, &tcmvar);
	tcmvar = 0xDEADBEEFU;
	printk("TCM variable: 0x%x @ %p\n", tcmvar, &tcmvar);

	printk("TCM assigned variable: 0x%x @ %p\n", tcmassigned, &tcmassigned);

	printk("TCM constant: 0x%x @ %p\n", tcmconst, &tcmconst);

	/* Allocate some TCM memory from the pool */
	tcmem = tcm_alloc(20);
	if (tcmem) {
		printk("TCM Allocated 20 bytes of TCM @ %p\n", tcmem);
		tcmem[0] = 0xDEADBEEFU;
		tcmem[1] = 0x2BADBABEU;
		tcmem[2] = 0xCAFEBABEU;
		tcmem[3] = 0xDEADBEEFU;
		tcmem[4] = 0x2BADBABEU;
		for (i = 0; i < 5; i++)
			printk("TCM tcmem[%d] = %08x\n", i, tcmem[i]);
		tcm_free(tcmem, 20);
	}
}

TODO:
 - Separate fixup mapping from highmem
 - Support abiv1

Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
2020-02-21 15:43:24 +08:00
Mao Han 2f78c73f78 csky: Initial stack protector support
This is a basic -fstack-protector support without per-task canary
switching. The protector will report something like when stack
corruption is detected:

It's tested with strcpy local array overflow in sys_kill and get:
stack-protector: Kernel stack is corrupted in: sys_kill+0x23c/0x23c

TODO:
 - Support task switch for different cannary

Signed-off-by: Mao Han <han_mao@c-sky.com>
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
2020-02-21 15:43:24 +08:00
Guo Ren fd1d98650a MAINTAINERS: csky: Add mailing list for csky
Add mailing list and it's convenient for maintain C-SKY
subsystem.

Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
2020-02-21 15:43:24 +08:00
Suraj Jitindar Singh df3da4ea5a ext4: fix potential race between s_group_info online resizing and access
During an online resize an array of pointers to s_group_info gets replaced
so it can get enlarged. If there is a concurrent access to the array in
ext4_get_group_info() and this memory has been reused then this can lead to
an invalid memory access.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
Link: https://lore.kernel.org/r/20200221053458.730016-3-tytso@mit.edu
Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Balbir Singh <sblbir@amazon.com>
Cc: stable@kernel.org
2020-02-21 00:38:12 -05:00
Theodore Ts'o 1d0c3924a9 ext4: fix potential race between online resizing and write operations
During an online resize an array of pointers to buffer heads gets
replaced so it can get enlarged.  If there is a racing block
allocation or deallocation which uses the old array, and the old array
has gotten reused this can lead to a GPF or some other random kernel
memory getting modified.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
Link: https://lore.kernel.org/r/20200221053458.730016-2-tytso@mit.edu
Reported-by: Suraj Jitindar Singh <surajjs@amazon.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2020-02-21 00:37:09 -05:00
Dave Airlie 97d9a4e961 Merge tag 'drm-intel-fixes-2020-02-20' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
drm/i915 fixes for v5.6-rc3:
- Workaround missing Display Stream Compression (DSC) state readout by
  forcing modeset when its enabled at probe
- Fix EHL port clock voltage level requirements
- Fix queuing retire workers on the virtual engine
- Fix use of partially initialized waiters
- Stop using drm_pci_alloc/drm_pci/free
- Fix rewind of RING_TAIL by forcing a context reload
- Fix locking on resetting ring->head
- Propagate our bug filing URL change to stable kernels

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/87y2sxtsrd.fsf@intel.com
2020-02-21 12:46:54 +10:00
Dave Airlie c1368b347f drm-misc-fixes for v5.6-rc3:
- Fix dt binding for sunxi.
 - Allow only 1 rotation argument, and allow 0 rotation in video cmdline.
 - Small compiler warning fix for panfrost.
 - Fix when using performance counters in panfrost when using per fd address space.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEuXvWqAysSYEJGuVH/lWMcqZwE8MFAl5OWawACgkQ/lWMcqZw
 E8M3UxAAlUvBWQvrGz6bLD3NANF5nW5KjARPNFhwr7uyI67MExZoBh9WqJejhoQ6
 4dMv9BwT81EXhcaxyNjdshwkHktsvPr/ouEdO256e0UC1hs7zWOEapL7xXlV7dwe
 pvEUUnG07Umh50Df39jcOla4YgnoqYRwW7E7SMadvDuo81UJ6+Daf8we+PO5w2zD
 IbZsOfPBMn5NhyzXgnynlp3Y8df521EDb71x3R4d4vAyOoRE2Axmd7xZuqqkt59P
 BDxj02glXIZsSt5OLTPFQdlv7rYXo/Y52wulBDIDup6N/wD+9jlSMFL+OOQby3rP
 Q5Ve6TkLrSdkZWFVNsMMubylEw+CtNYForZb9J9uo0M7+PsP3tApLP1CYbi5lvI0
 yIff8986H5U8I3DaETugwyPTMdnWnnqRsQN57A8WYbQV5YLSx7bUqV0bgW6pucJP
 yC0e0h7367sgYvtENCIxvQ1sNUxiEz0QfppN1xW55JLsEerghomF8vzQNQJd1/Iy
 4GnHdvsB6NrBH1Ebzu3Ibj5hj5Y15znJlfhgFHuUwY0aiAW5cf4a+wH7EQdTt7T9
 ufBM9DFiySBE4xhffHo8JpEMOQVrabBfZzs8qg0RMT899DMPTpjW2OIoblDfuck0
 7LYfV/xU9qJMSsBA9X4G3+F/cH7EFikdNENEwJ2hyv04unpc/Ww=
 =NWxB
 -----END PGP SIGNATURE-----

Merge tag 'drm-misc-fixes-2020-02-20' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes

drm-misc-fixes for v5.6-rc3:
- Fix dt binding for sunxi.
- Allow only 1 rotation argument, and allow 0 rotation in video cmdline.
- Small compiler warning fix for panfrost.
- Fix when using performance counters in panfrost when using per fd address space.

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/f5a6370d-9898-6c72-43e4-5bb56a99b6f2@linux.intel.com
2020-02-21 12:31:16 +10:00
David S. Miller 36a44bcdd8 Merge branch 'bnxt_en-shutdown-and-kexec-kdump-related-fixes'
Michael Chan says:

====================
bnxt_en: shutdown and kexec/kdump related fixes.

2 small patches to fix kexec shutdown and kdump kernel driver init issues.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20 16:05:42 -08:00
Vasundhara Volam 8743db4a9a bnxt_en: Issue PCIe FLR in kdump kernel to cleanup pending DMAs.
If crashed kernel does not shutdown the NIC properly, PCIe FLR
is required in the kdump kernel in order to initialize all the
functions properly.

Fixes: d629522e1d ("bnxt_en: Reduce memory usage when running in kdump kernel.")
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20 16:05:42 -08:00
Vasundhara Volam 5567ae4a8d bnxt_en: Improve device shutdown method.
Especially when bnxt_shutdown() is called during kexec, we need to
disable MSIX and disable Bus Master to completely quiesce the device.
Make these 2 calls unconditionally in the shutdown method.

Fixes: c20dc142dd ("bnxt_en: Disable bus master during PCI shutdown and driver unload.")
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20 16:05:42 -08:00
Nikolay Aleksandrov 3a20773bee net: netlink: cap max groups which will be considered in netlink_bind()
Since nl_groups is a u32 we can't bind more groups via ->bind
(netlink_bind) call, but netlink has supported more groups via
setsockopt() for a long time and thus nlk->ngroups could be over 32.
Recently I added support for per-vlan notifications and increased the
groups to 33 for NETLINK_ROUTE which exposed an old bug in the
netlink_bind() code causing out-of-bounds access on archs where unsigned
long is 32 bits via test_bit() on a local variable. Fix this by capping the
maximum groups in netlink_bind() to BITS_PER_TYPE(u32), effectively
capping them at 32 which is the minimum of allocated groups and the
maximum groups which can be bound via netlink_bind().

CC: Christophe Leroy <christophe.leroy@c-s.fr>
CC: Richard Guy Briggs <rgb@redhat.com>
Fixes: 4f52090052 ("netlink: have netlink per-protocol bind function return an error code.")
Reported-by: Erhard F. <erhard_f@mailbox.org>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20 16:02:08 -08:00
Tim Harvey 971617c3b7 net: thunderx: workaround BGX TX Underflow issue
While it is not yet understood why a TX underflow can easily occur
for SGMII interfaces resulting in a TX wedge. It has been found that
disabling/re-enabling the LMAC resolves the issue.

Signed-off-by: Tim Harvey <tharvey@gateworks.com>
Reviewed-by: Robert Jones <rjones@gateworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20 15:49:20 -08:00
Shannon Nelson 68b759a75d ionic: fix fw_status read
The fw_status field is only 8 bits, so fix the read.  Also,
we only want to look at the one status bit, to allow for future
use of the other bits, and watch for a bad PCI read.

Fixes: 97ca486592 ("ionic: add heartbeat check")
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20 15:48:04 -08:00
Linus Torvalds ebe7acadf5 Merge branch 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity
Pull IMA fixes from Mimi Zohar:
 "Two bug fixes and an associated change for each.

  The one that adds SM3 to the IMA list of supported hash algorithms is
  a simple change, but could be considered a new feature"

* 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
  ima: add sm3 algorithm to hash algorithm configuration list
  crypto: rename sm3-256 to sm3 in hash_algo_name
  efi: Only print errors about failing to get certs if EFI vars are found
  x86/ima: use correct identifier for SetupMode variable
2020-02-20 15:15:16 -08:00
Roman Kiryanov 98bda63e20 net: disable BRIDGE_NETFILTER by default
The description says 'If unsure, say N.' but
the module is built as M by default (once
the dependencies are satisfied).

When the module is selected (Y or M), it enables
NETFILTER_FAMILY_BRIDGE and SKB_EXTENSIONS
which alter kernel internal structures.

We (Android Studio Emulator) currently do not
use this module and think this it is more consistent
to have it disabled by default as opposite to
disabling it explicitly to prevent enabling
NETFILTER_FAMILY_BRIDGE and SKB_EXTENSIONS.

Signed-off-by: Roman Kiryanov <rkir@google.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20 15:02:02 -08:00
Alexandre Belloni ac2fcfa9fd net: macb: Properly handle phylink on at91rm9200
at91ether_init was handling the phy mode and speed but since the switch to
phylink, the NCFGR register got overwritten by macb_mac_config(). The issue
is that the RM9200_RMII bit and the MACB_CLK_DIV32 field are cleared
but never restored as they conflict with the PAE, GBE and PCSSEL bits.

Add new capability to differentiate between EMAC and the other versions of
the IP and use it to set and avoid clearing the relevant bits.

Also, this fixes a NULL pointer dereference in macb_mac_link_up as the EMAC
doesn't use any rings/bufffers/queues.

Fixes: 7897b071ac ("net: macb: convert to phylink")
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20 15:00:31 -08:00
Thomas Gleixner 8645e56a4a xen: Enable interrupts when calling _cond_resched()
xen_maybe_preempt_hcall() is called from the exception entry point
xen_do_hypervisor_callback with interrupts disabled.

_cond_resched() evades the might_sleep() check in cond_resched() which
would have caught that and schedule_debug() unfortunately lacks a check
for irqs_disabled().

Enable interrupts around the call and use cond_resched() to catch future
issues.

Fixes: fdfd811ddd ("x86/xen: allow privcmd hypercalls to be preempted")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/878skypjrh.fsf@nanos.tec.linutronix.de
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2020-02-20 16:40:38 -06:00
David S. Miller 0d5b8d7055 Merge branch 's390-fixes'
Julian Wiedmann says:

====================
s390/qeth: fixes 2020-02-20

please apply the following patch series for qeth to netdev's net tree.

This corrects three minor issues:
1) return a more fitting errno when VNICC cmds are not supported,
2) remove a bogus WARN in the NAPI code, and
3) be _very_ pedantic about the RX copybreak.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20 10:30:47 -08:00
Julian Wiedmann 54a61fbc02 s390/qeth: fix off-by-one in RX copybreak check
The RX copybreak is intended as the _max_ value where the frame's data
should be copied. So for frame_len == copybreak, don't build an SG skb.

Fixes: 4a71df5004 ("qeth: new qeth device driver")
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20 10:30:47 -08:00
Julian Wiedmann 420579dba1 s390/qeth: don't warn for napi with 0 budget
Calling napi->poll() with 0 budget is a legitimate use by netpoll.

Fixes: a1c3ed4c9c ("qeth: NAPI support for l2 and l3 discipline")
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20 10:30:47 -08:00
Alexandra Winter 6f3846f095 s390/qeth: vnicc Fix EOPNOTSUPP precedence
When getting or setting VNICC parameters, the error code EOPNOTSUPP
should have precedence over EBUSY.

EBUSY is used because vnicc feature and bridgeport feature are mutually
exclusive, which is a temporary condition.
Whereas EOPNOTSUPP indicates that the HW does not support all or parts of
the vnicc feature.
This issue causes the vnicc sysfs params to show 'blocked by bridgeport'
for HW that does not support VNICC at all.

Fixes: caa1f0b10d ("s390/qeth: add VNICC enable/disable support")
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20 10:30:47 -08:00
Kees Cook 16a556eeb7 openvswitch: Distribute switch variables for initialization
Variables declared in a switch statement before any case statements
cannot be automatically initialized with compiler instrumentation (as
they are not part of any execution flow). With GCC's proposed automatic
stack variable initialization feature, this triggers a warning (and they
don't get initialized). Clang's automatic stack variable initialization
(via CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also
doesn't initialize such variables[1]. Note that these warnings (or silent
skipping) happen before the dead-store elimination optimization phase,
so even when the automatic initializations are later elided in favor of
direct initializations, the warnings remain.

To avoid these problems, move such variables into the "case" where
they're used or lift them up into the main function body.

net/openvswitch/flow_netlink.c: In function ‘validate_set’:
net/openvswitch/flow_netlink.c:2711:29: warning: statement will never be executed [-Wswitch-unreachable]
 2711 |  const struct ovs_key_ipv4 *ipv4_key;
      |                             ^~~~~~~~

[1] https://bugs.llvm.org/show_bug.cgi?id=44916

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20 10:00:19 -08:00
Kees Cook 46d30cb104 net: ip6_gre: Distribute switch variables for initialization
Variables declared in a switch statement before any case statements
cannot be automatically initialized with compiler instrumentation (as
they are not part of any execution flow). With GCC's proposed automatic
stack variable initialization feature, this triggers a warning (and they
don't get initialized). Clang's automatic stack variable initialization
(via CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also
doesn't initialize such variables[1]. Note that these warnings (or silent
skipping) happen before the dead-store elimination optimization phase,
so even when the automatic initializations are later elided in favor of
direct initializations, the warnings remain.

To avoid these problems, move such variables into the "case" where
they're used or lift them up into the main function body.

net/ipv6/ip6_gre.c: In function ‘ip6gre_err’:
net/ipv6/ip6_gre.c:440:32: warning: statement will never be executed [-Wswitch-unreachable]
  440 |   struct ipv6_tlv_tnl_enc_lim *tel;
      |                                ^~~

net/ipv6/ip6_tunnel.c: In function ‘ip6_tnl_err’:
net/ipv6/ip6_tunnel.c:520:32: warning: statement will never be executed [-Wswitch-unreachable]
  520 |   struct ipv6_tlv_tnl_enc_lim *tel;
      |                                ^~~

[1] https://bugs.llvm.org/show_bug.cgi?id=44916

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20 10:00:19 -08:00
Kees Cook 161d179261 net: core: Distribute switch variables for initialization
Variables declared in a switch statement before any case statements
cannot be automatically initialized with compiler instrumentation (as
they are not part of any execution flow). With GCC's proposed automatic
stack variable initialization feature, this triggers a warning (and they
don't get initialized). Clang's automatic stack variable initialization
(via CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also
doesn't initialize such variables[1]. Note that these warnings (or silent
skipping) happen before the dead-store elimination optimization phase,
so even when the automatic initializations are later elided in favor of
direct initializations, the warnings remain.

To avoid these problems, move such variables into the "case" where
they're used or lift them up into the main function body.

net/core/skbuff.c: In function ‘skb_checksum_setup_ip’:
net/core/skbuff.c:4809:7: warning: statement will never be executed [-Wswitch-unreachable]
 4809 |   int err;
      |       ^~~

[1] https://bugs.llvm.org/show_bug.cgi?id=44916

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20 10:00:19 -08:00
Kees Cook 9038ec99ce x86/xen: Distribute switch variables for initialization
Variables declared in a switch statement before any case statements
cannot be automatically initialized with compiler instrumentation (as
they are not part of any execution flow). With GCC's proposed automatic
stack variable initialization feature, this triggers a warning (and they
don't get initialized). Clang's automatic stack variable initialization
(via CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also
doesn't initialize such variables[1]. Note that these warnings (or silent
skipping) happen before the dead-store elimination optimization phase,
so even when the automatic initializations are later elided in favor of
direct initializations, the warnings remain.

To avoid these problems, move such variables into the "case" where
they're used or lift them up into the main function body.

arch/x86/xen/enlighten_pv.c: In function ‘xen_write_msr_safe’:
arch/x86/xen/enlighten_pv.c:904:12: warning: statement will never be executed [-Wswitch-unreachable]
  904 |   unsigned which;
      |            ^~~~~

[1] https://bugs.llvm.org/show_bug.cgi?id=44916

Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20200220062318.69299-1-keescook@chromium.org
Reviewed-by: Juergen Gross <jgross@suse.com>
[boris: made @which an 'unsigned int']
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2020-02-20 10:16:40 -06:00
Catalin Marinas dcde237319 mm: Avoid creating virtual address aliases in brk()/mmap()/mremap()
Currently the arm64 kernel ignores the top address byte passed to brk(),
mmap() and mremap(). When the user is not aware of the 56-bit address
limit or relies on the kernel to return an error, untagging such
pointers has the potential to create address aliases in user-space.
Passing a tagged address to munmap(), madvise() is permitted since the
tagged pointer is expected to be inside an existing mapping.

The current behaviour breaks the existing glibc malloc() implementation
which relies on brk() with an address beyond 56-bit to be rejected by
the kernel.

Remove untagging in the above functions by partially reverting commit
ce18d171cb ("mm: untag user pointers in mmap/munmap/mremap/brk"). In
addition, update the arm64 tagged-address-abi.rst document accordingly.

Link: https://bugzilla.redhat.com/1797052
Fixes: ce18d171cb ("mm: untag user pointers in mmap/munmap/mremap/brk")
Cc: <stable@vger.kernel.org> # 5.4.x-
Cc: Florian Weimer <fweimer@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Victor Stinner <vstinner@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
2020-02-20 10:03:14 +00:00