OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Andre Przywara	fd1d0ddf2a	KVM: arm/arm64: check IRQ number on userland injection When userland injects a SPI via the KVM_IRQ_LINE ioctl we currently only check it against a fixed limit, which historically is set to 127. With the new dynamic IRQ allocation the effective limit may actually be smaller (64). So when now a malicious or buggy userland injects a SPI in that range, we spill over on our VGIC bitmaps and bytemaps memory. I could trigger a host kernel NULL pointer dereference with current mainline by injecting some bogus IRQ number from a hacked kvmtool: ----------------- .... DEBUG: kvm_vgic_inject_irq(kvm, cpu=0, irq=114, level=1) DEBUG: vgic_update_irq_pending(kvm, cpu=0, irq=114, level=1) DEBUG: IRQ #114 still in the game, writing to bytemap now... Unable to handle kernel NULL pointer dereference at virtual address 00000000 pgd = ffffffc07652e000 [00000000] pgd=00000000f658b003, pud=00000000f658b003, *pmd=0000000000000000 Internal error: Oops: 96000006 [#1] PREEMPT SMP Modules linked in: CPU: 1 PID: 1053 Comm: lkvm-msi-irqinj Not tainted 4.0.0-rc7+ #3027 Hardware name: FVP Base (DT) task: ffffffc0774e9680 ti: ffffffc0765a8000 task.ti: ffffffc0765a8000 PC is at kvm_vgic_inject_irq+0x234/0x310 LR is at kvm_vgic_inject_irq+0x30c/0x310 pc : [<ffffffc0000ae0a8>] lr : [<ffffffc0000ae180>] pstate: 80000145 ..... So this patch fixes this by checking the SPI number against the actual limit. Also we remove the former legacy hard limit of 127 in the ioctl code. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> CC: <stable@vger.kernel.org> # 4.0, 3.19, 3.18 [maz: wrap KVM_ARM_IRQ_GIC_MAX with #ifndef __KERNEL__, as suggested by Christopher Covington] Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>	2015-04-22 15:42:24 +01:00
Eric Auger	0b3289ebc2	KVM: arm: irqfd: fix value returned by kvm_irq_map_gsi irqfd/arm currently does not support routing. kvm_irq_map_gsi is supposed to return all the routing entries associated with the provided gsi and return the number of those entries. We should return 0 at this point. Signed-off-by: Eric Auger <eric.auger@linaro.org> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>	2015-04-22 15:37:54 +01:00
Wolfram Sang	cf82f52d36	watchdog: stmp3xxx_rtc_wdt: fix broken email address My Pengutronix address is not valid anymore, redirect people to the Pengutronix kernel team. Reported-by: Harald Geyer <harald@ccbib.org> Signed-off-by: Wolfram Sang <wsa@the-dreams.de> Acked-by: Robert Schwebel <r.schwebel@pengutronix.de> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>	2015-04-22 15:30:45 +02:00
Wolfram Sang	e8cc536657	watchdog: pnx4008_wdt: fix broken email address My Pengutronix address is not valid anymore, redirect people to the Pengutronix kernel team. Reported-by: Harald Geyer <harald@ccbib.org> Signed-off-by: Wolfram Sang <wsa@the-dreams.de> Acked-by: Robert Schwebel <r.schwebel@pengutronix.de> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>	2015-04-22 15:30:41 +02:00
Aaro Koskinen	3a30c07e71	watchdog: octeon: use fixed length string for register names Use fixed length string for register names. This saves 416 bytes in text size. Signed-off-by: Aaro Koskinen <aaro.koskinen@iki.fi> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>	2015-04-22 15:28:40 +02:00
Aaro Koskinen	8692cf0ad3	watchdog: octeon: fix some trivial coding style issues Fix some trivial coding style issues to reduce noise from static analyzers. Signed-off-by: Aaro Koskinen <aaro.koskinen@iki.fi> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>	2015-04-22 15:28:35 +02:00
Aaro Koskinen	3d588c93c0	watchdog: octeon: convert to WATCHDOG_CORE API Convert OCTEON watchdog to WATCHDOG_CORE API. This enables support for multiple watchdogs on OCTEON boards. Signed-off-by: Aaro Koskinen <aaro.koskinen@iki.fi> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>	2015-04-22 15:28:31 +02:00
Michal Simek	6290d8c826	watchdog: cadence: Remove Kconfig dependency on ARCH Remove Kconfig dependency and enable driver for all ARCHs. Signed-off-by: Michal Simek <michal.simek@xilinx.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>	2015-04-22 15:28:21 +02:00
Mathieu Olivari	cf79fb14d0	ARM: msm: add watchdog entries to DT timer binding doc The watchdog has been reworked to use the same DT node as the timer. This change is updating the device tree doc accordingly. Signed-off-by: Mathieu Olivari <mathieu@codeaurora.org> Acked-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>	2015-04-22 15:28:16 +02:00
Mathieu Olivari	4ba1c98b55	ARM: qcom: add description of KPSS WDT for IPQ8064 Add the watchdog related entries to the Krait Processor Sub-system (KPSS) timer IPQ8064 devicetree section. Also, add a fixed-clock description of SLEEP_CLK, which will do for now. Signed-off-by: Josh Cartwright <joshc@codeaurora.org> Signed-off-by: Mathieu Olivari <mathieu@codeaurora.org> Reviewed-by: Stephen Boyd <sboyd@codeaurora.org> Acked-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>	2015-04-22 15:28:11 +02:00
Mathieu Olivari	0dfd582e02	watchdog: qcom: use timer devicetree binding MSM watchdog configuration happens in the same register block as the timer, so we'll use the same binding as the existing timer. The qcom-wdt will now be probed when devicetree has an entry compatible with "qcom,kpss-timer" or "qcom-scss-timer". Signed-off-by: Mathieu Olivari <mathieu@codeaurora.org> Reviewed-by: Stephen Boyd <sboyd@codeaurora.org> Acked-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>	2015-04-22 15:27:47 +02:00
Joe Perches	e1dbde2960	watchdog: bcm281xx: Remove use of seq_printf return value The seq_printf return value, because it's frequently misused, will eventually be converted to void. See: commit `1f33c41c03` ("seq_file: Rename seq_overflow() to seq_has_overflowed() and make public") Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>	2015-04-22 15:27:39 +02:00
Paul Gortmaker	4a3893d069	modpost: don't emit section mismatch warnings for compiler optimizations Currently an allyesconfig build [gcc-4.9.1] can generate the following: WARNING: vmlinux.o(.text.unlikely+0x3864): Section mismatch in reference from the function cpumask_empty.constprop.3() to the variable .init.data:nmi_ipi_mask which comes from the cpumask_empty usage in arch/x86/kernel/nmi_selftest.c. Normally we would not see a symbol entry for cpumask_empty since it is: static inline bool cpumask_empty(const struct cpumask *srcp) however in this case, the variant of the symbol gets emitted when GCC does constant propagation optimization. Fix things up so that any locally optimized constprop variants don't warn when accessing variables that live in the __init sections. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2015-04-22 17:31:34 +09:30
Paul Gortmaker	09c20c032b	modpost: expand pattern matching to support substring matches Currently the match() function supports a leading * to match any prefix and a trailing * to match any suffix. However there currently is not a combination of both that can be used to target matches of whole families of functions that share a common substring. Here we expand the *foo and foo* match to also support *foo* with the goal of targeting compiler generated symbol names that contain strings like ".constprop." and ".isra." Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2015-04-22 17:31:33 +09:30
Quentin Casasnovas	c5c3439af0	modpost: do not try to match the SHT_NUL section. Trying to match the SHT_NUL section isn't useful and causes build failures on parisc and mn10300 since the addition of section strict white-listing and __ex_table sanitizing. Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com> Reported-by: Guenter Roeck <linux@roeck-us.net> Fixes: `050e57fd59` ("modpost: add strict white-listing when referencing....") Fixes: `52dc0595d5` ("modpost: handle relocations mismatch in __ex_table.") Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2015-04-22 17:31:33 +09:30
Quentin Casasnovas	e84048aa17	modpost: fix extable entry size calculation. As Guenter pointed out, we were never really calculating the extable entry size because the pointer arithmetic was simply wrong. We want to check we're handling the second relocation in __ex_table to infer an entry size, but we were using (void) pointers instead of Elf_Rel[a] ones. This fixes the problem by moving that check in the caller (since we can deal with different types of relocations) and add is_second_extable_reloc() to make the whole thing more readable. Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com> Reported-by: Guenter Roeck <linux@roeck-us.net> CC: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2015-04-22 17:31:32 +09:30
Quentin Casasnovas	d3df4de7eb	modpost: fix inverted logic in is_extable_fault_address(). As Guenter pointed out, we want to assert that extable_entry_size has been discovered and not the other way around. Moreover, this sanity check is only valid when we're not dealing with the first relocation in __ex_table, since we have not discovered the extable entry size at that point. This was leading to a divide-by-zero on some architectures and make the build fail. Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com> Reported-by: Guenter Roeck <linux@roeck-us.net> CC: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2015-04-22 17:31:31 +09:30
Rusty Russell	6c730bfc89	modpost: handle -ffunction-sections `52dc0595d5` introduced OTHER_TEXT_SECTIONS for identifying what sections could validly have __ex_table entries. Unfortunately, it wasn't tested with -ffunction-sections, which some architectures use. Reported-by: kbuild test robot <fengguang.wu@intel.com> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2015-04-22 17:31:31 +09:30
Thierry Reding	d7e0abcf4c	modpost: Whitelist .text.fixup and .exception.text 32-bit and 64-bit ARM use these sections to store executable code, so they must be whitelisted in modpost's table of valid text sections. Signed-off-by: Thierry Reding <treding@nvidia.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2015-04-22 17:31:20 +09:30
Vinod Koul	cdde0e61cf	dmaengine: dw: don't prompt for DW_DMAC_CORE DW_DMAC_CORE is selected by the PCI or platform driver, so this symbol shouldn't be user selectable. Remove the prompt. Signed-off-by: Vinod Koul <vinod.koul@intel.com>	2015-04-22 12:24:13 +05:30
Linus Torvalds	db4fd9c5d0	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc Pull sparc fixes from David Miller: 1) ldc_alloc_exp_dring() can be called from softints, so use GFP_ATOMIC. From Sowmini Varadhan. 2) Some minor warning/build fixups for the new iommu-common code on certain archs and with certain debug options enabled. Also from Sowmini Varadhan. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc: sparc: Use GFP_ATOMIC in ldc_alloc_exp_dring() as it can be called in softirq context sparc64: Use M7 PMC write on all chips T4 and onward. iommu-common: rename iommu_pool_hash to iommu_hash_common iommu-common: fix x86_64 compiler warnings	2015-04-21 23:21:34 -07:00
Linus Torvalds	8aaa51b63c	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: "Just a few fixes trickling in at this point. 1) If we see an attached socket on an skb in the ipv4 forwarding path, bail. This can happen due to races with FIB rule addition and deletion, and we should just drop such frames. From Sebastian Pöhn. 2) pppoe receive should only accept packets destined for this host's MAC address. From Joakim Tjernlund. 3) Handle checksum unwrapping properly in ppp receive when it's encapsulated in UDP in some way, fix from Tom Herbert. 4) Fix some bugs in mv88e6xxx DSA driver resulting from the conversion from register offset constants to mnemonic macros. From Vivien Didelot. 5) Fix handling of HCA max message size in mlx4 adapters, from Eran Ben Elisha" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: net/mlx4_core: Fix reading HCA max message size in mlx4_QUERY_DEV_CAP tcp: add memory barriers to write space paths altera tse: Error-Bit on tx-avalon-stream always set. net: dsa: mv88e6xxx: use PORT_DEFAULT_VLAN net: dsa: mv88e6xxx: fix setup of port control 1 ppp: call skb_checksum_complete_unset in ppp_receive_frame net: add skb_checksum_complete_unset pppoe: Lacks DST MAC address check ip_forward: Drop frames with attached skb->sk	2015-04-21 22:37:27 -07:00
Olof Johansson	48c1078509	Merge tag 'omap-for-v4.1/prcm-dts-mfd-syscon-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap into next/late Merge "urgent omap boot fix for v4.1 if MFD_SYSCON is not set" from Tony Lindgren: Urgent pull request for v4.1 to fix booting for custom kernel .config files that do not have MFD_SYSCON set. Omaps now have a dependency on MFD_SYSCON for the system control module generic register area and some clocks with the changes done in omap-for-v4.1/prcm-dts branch. This can be pulled on top of omap-for-v4.1/prcm-dts, or into fixes for v4.1. We already do have a slight MFD_SYSCON dependency for REGULATOR_PBIAS for dual voltage MMC cards on the first MMC bus for many devices, so from that point of view this can also be merged separately from omap-for-v4.1/prcm-dts. * tag 'omap-for-v4.1/prcm-dts-mfd-syscon-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap: ARM: OMAP2+: Fix booting with configs that don't have MFD_SYSCON Signed-off-by: Olof Johansson <olof@lixom.net>	2015-04-21 21:45:15 -07:00
Chris Bainbridge	6b5eab5469	ACPI / EC: fix NULL pointer dereference in acpi_ec_remove_query_handler() Use list_for_each_entry_safe for iterating because handler may be freed in the loop. BUG: unable to handle kernel NULL pointer dereference at 000000000000002c IP: [<ffffffff814d69c8>] acpi_ec_put_query_handler+0x7/0x1a Call Trace: acpi_ec_remove_query_handler+0x87/0x97 acpi_smbus_hc_remove+0x2a/0x44 [sbshc] acpi_device_remove+0x7b/0x9a __device_release_driver+0x7e/0x110 driver_detach+0xb0/0xc0 bus_remove_driver+0x54/0xe0 driver_unregister+0x2b/0x60 acpi_bus_unregister_driver+0x10/0x12 acpi_smb_hc_driver_exit+0x10/0x12 [sbshc] SyS_delete_module+0x1b8/0x210 system_call_fastpath+0x12/0x6a Signed-off-by: Chris Bainbridge <chris.bainbridge@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2015-04-22 04:12:35 +02:00
Linus Torvalds	f614c8178b	Merge branch 'parisc-4.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux Pull parisc fixes from Helge Deller: "The patch by Guenter Roeck fixes the build on parisc which got broken because of commit `f24ffde432` ("parisc: expose number of page table levels on Kconfig level") and the patch from Matthew Wilcox converts our code to use the generic scatterlist.h header file" * 'parisc-4.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux: parisc: Replace PT_NLEVELS with CONFIG_PGTABLE_LEVELS parisc: Eliminate sg_virt_addr() and private scatterlist.h	2015-04-21 17:57:28 -07:00
Eric Mei	9ffc8f7cb9	md/raid5: don't do chunk aligned read on degraded array. When an array is degraded, reads that land on failed drives result in reading the rest of the data in the stripe, so a single sequential read can result in the same data being read twice. This patch avoids chunk aligned reads on a degraded array. The downside is that the stripe cache is involved, which means associated CPU overhead and an extra memory copy. Test results: the following tests were done on an enterprise storage node with Seagate 6T SAS drives and a Xeon E5-2648L CPU (10 cores, 1.9GHz), 10 disks MD RAID6 8+2, chunk size 128 KiB. I used FIO with direct-io, various bs sizes and enough queue depth, and tested sequential and 100% random read against 3 array configs: 1) optimal, as baseline; 2) degraded; 3) degraded with this patch. Kernel version is 4.0-rc3. I ran each individual test only once so there might be some variation, but we just focus on the big trend. Sequential Read, bs (KiB): optimal / degraded / degraded-with-patch (MiB/s): 1024: 1608 / 656 / 995; 512: 1624 / 710 / 956; 256: 1635 / 728 / 980; 128: 1636 / 771 / 983; 64: 1612 / 1119 / 1000; 32: 1580 / 1420 / 1004; 16: 1368 / 688 / 986; 8: 768 / 647 / 953; 4: 411 / 413 / 850. Random Read, bs (KiB): optimal / degraded / degraded-with-patch (IOPS): 1024: 163 / 160 / 156; 512: 274 / 273 / 272; 256: 426 / 428 / 424; 128: 576 / 592 / 591; 64: 726 / 724 / 726; 32: 849 / 848 / 837; 16: 900 / 970 / 971; 8: 927 / 940 / 929; 4: 948 / 940 / 955. Some notes: * In sequential + optimal, as the bs size gets smaller, the FIO thread becomes CPU bound. * In sequential + degraded, there's a big increase when bs is 64K and 32K; I don't have an explanation. * In sequential + degraded-with-patch, the MD thread mostly becomes CPU bound. If you want to, we can discuss specific data points in those data. But in general it seems that with this patch we have more predictable and in most cases significantly better sequential read performance when the array is degraded, and almost no noticeable impact on random read. Performance is a complicated thing; the patch works well for this particular configuration, but may not be universal. For example I imagine testing on an all-SSD array may have very different results. But I personally think in most cases IO bandwidth is a more scarce resource than CPU. Signed-off-by: Eric Mei <eric.mei@seagate.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:43 +10:00
NeilBrown	edbe83ab4c	md/raid5: allow the stripe_cache to grow and shrink. The default setting of 256 stripe_heads is probably much too small for many configurations. So it is best to make it auto-configure. Shrinking the cache under memory pressure is easy. The only interesting part here is that we put a fairly high cost ('seeks') on shrinking the cache as the cost is greater than just having to read more data, it reduces parallelism. Growing the cache on demand needs to be done carefully. If we allow fast growth, that can upset memory balance as lots of dirty memory can quickly turn into lots of memory queued in the stripe_cache. It is important for the raid5 block device to appear congested to allow write-throttling to work. So we only add stripes slowly. We set a flag when an allocation fails because all stripes are in use, allocate at a convenient time when that flag is set, and don't allow it to be set again until at least one stripe_head has been released for re-use. This means that a spurt of requests will only cause one stripe_head to be allocated, but a steady stream of requests will slowly increase the cache size - until memory pressure puts it back again. It could take hours to reach a steady state. The value written to, and displayed in, stripe_cache_size is used as a minimum. The cache can grow above this and shrink back down to it. The actual size is not directly visible, though it can be deduced to some extent by watching stripe_cache_active. Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:43 +10:00
NeilBrown	5423399a84	md/raid5: change ->inactive_blocked to a bit-flag. This allows us to easily add more (atomic) flags. Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:43 +10:00
NeilBrown	486f0644c3	md/raid5: move max_nr_stripes management into grow_one_stripe and drop_one_stripe Rather than adjusting max_nr_stripes whenever {grow,drop}_one_stripe() succeeds, do it inside the functions. Also choose the correct hash to handle next inside the functions. This removes duplication and will help with future new uses of {grow,drop}_one_stripe. This also fixes a minor bug where the "md/raid:%md: allocate XXkB" message always said "0kB". Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:42 +10:00
NeilBrown	a9683a795b	md/raid5: pass gfp_t arg to grow_one_stripe() This is needed for future improvement to stripe cache management. Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:42 +10:00
Markus Stockhausen	d06f191f8e	md/raid5: introduce configuration option rmw_level Depending on the available coding we allow optimized rmw logic for write operations. To support easier testing this patch allows manual control of the rmw/rcw decision through the interface /sys/block/mdX/md/rmw_level. The configuration can handle three levels of control. rmw_level=0: Disable rmw for all RAID types. Hardware assisted P/Q calculation has no implementation path yet to factor in/out chunks of a syndrome. Enforcing this level can be beneficial for slow CPUs with hardware syndrome support and fast SSDs. rmw_level=1: Estimate rmw IOs and rcw IOs. Execute rmw only if we will save IOs. This equals the "old" unpatched behaviour and will be the default. rmw_level=2: Execute rmw even if calculated IOs for rmw and rcw are equal. We might have higher CPU consumption because of calculating the parity twice but it can be beneficial otherwise. E.g. RAID4 with fast dedicated parity disk/SSD. The option is implemented just to be forward-looking and will ONLY work with this patch! Signed-off-by: Markus Stockhausen <stockhausen@collogia.de> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:42 +10:00
Markus Stockhausen	584acdd49c	md/raid5: activate raid6 rmw feature Glue it all together. The raid6 rmw path should work the same as the already existing raid5 logic. So emulate the prexor handling/flags and split functions as needed. 1) Enable xor_syndrome() in the async layer. 2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome at the start of a rmw run as we did it before for the single parity. 3) Take care of rmw run in ops_run_reconstruct6(). Again process only the changed pages to get syndrome back into sync. 4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw run. The lower layers will calculate start & end pages from that and call the xor_syndrome() correspondingly. 5) Adapt the several places where we ignored Q handling up to now. Performance numbers for a single E5630 system with a mix of 10 7200k desktop/server disks. 300 seconds random write with 8 threads onto a 3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4). bsize: rmw_level=1 skip_copy=1 / rmw_level=0 skip_copy=1 / rmw_level=1 skip_copy=0 / rmw_level=0 skip_copy=0: 4K: 115 KB/s / 141 KB/s / 165 KB/s / 140 KB/s; 8K: 225 KB/s / 275 KB/s / 324 KB/s / 274 KB/s; 16K: 434 KB/s / 536 KB/s / 640 KB/s / 534 KB/s; 32K: 751 KB/s / 1,051 KB/s / 1,234 KB/s / 1,045 KB/s; 64K: 1,339 KB/s / 1,958 KB/s / 2,282 KB/s / 1,962 KB/s; 128K: 2,673 KB/s / 3,862 KB/s / 4,113 KB/s / 3,898 KB/s; 256K: 7,685 KB/s / 7,539 KB/s / 7,557 KB/s / 7,638 KB/s; 512K: 19,556 KB/s / 19,558 KB/s / 19,652 KB/s / 19,688 KB/s. Signed-off-by: Markus Stockhausen <stockhausen@collogia.de> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:42 +10:00
Markus Stockhausen	a582564b24	md/raid6 algorithms: xor_syndrome() for SSE2 The second (and last) optimized XOR syndrome calculation. This version supports right and left side optimization. All CPUs with architecture older than Haswell will benefit from it. It should be noted that SSE2 movntdq kills performance for memory areas that are read and written simultaneously in chunks smaller than cache line size. So use movdqa instead for P/Q writes in sse21 and sse22 XOR functions. Signed-off-by: Markus Stockhausen <stockhausen@collogia.de> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:42 +10:00
Markus Stockhausen	9a5ce91d05	md/raid6 algorithms: xor_syndrome() for generic int Start the algorithms with the very basic one. It is left and right optimized. That means we can avoid all calculations for unneeded pages above the right stop offset. For pages below the left start offset we still need the syndrome multiplication but without reading data pages. Signed-off-by: Markus Stockhausen <stockhausen@collogia.de> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:42 +10:00
Markus Stockhausen	7e92e1d762	md/raid6 algorithms: improve test program It is always helpful to have a test tool in place if we implement new data critical algorithms. So add some test routines to the raid6 checker that can prove if the new xor_syndrome() works as expected. Run through all permutations of start/stop pages per algorithm and simulate a xor_syndrome() assisted rmw run. After each rmw check if the recovery algorithm still confirms that the stripe is fine. Signed-off-by: Markus Stockhausen <stockhausen@collogia.de> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:42 +10:00
Markus Stockhausen	fe5cbc6e06	md/raid6 algorithms: delta syndrome functions v3: s-o-b comment, explanation of performance and decision for the start/stop implementation Implementing rmw functionality for RAID6 requires optimized syndrome calculation. Up to now we can only generate a complete syndrome. The target P/Q pages are always overwritten. With this patch we provide a framework for inplace P/Q modification. In the first place simply fill those functions with NULL values. xor_syndrome() has two additional parameters: start & stop. These will indicate the first and last page that are changing during a rmw run. That makes it possible to avoid several unnecessary loops and speed up calculation. The caller needs to implement the following logic to make the functions work. 1) xor_syndrome(disks, start, stop, ...): "Remove" all data of source blocks inside P/Q between (and including) start and end. 2) modify any block with start <= block <= stop 3) xor_syndrome(disks, start, stop, ...): "Reinsert" all data of source blocks into P/Q between (and including) start and end. Pages between start and stop that won't be changed should be filled with a pointer to the kernel zero page. The reasons for not taking NULL pages are: 1) Algorithms cross the whole source data line by line. Thus avoid additional branches. 2) Having a NULL page avoids calculating the XOR P parity but still needs calculation steps for the Q parity. Depending on the algorithm unrolling that might be only a difference of 2 instructions per loop. The benchmark numbers of the gen_syndrome() functions are displayed in the kernel log. Do the same for the xor_syndrome() functions. This will help to analyze performance problems and give a rough estimate of how well the algorithm works. The choice of the fastest algorithm will still depend on the gen_syndrome() performance. With the start/stop page implementation the speed can vary a lot in real life. E.g. a change of page 0 & page 15 on a stripe will be harder to compute than the case where page 0 & page 1 are XOR candidates. So as not to be too enthusiastic about the expected speeds we will run a worst case test that simulates a change on the upper half of the stripe. So we do: 1) calculation of P/Q for the upper pages 2) continuation of Q for the lower (empty) pages Signed-off-by: Markus Stockhausen <stockhausen@collogia.de> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:41 +10:00
shli@kernel.org	dabc4ec6ba	raid5: handle expansion/resync case with stripe batching expansion/resync can grab a stripe when the stripe is in batch list. Since all stripes in batch list must be in the same state, we can't allow some stripes run into expansion/resync. So we delay expansion/resync for stripe in batch list. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:41 +10:00
shli@kernel.org	72ac733015	raid5: handle io error of batch list If io error happens in any stripe of a batch list, the batch list will be split, then normal process will run for the stripes in the list. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:41 +10:00
shli@kernel.org	59fc630b8b	RAID5: batch adjacent full stripe write stripe cache is 4k size. Even adjacent full stripe writes are handled in 4k units. Ideally we should use a bigger size for adjacent full stripe writes. A bigger stripe cache size means fewer stripes running in the state machine, which can reduce cpu overhead. A bigger size can also cause bigger IO sizes to be dispatched to the underlying disks. With this patch, we automatically batch adjacent full stripe writes together. Such stripes will be added to the batch list. Only the first stripe of the list will be put on handle_list and so run handle_stripe(). Some steps of handle_stripe() are extended to cover all stripes of the list, including ops_run_io, ops_run_biodrain and so on. With this patch, we have fewer stripes running in handle_stripe() and we send IO of the whole stripe list together to increase IO size. Stripes added to a batch list have some limitations. A batch list can only include full stripe writes and can't cross a chunk boundary, to make sure stripes have the same parity disks. Stripes in a batch list must be in the same state (no written, toread and so on). If a stripe is in a batch list, all new read/write to add_stripe_bio will be blocked to overlap conflict till the batch list is handled. The limitations will make sure stripes in a batch list are in exactly the same state throughout their life cycle. I did a test running 160k randwrite in a RAID5 array with 32k chunk size and 6 PCIe SSDs. This patch improves performance by around 30% and the IO size to the underlying disks is exactly 32k. I also ran a 4k randwrite test in the same array to make sure the performance isn't changed with the patch. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:41 +10:00
shli@kernel.org	7a87f43405	raid5: track overwrite disk count Track overwrite disk count, so we can know if a stripe is a full stripe write. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:41 +10:00
shli@kernel.org	da41ba6597	raid5: add a new flag to track if a stripe can be batched A freshly new stripe with write request can be batched. Any time the stripe is handled or new read is queued, the flag will be cleared. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:41 +10:00
shli@kernel.org	46d5b78562	raid5: use flex_array for scribble data Use flex_array for scribble data. Next patch will batch several stripes together, so scribble data should be able to cover several stripes, so this patch also allocates scribble data for stripes across a chunk. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:41 +10:00
Heinz Mauelshagen	753f2856cd	md raid0: access mddev->queue (request queue member) conditionally because it is not set when accessed from dm-raid The patch makes 3 references to mddev->queue in the raid0 personality conditional in order to allow for it to be accessed from dm-raid. Mandatory, because md instances underneath dm-raid don't manage a request queue of their own which'd lead to oopses without the patch. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Tested-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:41 +10:00
NeilBrown	ac8fa4196d	md: allow resync to go faster when there is competing IO. When md notices non-sync IO happening while it is trying to resync (or reshape or recover) it slows down to the set minimum. The default minimum might have made sense many years ago but the drives have become faster. Changing the default to match the times isn't really a long term solution. This patch changes the code so that instead of waiting until the speed has dropped to the target, it just waits until pending requests have completed. This means that the delay inserted is a function of the speed of the devices. Testing shows that: - for some loads, the resync speed is unchanged. For those loads increasing the minimum doesn't change the speed either. So this is a good result. To increase resync speed under such loads we would probably need to increase the resync window size. - for other loads, resync speed does increase to a reasonable fraction (e.g. 20%) of maximum possible, and throughput of the load only drops a little bit (e.g. 10%) - for other loads, throughput of the non-sync load drops quite a bit more. These seem to be latency-sensitive loads. So it isn't a perfect solution, but it is mostly an improvement. Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:40 +10:00
NeilBrown	09314799e4	md: remove 'go_faster' option from ->sync_request() This option is not well justified and testing suggests that it hardly ever makes any difference. The comment suggests there might be a need to wait for non-resync activity indicated by ->nr_waiting, however raise_barrier() already waits for all of that. So just remove it to simplify reasoning about speed limiting. This allows us to remove a 'FIXME' comment from raid5.c as that never used the flag. Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:40 +10:00
NeilBrown	50c37b136a	md: don't require sync_min to be a multiple of chunk_size. There is really no need for sync_min to be a multiple of chunk_size, and values read from here often aren't. That means you cannot read a value and expect to be able to write it back later. So remove the chunk_size check, and round down to a multiple of 4K, to be sure everything works with 4K-sector devices. Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:40 +10:00
NeilBrown	d51e4fe6d6	Merge branch 'cluster' into for-next	2015-04-22 08:00:20 +10:00
Goldwyn Rodrigues	97f6cd39da	md-cluster: re-add capabilities When "re-add" is written to /sys/block/mdXX/md/dev-YYY/state, the clustered md: 1. Sends RE_ADD message with the desc_nr. Nodes receiving the message clear the Faulty bit in their respective rdev->flags. 2. The node initiating re-add, gathers the bitmaps of all nodes and copies them into the local bitmap. It does not clear the bitmap from which it is copying. 3. Initiating node schedules a md recovery to sync the devices. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 07:59:39 +10:00
Goldwyn Rodrigues	a6da4ef85c	md: re-add a failed disk This adds the capability of re-adding a failed disk by writing "re-add" to /sys/block/mdXX/md/dev-YYY/state. This facilitates adding disks which have encountered a temporary error such as a network disconnection/hiccup in an iSCSI device, or a SAN cable disconnection which has been restored. In such a situation, you do not need to remove and re-add the device. Writing re-add to the failed device's state would add it again to the array and perform the recovery of only the blocks which were written after the device failed. This works for generic md, and is not related to clustering. However, this patch is to ease re-add operations listed above in clustering environments. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 07:59:39 +10:00
Goldwyn Rodrigues	88bcfef7be	md-cluster: remove capabilities This adds "remove" capabilities for the clustered environment. When a user initiates removal of a device from the array, a REMOVE message with disk number in the array is sent to all the nodes which kick the respective device in their own array. This facilitates the removal of failed devices. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 07:59:39 +10:00
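
Commit 50c37b136a above says sync_min is now rounded down to a multiple of 4K rather than checked against chunk_size. The arithmetic can be sketched as follows. This is only an illustration, not the code from the patch: the helper name `sync_min_round_down` and the constant `SECTORS_PER_4K` are hypothetical, and the sketch assumes the value is counted in 512-byte sectors, so that 4K corresponds to 8 sectors.

```c
#include <stdint.h>

/* Assumption: sync_min is expressed in 512-byte sectors, so 4 KiB = 8 sectors. */
#define SECTORS_PER_4K 8u

/*
 * Hypothetical helper (not taken from the kernel source): round a sector
 * count down to the nearest 4 KiB boundary, so that any value read back
 * can be written again unchanged and still suits 4K-sector devices.
 */
static uint64_t sync_min_round_down(uint64_t sectors)
{
    return sectors - (sectors % SECTORS_PER_4K);
}
```

Under these assumptions, a value that is already a multiple of 8 sectors passes through unchanged, and any other value is lowered to the previous 4K boundary regardless of the array's chunk size.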

518725 Commits All Branches Search
