Commit Graph

21 Commits

Author SHA1 Message Date
Helge Deller e55fcc821d crypto: xor - fix template benchmarking
[ Upstream commit ab9a244c396aae4aaa34b2399b82fc15ec2df8c1 ]

Commit c055e3eae0 ("crypto: xor - use ktime for template benchmarking")
switched from using jiffies to ktime-based performance benchmarking.

This works nicely on machines which have a fine-grained ktime()
clocksource as e.g. x86 machines with TSC.
But other machines, e.g. my 4-way HP PARISC server, don't have such
fine-grained clocksources, which is why it seems that 800 xor loops
take zero seconds, which then shows up in the logs as:

 xor: measuring software checksum speed
    8regs           : -1018167296 MB/sec
    8regs_prefetch  : -1018167296 MB/sec
    32regs          : -1018167296 MB/sec
    32regs_prefetch : -1018167296 MB/sec

Fix this with some small modifications to the existing code to improve
the algorithm to always produce correct results without introducing
major delays for architectures with a fine-grained ktime()
clocksource:
a) Delay start of the timing until ktime() just advanced. On machines
with a fast ktime() this should be just one additional ktime() call.
b) Count the number of loops. Run at minimum 800 loops and finish
earliest when the ktime() counter has progressed.

With that the throughput can now be calculated more accurately under all
conditions.

Fixes: c055e3eae0 ("crypto: xor - use ktime for template benchmarking")
Signed-off-by: Helge Deller <deller@gmx.de>
Tested-by: John David Anglin <dave.anglin@bell.net>

v2:
- clean up coding style (noticed & suggested by Herbert Xu)
- rephrased & fixed typo in commit message

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-10-04 16:28:49 +02:00
Linus Torvalds 31caf8b2a8 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto update from Herbert Xu:
 "API:
   - Restrict crypto_cipher to internal API users only.

  Algorithms:
   - Add x86 aesni acceleration for cts.
   - Improve x86 aesni acceleration for xts.
   - Remove x86 acceleration of some uncommon algorithms.
   - Remove RIPE-MD, Tiger and Salsa20.
   - Remove tnepres.
   - Add ARM acceleration for BLAKE2s and BLAKE2b.

  Drivers:
   - Add Keem Bay OCS HCU driver.
   - Add Marvell OcteonTX2 CPT PF driver.
   - Remove PicoXcell driver.
   - Remove mediatek driver"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (154 commits)
  hwrng: timeriomem - Use device-managed registration API
  crypto: hisilicon/qm - fix printing format issue
  crypto: hisilicon/qm - do not reset hardware when CE happens
  crypto: hisilicon/qm - update irqflag
  crypto: hisilicon/qm - fix the value of 'QM_SQC_VFT_BASE_MASK_V2'
  crypto: hisilicon/qm - fix request missing error
  crypto: hisilicon/qm - removing driver after reset
  crypto: octeontx2 - fix -Wpointer-bool-conversion warning
  crypto: hisilicon/hpre - enable Elliptic curve cryptography
  crypto: hisilicon - PASID fixed on Kunpeng 930
  crypto: hisilicon/qm - fix use of 'dma_map_single'
  crypto: hisilicon/hpre - tiny fix
  crypto: hisilicon/hpre - adapt the number of clusters
  crypto: cpt - remove casting dma_alloc_coherent
  crypto: keembay-ocs-aes - Fix 'q' assignment during CCM B0 generation
  crypto: xor - Fix typo of optimization
  hwrng: optee - Use device-managed registration API
  crypto: arm64/crc-t10dif - move NEON yield to C code
  crypto: arm64/aes-ce-mac - simplify NEON yield
  crypto: arm64/aes-neonbs - remove NEON yield calls
  ...
2021-02-21 17:23:56 -08:00
Bhaskar Chowdhury cfb28fde08 crypto: xor - Fix typo of optimization
s/optimzation/optimization/

Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-02-10 17:55:58 +11:00
Kirill Tkhai 3c02e04fd4 crypto: xor - Fix divide error in do_xor_speed()
crypto: Fix divide error in do_xor_speed()

From: Kirill Tkhai <ktkhai@virtuozzo.com>

Latest (but not only latest) linux-next panics with divide
error on my QEMU setup.

The patch at the bottom of this message fixes the problem.

xor: measuring software checksum speed
divide error: 0000 [#1] PREEMPT SMP KASAN
PREEMPT SMP KASAN
CPU: 3 PID: 1 Comm: swapper/0 Not tainted 5.10.0-next-20201223+ #2177
RIP: 0010:do_xor_speed+0xbb/0xf3
Code: 41 ff cc 75 b5 bf 01 00 00 00 e8 3d 23 8b fe 65 8b 05 f6 49 83 7d 85 c0 75 05 e8
 84 70 81 fe b8 00 00 50 c3 31 d2 48 8d 7b 10 <f7> f5 41 89 c4 e8 58 07 a2 fe 44 89 63 10 48 8d 7b 08
 e8 cb 07 a2
RSP: 0000:ffff888100137dc8 EFLAGS: 00010246
RAX: 00000000c3500000 RBX: ffffffff823f0160 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000808 RDI: ffffffff823f0170
RBP: 0000000000000000 R08: ffffffff8109c50f R09: ffffffff824bb6f7
R10: fffffbfff04976de R11: 0000000000000001 R12: 0000000000000000
R13: ffff888101997000 R14: ffff888101994000 R15: ffffffff823f0178
FS:  0000000000000000(0000) GS:ffff8881f7780000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000000220e000 CR4: 00000000000006a0
Call Trace:
 calibrate_xor_blocks+0x13c/0x1c4
 ? do_xor_speed+0xf3/0xf3
 do_one_initcall+0xc1/0x1b7
 ? start_kernel+0x373/0x373
 ? unpoison_range+0x3a/0x60
 kernel_init_freeable+0x1dd/0x238
 ? rest_init+0xc6/0xc6
 kernel_init+0x8/0x10a
 ret_from_fork+0x1f/0x30
---[ end trace 5bd3c1d0b77772da ]---

Fixes: c055e3eae0 ("crypto: xor - use ktime for template benchmarking")
Cc: <stable@vger.kernel.org>
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-08 15:37:55 +11:00
Nathan Chancellor 10b0f78a73 crypto: xor - Remove unused variable count in do_xor_speed
Clang warns:

crypto/xor.c:101:4: warning: variable 'count' is uninitialized when used
here [-Wuninitialized]
                        count++;
                        ^~~~~
crypto/xor.c:86:17: note: initialize the variable 'count' to silence
this warning
        int i, j, count;
                       ^
                        = 0
1 warning generated.

After the refactoring to use ktime that happened in this function, count
is only assigned, never read. Just remove the variable to get rid of the
warning.

Fixes: c055e3eae0 ("crypto: xor - use ktime for template benchmarking")
Link: https://github.com/ClangBuiltLinux/linux/issues/1171
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-10-08 16:38:06 +11:00
Ard Biesheuvel c055e3eae0 crypto: xor - use ktime for template benchmarking
Currently, we use the jiffies counter as a time source, by staring at
it until a HZ period elapses, and then staring at it again and perform
as many XOR operations as we can at the same time until another HZ
period elapses, so that we can calculate the throughput. This takes
longer than necessary, and depends on HZ, which is undesirable, since
HZ is system dependent.

Let's use the ktime interface instead, and use it to time a fixed
number of XOR operations, which can be done much faster, and makes
the time spent depend on the performance level of the system itself,
which is much more reasonable. To ensure that we have the resolution
we need even on systems with 32 kHz time sources, while not spending too
much time in the benchmark on a slow CPU, let's switch to 3 attempts of
800 repetitions each: that way, we will only misidentify algorithms that
perform within 10% of each other as the fastest if they are faster than
10 GB/s to begin with, which is not expected to occur on systems with
such coarse clocks.

On ThunderX2, I get the following results:

Before:

  [72625.956765] xor: measuring software checksum speed
  [72625.993104]    8regs     : 10169.000 MB/sec
  [72626.033099]    32regs    : 12050.000 MB/sec
  [72626.073095]    arm64_neon: 11100.000 MB/sec
  [72626.073097] xor: using function: 32regs (12050.000 MB/sec)

After:

  [72599.650216] xor: measuring software checksum speed
  [72599.651188]    8regs           : 10491 MB/sec
  [72599.652006]    32regs          : 12345 MB/sec
  [72599.652871]    arm64_neon      : 11402 MB/sec
  [72599.652873] xor: using function: 32regs (12345 MB/sec)

Link: https://lore.kernel.org/linux-crypto/20200923182230.22715-3-ardb@kernel.org/
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-10-02 18:02:14 +10:00
Ard Biesheuvel 524ccdbdfb crypto: xor - defer load time benchmark to a later time
Currently, the XOR module performs its boot time benchmark at core
initcall time when it is built-in, to ensure that the RAID code can
make use of it when it is built-in as well.

Let's defer this to a later stage during the boot, to avoid impacting
the overall boot time of the system. Instead, just pick an arbitrary
implementation from the list, and use that as the preliminary default.

Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-10-02 18:02:14 +10:00
Thomas Gleixner af1a8899d2 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 47
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 or at your option any
  later version you should have received a copy of the gnu general
  public license for example usr src linux copying if not write to the
  free software foundation inc 675 mass ave cambridge ma 02139 usa

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 20 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520170858.552543146@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24 17:27:13 +02:00
Levin, Alexander (Sasha Levin) 75f296d93b kmemcheck: stop using GFP_NOTRACK and SLAB_NOTRACK
Convert all allocations that used a NOTRACK flag to stop using it.

Link: http://lkml.kernel.org/r/20171007030159.22241-3-alexander.levin@verizon.com
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Hansen <devtimhansen@gmail.com>
Cc: Vegard Nossum <vegardno@ifi.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:04 -08:00
Herbert Xu 27c4d548af crypto: xor - Fix warning when XOR_SELECT_TEMPLATE is unset
This patch fixes an unused label warning triggered when the macro
XOR_SELECT_TEMPLATE is not set.

Fixes: 39457acda9 ("crypto: xor - skip speed test if the xor...")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Suggested-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-08-31 23:00:48 +08:00
Martin Schwidefsky 39457acda9 crypto: xor - skip speed test if the xor function is selected automatically
If the architecture selected the xor function with XOR_SELECT_TEMPLATE
the speed result of the do_xor_speed benchmark is of limited value.
The speed measurement increases the bootup time a little, which can
makes a difference for kernels used in container like virtual machines.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-08-24 21:04:48 +08:00
Jan Beulich af7cf25dd1 add further __init annotations to crypto/xor.c
Allow particularly do_xor_speed() to be discarded post-init.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 13:42:32 +11:00
Linus Torvalds c80ddb5263 md updates for 3.5
Main features:
  - RAID10 arrays can be reshapes - adding and removing devices and
    changing chunks (not 'far' array though)
  - allow RAID5 arrays to be reshaped with a backup file (not tested
    yet, but the priciple works fine for RAID10).
  - arrays can be reshaped while a bitmap is present - you no longer
    need to remove it first
  - SSSE3 support for RAID6 syndrome calculations
 
 and of course a number of minor fixes etc.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQIVAwUAT7xXijnsnt1WYoG5AQLvFg/+OGeptY2cRu3HpsNsibvIyfiOYSlDpLo+
 2tYzBz2wFiFROfj41aV/PdeqE3xn/RelDmIgt9Apaimeg453O6IdjI9X00fPrgxV
 ATWkwWy5ykozbLIsyJYQ/kLPo0NX2KR/TtEim2lwlEjs4bLsF8TGvRa6ylcko0zI
 j6cbqVzkCDHXzLk/M6l0UoUaSG1PcjO6M10KBM7bS2sLoxhkn69gT7YTIlFySXW4
 epNYSTKyeuSmEUI7L09s5HLf/zPZSp4MipoRIqQYcwk5gvmMNNuLbouDECvZ5BdV
 TXxrVVSlh7tFSeoGwYXQXcv/nFg3n53Mc+Nimzo7hhmI5ytRR9Y0c6SwvRBCN7t6
 HzapQu+vBqDIPzedH+6r/gk39Auzm60JjGDYHiSdjZCAWefcYUmYm/Iso9JJ/0hg
 PVkSfnkgaFUx0GhXS+C9YgPHYlb5DnTCCMrbtQCL65D61D2det3oZtrQPfKIKMlw
 SRz2Ls+4o4UhAY7JLYNhONa0mtxhk5VTZ3LH58I9+ZurVyvqrjvCV+neSiCUsRog
 jT038/gT5nJ8HPsg5feQ9cS0TbEo92eg3gILy1D5cPTaMZhrV8gq0Ke7xgmBo0+Q
 bWh4vxU9SM/96c/umCxcmHymKAFhsMVFbJTg4r9K5atFGNyMegJYedFFEEbQMQI3
 u+KRDXHN700=
 =q8bc
 -----END PGP SIGNATURE-----

Merge tag 'md-3.5' of git://neil.brown.name/md

Pull md updates from NeilBrown:
 "It's been a busy cycle for md - lots of fun stuff here..  if you like
  this kind of thing :-)

  Main features:
   - RAID10 arrays can be reshaped - adding and removing devices and
     changing chunks (not 'far' array though)
   - allow RAID5 arrays to be reshaped with a backup file (not tested
     yet, but the priciple works fine for RAID10).
   - arrays can be reshaped while a bitmap is present - you no longer
     need to remove it first
   - SSSE3 support for RAID6 syndrome calculations

  and of course a number of minor fixes etc."

* tag 'md-3.5' of git://neil.brown.name/md: (56 commits)
  md/bitmap: record the space available for the bitmap in the superblock.
  md/raid10: Remove extras after reshape to smaller number of devices.
  md/raid5: improve removal of extra devices after reshape.
  md: check the return of mddev_find()
  MD RAID1: Further conditionalize 'fullsync'
  DM RAID: Use md_error() in place of simply setting Faulty bit
  DM RAID: Record and handle missing devices
  DM RAID: Set recovery flags on resume
  md/raid5: Allow reshape while a bitmap is present.
  md/raid10: resize bitmap when required during reshape.
  md: allow array to be resized while bitmap is present.
  md/bitmap: make sure reshape request are reflected in superblock.
  md/bitmap: add bitmap_resize function to allow bitmap resizing.
  md/bitmap: use DIV_ROUND_UP instead of open-code
  md/bitmap: create a 'struct bitmap_counts' substructure of 'struct bitmap'
  md/bitmap: make bitmap bitops atomic.
  md/bitmap: make _page_attr bitops atomic.
  md/bitmap: merge bitmap_file_unmap and bitmap_file_put.
  md/bitmap: remove async freeing of bitmap file.
  md/bitmap: convert some spin_lock_irqsave to spin_lock_irq
  ...
2012-05-23 17:08:40 -07:00
Jim Kukunas 56a519913e crypto: disable preemption while benchmarking RAID5 xor checksumming
With CONFIG_PREEMPT=y, we need to disable preemption while benchmarking
RAID5 xor checksumming to ensure we're actually measuring what we think
we're measuring.

Signed-off-by: Jim Kukunas <james.t.kukunas@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:54:04 +10:00
Jim Kukunas 6a328475cc crypto: wait for a full jiffy in do_xor_speed
In the existing do_xor_speed(), there is no guarantee that we actually
run do_2() for a full jiffy. We get the current jiffy, then run do_2()
until the next jiffy.

Instead, let's get the current jiffy, then wait until the next jiffy
to start our test.

Signed-off-by: Jim Kukunas <james.t.kukunas@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:54:03 +10:00
Borislav Petkov d788fec855 crypto, xor: Sanitize checksumming function selection output
Currently, it says

[    1.015541] xor: automatically using best checksumming function: generic_sse
[    1.040769]    generic_sse:  6679.000 MB/sec
[    1.045377] xor: using function: generic_sse (6679.000 MB/sec)

and repeats the function name three times unnecessarily. Change it into

[    1.015115] xor: automatically using best checksumming function:
[    1.040794]    generic_sse:  6680.000 MB/sec

and save us a line in dmesg.

No functional change.

Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-04-09 15:21:16 +08:00
Tejun Heo 5a0e3ad6af include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and try to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   wildly available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-30 22:02:32 +09:00
Vegard Nossum 33f65df7ed crypto: don't track xor test pages with kmemcheck
The xor tests are run on uninitialized data, because it is doesn't
really matter what the underlying data is. Annotate this false-
positive warning.

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
2009-06-15 12:40:10 +02:00
NeilBrown bff61975b3 md: move lots of #include lines out of .h files and into .c
This makes the includes more explicit, and is preparation for moving
md_k.h to drivers/md/md.h

Remove include/raid/md.h as its only remaining use was to #include
other files.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-03-31 14:33:13 +11:00
Dan Williams 9bc89cd82d async_tx: add the async_tx api
The async_tx api provides methods for describing a chain of asynchronous
bulk memory transfers/transforms with support for inter-transactional
dependencies.  It is implemented as a dmaengine client that smooths over
the details of different hardware offload engine implementations.  Code
that is written to the api can optimize for asynchronous operation and the
api will fit the chain of operations to the available offload resources. 
 
	I imagine that any piece of ADMA hardware would register with the
	'async_*' subsystem, and a call to async_X would be routed as
	appropriate, or be run in-line. - Neil Brown

async_tx exploits the capabilities of struct dma_async_tx_descriptor to
provide an api of the following general format:

struct dma_async_tx_descriptor *
async_<operation>(..., struct dma_async_tx_descriptor *depend_tx,
			dma_async_tx_callback cb_fn, void *cb_param)
{
	struct dma_chan *chan = async_tx_find_channel(depend_tx, <operation>);
	struct dma_device *device = chan ? chan->device : NULL;
	int int_en = cb_fn ? 1 : 0;
	struct dma_async_tx_descriptor *tx = device ?
		device->device_prep_dma_<operation>(chan, len, int_en) : NULL;

	if (tx) { /* run <operation> asynchronously */
		...
		tx->tx_set_dest(addr, tx, index);
		...
		tx->tx_set_src(addr, tx, index);
		...
		async_tx_submit(chan, tx, flags, depend_tx, cb_fn, cb_param);
	} else { /* run <operation> synchronously */
		...
		<operation>
		...
		async_tx_sync_epilog(flags, depend_tx, cb_fn, cb_param);
	}

	return tx;
}

async_tx_find_channel() returns a capable channel from its pool.  The
channel pool is organized as a per-cpu array of channel pointers.  The
async_tx_rebalance() routine is tasked with managing these arrays.  In the
uniprocessor case async_tx_rebalance() tries to spread responsibility
evenly over channels of similar capabilities.  For example if there are two
copy+xor channels, one will handle copy operations and the other will
handle xor.  In the SMP case async_tx_rebalance() attempts to spread the
operations evenly over the cpus, e.g. cpu0 gets copy channel0 and xor
channel0 while cpu1 gets copy channel 1 and xor channel 1.  When a
dependency is specified async_tx_find_channel defaults to keeping the
operation on the same channel.  A xor->copy->xor chain will stay on one
channel if it supports both operation types, otherwise the transaction will
transition between a copy and a xor resource.

Currently the raid5 implementation in the MD raid456 driver has been
converted to the async_tx api.  A driver for the offload engines on the
Intel Xscale series of I/O processors, iop-adma, is provided in a later
commit.  With the iop-adma driver and async_tx, raid456 is able to offload
copy, xor, and xor-zero-sum operations to hardware engines.
 
On iop342 tiobench showed higher throughput for sequential writes (20 - 30%
improvement) and sequential reads to a degraded array (40 - 55%
improvement).  For the other cases performance was roughly equal, +/- a few
percentage points.  On a x86-smp platform the performance of the async_tx
implementation (in synchronous mode) was also +/- a few percentage points
of the original implementation.  According to 'top' on iop342 CPU
utilization drops from ~50% to ~15% during a 'resync' while the speed
according to /proc/mdstat doubles from ~25 MB/s to ~50 MB/s.
 
The tiobench command line used for testing was: tiobench --size 2048
--block 4096 --block 131072 --dir /mnt/raid --numruns 5
* iop342 had 1GB of memory available

Details:
* if CONFIG_DMA_ENGINE=n the asynchronous path is compiled away by making
  async_tx_find_channel a static inline routine that always returns NULL
* when a callback is specified for a given transaction an interrupt will
  fire at operation completion time and the callback will occur in a
  tasklet.  if the the channel does not support interrupts then a live
  polling wait will be performed
* the api is written as a dmaengine client that requests all available
  channels
* In support of dependencies the api implicitly schedules channel-switch
  interrupts.  The interrupt triggers the cleanup tasklet which causes
  pending operations to be scheduled on the next channel
* Xor engines treat an xor destination address differently than a software
  xor routine.  To the software routine the destination address is an implied
  source, whereas engines treat it as a write-only destination.  This patch
  modifies the xor_blocks routine to take a an explicit destination address
  to mirror the hardware.

Changelog:
* fixed a leftover debug print
* don't allow callbacks in async_interrupt_cond
* fixed xor_block changes
* fixed usage of ASYNC_TX_XOR_DROP_DEST
* drop dma mapping methods, suggested by Chris Leech
* printk warning fixups from Andrew Morton
* don't use inline in C files, Adrian Bunk
* select the API when MD is enabled
* BUG_ON xor source counts <= 1
* implicitly handle hardware concerns like channel switching and
  interrupts, Neil Brown
* remove the per operation type list, and distribute operation capabilities
  evenly amongst the available channels
* simplify async_tx_find_channel to optimize the fast path
* introduce the channel_table_initialized flag to prevent early calls to
  the api
* reorganize the code to mimic crypto
* include mm.h as not all archs include it in dma-mapping.h
* make the Kconfig options non-user visible, Adrian Bunk
* move async_tx under crypto since it is meant as 'core' functionality, and
  the two may share algorithms in the future
* move large inline functions into c files
* checkpatch.pl fixes
* gpl v2 only correction

Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-07-13 08:06:14 -07:00
Dan Williams 685784aaf3 xor: make 'xor_blocks' a library routine for use with async_tx
The async_tx api tries to use a dma engine for an operation, but will fall
back to an optimized software routine otherwise.  Xor support is
implemented using the raid5 xor routines.  For organizational purposes this
routine is moved to a common area.

The following fixes are also made:
* rename xor_block => xor_blocks, suggested by Adrian Bunk
* ensure that xor.o initializes before md.o in the built-in case
* checkpatch.pl fixes
* mark calibrate_xor_blocks __init, Adrian Bunk

Cc: Adrian Bunk <bunk@stusta.de>
Cc: NeilBrown <neilb@suse.de>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2007-07-13 08:06:14 -07:00