Computing sum introduces true data dependency. This patch removes some
true data depdendencies, hence increases instruction level parallelism.
This patch brings up to 50% csum performance gain on Loongson 3a.
One example about how this patch works is in CSUM_BIGCHUNK1:
// ** original ** vs ** patch applied **
ADDC(sum, t0) ADDC(t0, t1)
ADDC(sum, t1) ADDC(t2, t3)
ADDC(sum, t2) ADDC(sum, t0)
ADDC(sum, t3) ADDC(sum, t2)
In the original implementation, each ADDC(sum, ...) depends on the sum
value updated by previous ADDC(as source operand).
With this patch applied, the first two ADDC operations are independent,
hence can be executed simultaneously if possible.
Another example is in the "copy and sum calculating chunk":
// ** original ** vs ** patch applied **
STORE(t0, UNIT(0) ... STORE(t0, UNIT(0) ...
ADDC(sum, t0) ADDC(t0, t1)
STORE(t1, UNIT(1) ... STORE(t1, UNIT(1) ...
ADDC(sum, t1) ADDC(sum, t0)
STORE(t2, UNIT(2) ... STORE(t2, UNIT(2) ...
ADDC(sum, t2) ADDC(t2, t3)
STORE(t3, UNIT(3) ... STORE(t3, UNIT(3) ...
ADDC(sum, t3) ADDC(sum, t2)
With this patch applied, ADDC and the **next next** ADDC are independent.
Signed-off-by: chenj <chenj@lemote.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/9608/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
This change reverts most of commit
60724ca59e [MIPS: IP checksums: Remove
unncessary .set pseudos] that introduced warnings with the
CPU_DADDI_WORKAROUNDS option set:
arch/mips/lib/csum_partial.S: Assembler messages:
arch/mips/lib/csum_partial.S:467: Warning: used $3 with ".set at=$3"
arch/mips/lib/csum_partial.S:467: Warning: used $3 with ".set at=$3"
arch/mips/lib/csum_partial.S:467: Warning: used $3 with ".set at=$3"
arch/mips/lib/csum_partial.S:467: Warning: used $3 with ".set at=$3"
arch/mips/lib/csum_partial.S:467: Warning: used $3 with ".set at=$3"
arch/mips/lib/csum_partial.S:467: Warning: used $3 with ".set at=$3"
arch/mips/lib/csum_partial.S:467: Warning: used $3 with ".set at=$3"
arch/mips/lib/csum_partial.S:467: Warning: used $3 with ".set at=$3"
arch/mips/lib/csum_partial.S:467: Warning: used $3 with ".set at=$3"
arch/mips/lib/csum_partial.S:467: Warning: used $3 with ".set at=$3"
[...]
arch/mips/lib/csum_partial.S:577: Warning: used $3 with ".set at=$3"
arch/mips/lib/csum_partial.S:577: Warning: used $3 with ".set at=$3"
arch/mips/lib/csum_partial.S:577: Warning: used $3 with ".set at=$3"
arch/mips/lib/csum_partial.S:601: Warning: used $3 with ".set at=$3"
arch/mips/lib/csum_partial.S:601: Warning: used $3 with ".set at=$3"
arch/mips/lib/csum_partial.S:601: Warning: used $3 with ".set at=$3"
arch/mips/lib/csum_partial.S:601: Warning: used $3 with ".set at=$3"
[and so on, and so on...]
The warnings are benign and good code is produced regardless because no
macros that'd use the assembler's temporary register are involved, however
the `.set noat' directives removed by the commit referred are crucial to
guarantee this is still going to be the case after any changes in the
future. Therefore they need to be brought back to place which this
change does.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/6686/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
In preparation for EVA support, we use a macro to build the
__csum_partial_copy_user main code so it can be shared across
multiple implementations. EVA uses the same code but it replaces
the load/store/prefetch instructions with the EVA specific ones
therefore using a macro avoids unnecessary code duplications.
Signed-off-by: Markos Chandras <markos.chandras@imgtec.com>
Each load/store macro always adds an entry to the __ex_table
using the EXC macro. There are cases where a load instruction may
never fail such as when we are sure the load happens in the kernel
address space. Therefore, we merge these the EXC and LOADX/STOREX
macros into a single one. We also expand the argument list in the EXC
macro to make the macro more flexible. The extra 'type' argument is not
used by this commit, but it will be used when EVA support is added to
memcpy.
Signed-off-by: Markos Chandras <markos.chandras@imgtec.com>
The 'copy_user' symbol can be used to copy from or to
userland so we will use two different symbols for these
operations. This makes no difference in the existing code,
but when the core is operating in EVA mode, different instructions
need to be used to read and write to userland address space.
The old function has also been renamed to 'copy_kernel' to denote
that it is suitable for copy data to and from kernel space.
Signed-off-by: Markos Chandras <markos.chandras@imgtec.com>
The csum_partial implementation contain optimalizations for the MIPS R2
instruction set. This optimization is never enabled however because the
if directive uses the CPU_MIPSR2 constant which is not defined anywhere.
Use the CONFIG_CPU_MIPSR2 constant instead.
Signed-off-by: Gabor Juhos <juhosg@openwrt.org>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/4971/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Having received another series of whitespace patches I decided to do this
once and for all rather than dealing with this kind of patches trickling
in forever.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Use unsigned loads to avoid possible misscalculation of IP checksums. This
bug was instruced in f761106cd728bcf65b7fe161b10221ee00cf7132 (lmo) /
ed99e2bc1d (kernel.org).
[Original fix by Atsushi. Improved instruction scheduling and fix for
unaligned unsigned load by me -- Ralf]
Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
These symbols appear in oprofile output, stacktraces and similar but only
make the output harder to read. Many identical symbol names such as
"both_aligned" were also being used in multiple source files making it
impossible to see which file actually was meant. So let's get rid of them.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
This complements the generic R4000/R4400 errata workaround code and adds
bits for the daddiu problem. In most places it just modifies handwritten
assembly code so that the assembler is allowed to use a temporary register
as daddiu may now be treated as a macro that expands to a sequence of li
and daddu. It is the AT register or, where AT is unavailable or used
explicitly for another purpose, an explicitly-named register is selected,
using the .set at=<reg> feature added recently to gas. This feature is
only used if CONFIG_CPU_DADDI_WORKAROUNDS has been set, so if the
workaround remains disabled, the required version of binutils stays
unchanged.
Similarly, daddiu instructions put in branch delay slots in noreorder
fragments are now taken out of them and the assembler is allowed to
reorder them itself as possible (which it does making the whole idea of
scheduling them into delay slots manually questionable).
Also in the very few places where such a simple conversion was not
possible, a handcoded longer sequence is implemented.
Other than that there are changes to code responsible for building the
TLB fault and page clear/copy handlers to avoid daddiu as appropriate.
These are only effective if the erratum is verified to be present at the
run time.
Finally there is a trivial update to __delay(), because it uses daddiu in
a branch delay slot.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Implement optimized asm version of csum_partial_copy_nocheck,
csum_partial_copy_from_user and csum_and_copy_to_user which can do
calculate and copy in parallel, based on memcpy.S.
Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Delete dead codes at end of the function and move small_csumcopy
there. This makes some labels (maybe_end_cruft, small_memcpy,
end_bytes, out) needless and eliminates some branches.
Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Use standard o32 register name instead of T0, T1, etc, like memcpy.S.
Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
The 32-bit version and 64-bit version are almost equal. Unify them. This
makes further improvements (for example, copying with parallel, supporting
PREFETCH, etc.) easier.
Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>