llvm-project/llvm/tools/llvm-exegesis/lib/BenchmarkRunner.h

Ignoring revisions in .git-blame-ignore-revs. Click here to bypass and see the normal blame view.

95 lines
3.0 KiB
C
Raw Normal View History

//===-- BenchmarkRunner.h ---------------------------------------*- C++ -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
///
/// \file
/// Defines the abstract BenchmarkRunner class for measuring a certain execution
/// property of instructions (e.g. latency).
///
//===----------------------------------------------------------------------===//
#ifndef LLVM_TOOLS_LLVM_EXEGESIS_BENCHMARKRUNNER_H
#define LLVM_TOOLS_LLVM_EXEGESIS_BENCHMARKRUNNER_H
#include "Assembler.h"
#include "BenchmarkCode.h"
#include "BenchmarkResult.h"
#include "LlvmState.h"
#include "MCInstrDescView.h"
#include "SnippetRepetitor.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/MC/MCInst.h"
#include "llvm/Support/Error.h"
#include <cstdlib>
#include <memory>
#include <vector>
namespace llvm {
namespace exegesis {
// Common code for all benchmark modes.
class BenchmarkRunner {
public:
explicit BenchmarkRunner(const LLVMState &State,
InstructionBenchmark::ModeE Mode);
virtual ~BenchmarkRunner();
Expected<InstructionBenchmark>
runConfiguration(const BenchmarkCode &Configuration, unsigned NumRepetitions,
[llvm-exegesis] Loop unrolling for loop snippet repetitor mode I really needed this, like, factually, yesterday, when verifying dependency breaking idioms for AMD Zen 3 scheduler model. Consider the following example: ``` $ ./bin/llvm-exegesis --mode=inverse_throughput --snippets-file=/tmp/snippet.s --num-repetitions=1000000 --repetition-mode=duplicate Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-4a7e50.o --- mode: inverse_throughput key: instructions: - 'VPXORYrr YMM0 YMM0 YMM0' config: '' register_initial_values: [] cpu_name: znver3 llvm_triple: x86_64-unknown-linux-gnu num_repetitions: 1000000 measurements: - { key: inverse_throughput, value: 0.31025, per_snippet_value: 0.31025 } error: '' info: '' assembled_snippet: C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C3 ... ``` What does it tell us? So wait, it can only execute ~3 x86 AVX YMM PXOR zero-idioms per cycle? That doesn't seem right. That's even less than there are pipes supporting this type of op. Now, second example: ``` $ ./bin/llvm-exegesis --mode=inverse_throughput --snippets-file=/tmp/snippet.s --num-repetitions=1000000 --repetition-mode=loop Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-2418b5.o --- mode: inverse_throughput key: instructions: - 'VPXORYrr YMM0 YMM0 YMM0' config: '' register_initial_values: [] cpu_name: znver3 llvm_triple: x86_64-unknown-linux-gnu num_repetitions: 1000000 measurements: - { key: inverse_throughput, value: 1.00011, per_snippet_value: 1.00011 } error: '' info: '' assembled_snippet: 49B80800000000000000C5FDEFC0C5FDEFC04983C0FF75F2C3 ... ``` Now that's just worse. Due to the looping, the throughput completely plummeted, and now we can only do a single instruction/cycle!? That's not great. And final example: ``` $ ./bin/llvm-exegesis --mode=inverse_throughput --snippets-file=/tmp/snippet.s --num-repetitions=1000000 --repetition-mode=loop --loop-body-size=1000 Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-c402e2.o --- mode: inverse_throughput key: instructions: - 'VPXORYrr YMM0 YMM0 YMM0' config: '' register_initial_values: [] cpu_name: znver3 llvm_triple: x86_64-unknown-linux-gnu num_repetitions: 1000000 measurements: - { key: inverse_throughput, value: 0.167087, per_snippet_value: 0.167087 } error: '' info: '' assembled_snippet: 49B80800000000000000C5FDEFC0C5FDEFC04983C0FF75F2C3 ... ``` So if we merge the previous two approaches, do duplicate this single-instruction snippet 1000x (loop-body-size/instruction count in snippet), and run a loop with 1000 iterations over that duplicated/unrolled snippet, the measured throughput goes through the roof, up to 5.9 instructions/cycle, which finally tells us that this idiom is zero-cycle! Reviewed By: courbet Differential Revision: https://reviews.llvm.org/D102522
2021-05-25 16:48:43 +08:00
unsigned LoopUnrollFactor,
[llvm-exegesis] 'Min' repetition mode Summary: As noted in documentation, different repetition modes have different trade-offs: > .. option:: -repetition-mode=[duplicate|loop] > > Specify the repetition mode. `duplicate` will create a large, straight line > basic block with `num-repetitions` copies of the snippet. `loop` will wrap > the snippet in a loop which will be run `num-repetitions` times. The `loop` > mode tends to better hide the effects of the CPU frontend on architectures > that cache decoded instructions, but consumes a register for counting > iterations. Indeed. Example: >>! In D74156#1873657, @lebedev.ri wrote: > At least for `CMOV`, i'm seeing wildly different results > | | Latency | RThroughput | > | duplicate | 1 | 0.8 | > | loop | 2 | 0.6 | > where latency=1 seems correct, and i'd expect the througput to be close to 1/2 (since there are two execution units). This isn't great for analysis, at least for schedule model development. As discussed in excruciating detail in >>! In D74156#1924514, @gchatelet wrote: >>>! In D74156#1920632, @lebedev.ri wrote: >> ... did that explanation of the question i'm having made any sense? > > Thx for digging in the conversation ! > Ok it makes more sense now. > > I discussed it a bit with @courbet: > - We want the analysis tool to stay simple so we'd rather not make it knowledgeable of the repetition mode. > - We'd like to still be able to select either repetition mode to dig into special cases > > So we could add a third `min` repetition mode that would run both and take the minimum. It could be the default option. > Would you have some time to look what it would take to add this third mode? there appears to be an agreement that it is indeed sub-par, and that we should provide an optional, measurement (not analysis!) -time way to rectify the situation. However, the solutions isn't entirely straight-forward. We can just add an actual 'multiplexer' `MinSnippetRepetitor`, because if we just concatenate snippets produced by `DuplicateSnippetRepetitor` and `LoopSnippetRepetitor` and run+measure that, the measurement will naturally be different from what we'd get by running+measuring them separately and taking the min. ([[ https://www.wolframalpha.com/input/?i=%28x%2By%29%2F2+%21%3D+min%28x%2C+y%29 | `time(D+L)/2 != min(time(D), time(L))` ]]) Also, it seems best to me to have a single snippet instead of generating a snippet per repetition mode, since the only difference here is that the loop repetition mode reserves one register for loop counter. As far as i can tell, we can either teach `BenchmarkRunner::runConfiguration()` to produce a single report given multiple repetitors (as in the patch), or do that one layer higher - don't modify `BenchmarkRunner::runConfiguration()`, produce multiple reports, don't actually print each one, but aggregate them somehow and only print the final one. Initially i've gone ahead with the latter approach, but it didn't look like a natural fit; the former (as in the diff) does seem like a better fit to me. There's also a question of the test coverage. It sure currently does work here: ``` $ ./bin/llvm-exegesis --opcode-name=CMOV64rr --mode=inverse_throughput --repetition-mode=duplicate Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-8fb949.o --- mode: inverse_throughput key: instructions: - 'CMOV64rr RAX RAX R11 i_0x0' - 'CMOV64rr RBP RBP R15 i_0x0' - 'CMOV64rr RBX RBX RBX i_0x0' - 'CMOV64rr RCX RCX RBX i_0x0' - 'CMOV64rr RDI RDI R10 i_0x0' - 'CMOV64rr RDX RDX RAX i_0x0' - 'CMOV64rr RSI RSI RAX i_0x0' - 'CMOV64rr R8 R8 R8 i_0x0' - 'CMOV64rr R9 R9 RDX i_0x0' - 'CMOV64rr R10 R10 RBX i_0x0' - 'CMOV64rr R11 R11 R14 i_0x0' - 'CMOV64rr R12 R12 R9 i_0x0' - 'CMOV64rr R13 R13 R12 i_0x0' - 'CMOV64rr R14 R14 R15 i_0x0' - 'CMOV64rr R15 R15 R13 i_0x0' config: '' register_initial_values: - 'RAX=0x0' - 'R11=0x0' - 'EFLAGS=0x0' - 'RBP=0x0' - 'R15=0x0' - 'RBX=0x0' - 'RCX=0x0' - 'RDI=0x0' - 'R10=0x0' - 'RDX=0x0' - 'RSI=0x0' - 'R8=0x0' - 'R9=0x0' - 'R14=0x0' - 'R12=0x0' - 'R13=0x0' cpu_name: bdver2 llvm_triple: x86_64-unknown-linux-gnu num_repetitions: 10000 measurements: - { key: inverse_throughput, value: 0.819, per_snippet_value: 12.285 } error: '' info: instruction has tied variables, using static renaming. assembled_snippet: 5541574156415541545348B8000000000000000049BB00000000000000004883EC08C7042400000000C7442404000000009D48BD000000000000000049BF000000000000000048BB000000000000000048B9000000000000000048BF000000000000000049BA000000000000000048BA000000000000000048BE000000000000000049B8000000000000000049B9000000000000000049BE000000000000000049BC000000000000000049BD0000000000000000490F40C3490F40EF480F40DB480F40CB490F40FA480F40D0480F40F04D0F40C04C0F40CA4C0F40D34D0F40DE4D0F40E14D0F40EC4D0F40F74D0F40FD490F40C35B415C415D415E415F5DC3 ... $ ./bin/llvm-exegesis --opcode-name=CMOV64rr --mode=inverse_throughput --repetition-mode=loop Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-051eb3.o --- mode: inverse_throughput key: instructions: - 'CMOV64rr RAX RAX R11 i_0x0' - 'CMOV64rr RBP RBP RSI i_0x0' - 'CMOV64rr RBX RBX R9 i_0x0' - 'CMOV64rr RCX RCX RSI i_0x0' - 'CMOV64rr RDI RDI RBP i_0x0' - 'CMOV64rr RDX RDX R9 i_0x0' - 'CMOV64rr RSI RSI RDI i_0x0' - 'CMOV64rr R9 R9 R12 i_0x0' - 'CMOV64rr R10 R10 R11 i_0x0' - 'CMOV64rr R11 R11 R9 i_0x0' - 'CMOV64rr R12 R12 RBP i_0x0' - 'CMOV64rr R13 R13 RSI i_0x0' - 'CMOV64rr R14 R14 R14 i_0x0' - 'CMOV64rr R15 R15 R10 i_0x0' config: '' register_initial_values: - 'RAX=0x0' - 'R11=0x0' - 'EFLAGS=0x0' - 'RBP=0x0' - 'RSI=0x0' - 'RBX=0x0' - 'R9=0x0' - 'RCX=0x0' - 'RDI=0x0' - 'RDX=0x0' - 'R12=0x0' - 'R10=0x0' - 'R13=0x0' - 'R14=0x0' - 'R15=0x0' cpu_name: bdver2 llvm_triple: x86_64-unknown-linux-gnu num_repetitions: 10000 measurements: - { key: inverse_throughput, value: 0.6083, per_snippet_value: 8.5162 } error: '' info: instruction has tied variables, using static renaming. assembled_snippet: 5541574156415541545348B8000000000000000049BB00000000000000004883EC08C7042400000000C7442404000000009D48BD000000000000000048BE000000000000000048BB000000000000000049B9000000000000000048B9000000000000000048BF000000000000000048BA000000000000000049BC000000000000000049BA000000000000000049BD000000000000000049BE000000000000000049BF000000000000000049B80200000000000000490F40C3480F40EE490F40D9480F40CE480F40FD490F40D1480F40F74D0F40CC4D0F40D34D0F40D94C0F40E54C0F40EE4D0F40F64D0F40FA4983C0FF75C25B415C415D415E415F5DC3 ... $ ./bin/llvm-exegesis --opcode-name=CMOV64rr --mode=inverse_throughput --repetition-mode=min Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-c7a47d.o Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-2581f1.o --- mode: inverse_throughput key: instructions: - 'CMOV64rr RAX RAX R11 i_0x0' - 'CMOV64rr RBP RBP R10 i_0x0' - 'CMOV64rr RBX RBX R10 i_0x0' - 'CMOV64rr RCX RCX RDX i_0x0' - 'CMOV64rr RDI RDI RAX i_0x0' - 'CMOV64rr RDX RDX R9 i_0x0' - 'CMOV64rr RSI RSI RAX i_0x0' - 'CMOV64rr R9 R9 RBX i_0x0' - 'CMOV64rr R10 R10 R12 i_0x0' - 'CMOV64rr R11 R11 RDI i_0x0' - 'CMOV64rr R12 R12 RDI i_0x0' - 'CMOV64rr R13 R13 RDI i_0x0' - 'CMOV64rr R14 R14 R9 i_0x0' - 'CMOV64rr R15 R15 RBP i_0x0' config: '' register_initial_values: - 'RAX=0x0' - 'R11=0x0' - 'EFLAGS=0x0' - 'RBP=0x0' - 'R10=0x0' - 'RBX=0x0' - 'RCX=0x0' - 'RDX=0x0' - 'RDI=0x0' - 'R9=0x0' - 'RSI=0x0' - 'R12=0x0' - 'R13=0x0' - 'R14=0x0' - 'R15=0x0' cpu_name: bdver2 llvm_triple: x86_64-unknown-linux-gnu num_repetitions: 10000 measurements: - { key: inverse_throughput, value: 0.6073, per_snippet_value: 8.5022 } error: '' info: instruction has tied variables, using static renaming. assembled_snippet: 5541574156415541545348B8000000000000000049BB00000000000000004883EC08C7042400000000C7442404000000009D48BD000000000000000049BA000000000000000048BB000000000000000048B9000000000000000048BA000000000000000048BF000000000000000049B9000000000000000048BE000000000000000049BC000000000000000049BD000000000000000049BE000000000000000049BF0000000000000000490F40C3490F40EA490F40DA480F40CA480F40F8490F40D1480F40F04C0F40CB4D0F40D44C0F40DF4C0F40E74C0F40EF4D0F40F14C0F40FD490F40C3490F40EA5B415C415D415E415F5DC35541574156415541545348B8000000000000000049BB00000000000000004883EC08C7042400000000C7442404000000009D48BD000000000000000049BA000000000000000048BB000000000000000048B9000000000000000048BA000000000000000048BF000000000000000049B9000000000000000048BE000000000000000049BC000000000000000049BD000000000000000049BE000000000000000049BF000000000000000049B80200000000000000490F40C3490F40EA490F40DA480F40CA480F40F8490F40D1480F40F04C0F40CB4D0F40D44C0F40DF4C0F40E74C0F40EF4D0F40F14C0F40FD4983C0FF75C25B415C415D415E415F5DC3 ... ``` but i open to suggestions as to how test that. I also have gone with the suggestion to default to this new mode. This was irking me for some time, so i'm happy to finally see progress here. Looking forward to feedback. Reviewers: courbet, gchatelet Reviewed By: courbet, gchatelet Subscribers: mstojanovic, RKSimon, llvm-commits, courbet, gchatelet Tags: #llvm Differential Revision: https://reviews.llvm.org/D76921
2020-04-02 14:28:35 +08:00
ArrayRef<std::unique_ptr<const SnippetRepetitor>> Repetitors,
bool DumpObjectToDisk) const;
// Scratch space to run instructions that touch memory.
struct ScratchSpace {
static constexpr const size_t kAlignment = 1024;
static constexpr const size_t kSize = 1 << 20; // 1MB.
ScratchSpace()
: UnalignedPtr(std::make_unique<char[]>(kSize + kAlignment)),
AlignedPtr(
UnalignedPtr.get() + kAlignment -
(reinterpret_cast<intptr_t>(UnalignedPtr.get()) % kAlignment)) {}
char *ptr() const { return AlignedPtr; }
void clear() { std::memset(ptr(), 0, kSize); }
private:
const std::unique_ptr<char[]> UnalignedPtr;
char *const AlignedPtr;
};
// A helper to measure counters while executing a function in a sandboxed
// context.
class FunctionExecutor {
public:
virtual ~FunctionExecutor();
// FIXME deprecate this.
virtual Expected<int64_t> runAndMeasure(const char *Counters) const = 0;
virtual Expected<llvm::SmallVector<int64_t, 4>>
runAndSample(const char *Counters) const = 0;
};
protected:
const LLVMState &State;
const InstructionBenchmark::ModeE Mode;
private:
virtual Expected<std::vector<BenchmarkMeasure>>
runMeasurements(const FunctionExecutor &Executor) const = 0;
Expected<std::string> writeObjectFile(const BenchmarkCode &Configuration,
const FillFunction &Fill) const;
const std::unique_ptr<ScratchSpace> Scratch;
};
} // namespace exegesis
} // namespace llvm
#endif // LLVM_TOOLS_LLVM_EXEGESIS_BENCHMARKRUNNER_H