After TimePoint's precision was increased in LLVM we started seeing
failures because the modification times didn't match. This adds a time
cast to ensure that we're comparing TimePoints with the same amount of
precision.
llvm-svn: 348283
This flag does not exist in GNU objcopy but has a major use case.
Debugging tools support the .build-id directory structure to find
debug binaries. There is no easy way to build this structure up
however. One way to do it is by using llvm-readelf and some crazy
shell magic. This implements the feature directly. It is most often
the case that you'll want to strip a file and send the original to
the .build-id directory but if you just want to send a file to the
.build-id directory you can copy to /dev/null instead.
Differential Revision: https://reviews.llvm.org/D54384
llvm-svn: 348174
Usually local symbols will have their address described in the debug
map. Global symbols have to have their address looked up in the symbol
table of the main executable. By playing with 'ld -r' and export lists,
you can get a symbol described as global by the debug map while actually
being a local symbol as far as the link in concerned. By gathering the
address of local symbols, we fix this issue.
Also, we prefer a global symbol in case of a name collision to preserve
the previous behavior.
Note that using the 'ld -r' tricks, people can actually cause symbol
names collisions that dsymutil has no way to figure out. This fixes the
simple case where there is only one symbol of a given name.
rdar://problem/32826621
Differential revision: https://reviews.llvm.org/D54922
llvm-svn: 348021
This patch removes a (potentially) slow while loop in
DefaultResourceStrategy::select(). A better (and faster) approach is to do some
bit manipulation in order to shrink the range of candidate resources.
On a release build, this change gives an average speedup of ~10%.
llvm-svn: 348007
Summary: The original intention of !Config.xx.empty() was probably to emphasize the thing that is currently considered, but I feel the simplified form is actually easier to understand and it is also consistent with the call sites in other llvm components.
Reviewers: alexshap, rupprecht, jakehehrlich, jhenderson, espindola
Reviewed By: alexshap, rupprecht
Subscribers: emaste, arichardson, llvm-commits
Differential Revision: https://reviews.llvm.org/D55040
llvm-svn: 347891
Summary:
We can sometimes end up with multiple copies of a local variable that
have the same GUID in the index. This happens when there are local
variables with the same name that are in different source files having the
same name/path at compile time (but compiled into different bitcode objects).
In this case make sure we import the copy in the caller's module.
This enables importing both of the variables having the same GUID
(but which will have different promoted names since the module paths,
and therefore the module hashes, will be distinct).
Importing the wrong copy is particularly problematic for read only
variables, since we must import them as a local copy whenever
referenced. Otherwise we get undefs at link time.
Note that the llvm-lto.cpp and ThinLTOCodeGenerator changes are needed
for testing the distributed index case via clang, which will be sent as
a separate clang-side patch shortly. We were previously not doing the
dead code/read only computation before computing imports when testing
distributed index generation (like it was for testing importing and
other ThinLTO mechanisms alone).
Reviewers: evgeny777
Subscribers: mehdi_amini, inglorion, eraman, steven_wu, dexonsmith, dang, llvm-commits
Differential Revision: https://reviews.llvm.org/D55047
llvm-svn: 347886
This patch adds the ability to specify via tablegen which processor resources
are load/store queue resources.
A new tablegen class named MemoryQueue can be optionally used to mark resources
that model load/store queues. Information about the load/store queue is
collected at 'CodeGenSchedule' stage, and analyzed by the 'SubtargetEmitter' to
initialize two new fields in struct MCExtraProcessorInfo named `LoadQueueID` and
`StoreQueueID`. Those two fields are identifiers for buffered resources used to
describe the load queue and the store queue.
Field `BufferSize` is interpreted as the number of entries in the queue, while
the number of units is a throughput indicator (i.e. number of available pickers
for loads/stores).
At construction time, LSUnit in llvm-mca checks for the presence of extra
processor information (i.e. MCExtraProcessorInfo) in the scheduling model. If
that information is available, and fields LoadQueueID and StoreQueueID are set
to a value different than zero (i.e. the invalid processor resource index), then
LSUnit initializes its LoadQueue/StoreQueue based on the BufferSize value
declared by the two processor resources.
With this patch, we more accurately track dynamic dispatch stalls caused by the
lack of LS tokens (i.e. load/store queue full). This is also shown by the
differences in two BdVer2 tests. Stalls that were previously classified as
generic SCHEDULER FULL stalls, are not correctly classified either as "load
queue full" or "store queue full".
About the differences in the -scheduler-stats view: those differences are
expected, because entries in the load/store queue are not released at
instruction issue stage. Instead, those are released at instruction executed
stage. This is the main reason why for the modified tests, the load/store
queues gets full before PdEx is full.
Differential Revision: https://reviews.llvm.org/D54957
llvm-svn: 347857
This reapplies r347767 (originally reviewed at: https://reviews.llvm.org/D55000)
with a fix for the missing std::move of the Error returned by the call to
Pipeline::runCycle().
Below is the original commit message from r347767.
If a user only cares about the overall latency, then the best/quickest way is to
change method Pipeline::run() so that it returns the total number of cycles to
the caller.
When the simulation pipeline is run, the number of cycles (or an error) is
returned from method Pipeline::run().
The advantage is that no hardware event listener is needed for computing that
latency. So, the whole process should be faster (and simpler - at least for that
particular use case).
llvm-svn: 347795
If a user only cares about the overall latency, then the best/quickest way is to
change method Pipeline::run() so that it returns the total number of cycles to
the caller.
When the simulation pipeline is run, the number of cycles (or an error) is
returned from method Pipeline::run().
The advantage is that no hardware event listener is needed for computing that
latency. So, the whole process should be faster (and simpler - at least for that
particular use case).
llvm-svn: 347767
This allows libtool to detect the presence of llvm-strip and use
it with the options --strip-debug and --strip-unneeded.
Also hook up the -V alias for objcopy.
Differential Revision: https://reviews.llvm.org/D54936
llvm-svn: 347731
Summary:
By default sanstats search binaries at the same location where they were when
stats was collected. Sometime you can not print report immediately or you need
to move post-processing to another workstation. To support this use-case when
original binary is missing sanstats will fall-back to directory with sanstats
file.
Reviewers: pcc
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D53857
llvm-svn: 347601
Summary: Help with off-line symbolization or other type debugging.
Reviewers: pcc
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D53606
llvm-svn: 347600
By default, llvm-mca conservatively assumes that a register operand from the
variadic sequence is both a register read and a register write. That is because
MCInstrDesc doesn't describe extra variadic operands; we don't have enough
dataflow information to tell which register operands from the variadic sequence
is a definition, and which is a use instead.
However, if a variadic instruction is flagged 'mayStore' (but not 'mayLoad'),
and it has no 'unmodeledSideEffects', then llvm-mca (very) optimistically
assumes that any register operand in the variadic sequence is a register read
only. Conversely, if a variadic instruction is marked as 'mayLoad' (but not
'mayStore'), and it has no 'unmodeledSideEffects', then llvm-mca optimistically
assumes that any extra register operand is a register definition only.
These assumptions work quite well for variadic load/store multiple instructions
defined by the ARM backend.
llvm-svn: 347522
With this change, InstrBuilder emits an error if the MCInst sequence contains an
instruction with a variadic opcode, and a non-zero number of variadic operands.
Currently we don't know how to correctly analyze variadic opcodes. The problem
with variadic operands is that there is no information for them in the opcode
descriptor (i.e. MCInstrDesc). That means, we don't know which variadic operands
are defs, and which are uses.
In future, we could try to conservatively assume that any extra register
operands is both a register use and a register definition.
This patch fixes a subtle bug in the evaluation of read/write operands for ARM
VLD1 with implicit index update. Added test vld1-index-update.s
llvm-svn: 347503
RetireControlUnitStatistics now reports extra information about the ROB and the
avg/maximum number of entries consumed over the entire simulation.
Example:
Retire Control Unit - number of cycles where we saw N instructions retired:
[# retired], [# cycles]
0, 109 (17.9%)
1, 102 (16.7%)
2, 399 (65.4%)
Total ROB Entries: 64
Max Used ROB Entries: 35 ( 54.7% )
Average Used ROB Entries per cy: 32 ( 50.0% )
Documentation in llvm/docs/CommandGuide/llvmn-mca.rst has been updated to
reflect this change.
llvm-svn: 347493
Also, try to minimize the number of queries to the memory queues to speedup the
analysis.
On average, this change gives a small 2% speedup. For memcpy-like kernels, the
speedup is up to 5.5%.
llvm-svn: 347469
This avoids a heap allocation most of the times.
This patch gives a small but consistent 3% speedup on a release build (up to ~5%
on a debug build).
llvm-svn: 347464
This patch fixes an invalid memory read introduced by r346487.
Before this patch, partial register write had to query the latency of the
dependent full register write by calling a method on the full write descriptor.
However, if the full write is from an already retired instruction, chances are
that the EntryStage already reclaimed its memory.
In some parial register write tests, valgrind was reporting an invalid
memory read.
This change fixes the invalid memory access problem. Writes are now responsible
for tracking dependent partial register writes, and notify them in the event of
instruction issued.
That means, partial register writes no longer need to query their associated
full write to check when they are ready to execute.
Added test X86/BtVer2/partial-reg-update-7.s
llvm-svn: 347459
Apply review comments of https://reviews.llvm.org/D54185 to other target as well, specifically:
1. make anonymous namespaces as small as possible, avoid using static inside anonymous namespaces
2. Add missing header to some files
3. GetLoadImmediateOpcodem-> getLoadImmediateOpcode
4. Fix typo
Differential Revision: https://reviews.llvm.org/D54343
llvm-svn: 347309
This will hold flags specific to subprograms. In the future
we could potentially free up scarce bits in DIFlags by moving
subprogram-specific flags from there to the new flags word.
This patch does not change IR/bitcode formats, that will be
done in a follow-up.
Differential Revision: https://reviews.llvm.org/D54597
llvm-svn: 347239
MachOObjectFile::getHostArch() returns a temporary, and getArchName
returns a StringRef pointing to a temporary std::string.
No tests since it doesn't trigger any errors except with the sanitizers.
llvm-svn: 347230
This patch defines an interleaved-load-combine pass. The pass searches
for ShuffleVector instructions that represent interleaved loads. Matches are
converted such that they will be captured by the InterleavedAccessPass.
The pass extends LLVMs capabilities to use target specific instruction
selection of interleaved load patterns (e.g.: ld4 on Aarch64
architectures).
Differential Revision: https://reviews.llvm.org/D52653
llvm-svn: 347208
Summary:
As it was pointed out in D54388+D54390, the maximal size of `Neighbors` is known,
it will contain at most Points_.size() minus one (the center of the cluster)
While that is the upper bound, meaning in the most cases, the actual count
will be much smaller, since D54390 made the allocation persistent,
we no longer have to worry about overly-optimistically `reserve()`ing.
Old: (D54393)
```
Performance counter stats for './bin/llvm-exegesis -mode=analysis -analysis-epsilon=100000 -benchmarks-file=/tmp/benchmarks.yaml -analysis-inconsistencies-output-file=/tmp/clusters.html' (16 runs):
6553.167456 task-clock (msec) # 1.000 CPUs utilized ( +- 0.21% )
...
6.5547 +- 0.0134 seconds time elapsed ( +- 0.20% )
```
New:
```
Performance counter stats for './bin/llvm-exegesis -mode=analysis -analysis-epsilon=100000 -benchmarks-file=/tmp/benchmarks.yaml -analysis-inconsistencies-output-file=/tmp/clusters.html' (16 runs):
6315.057872 task-clock (msec) # 0.999 CPUs utilized ( +- 0.24% )
...
6.3187 +- 0.0160 seconds time elapsed ( +- 0.25% )
```
And that is another -~4%.
Since this is the last (as of this moment) patch in this patch series,
it is a good time to summarize:
Old: (svn trunk, as stated in D54381)
```
$ time ./bin/llvm-exegesis -mode=analysis -analysis-epsilon=100000 -benchmarks-file=/tmp/benchmarks.yaml -analysis-inconsistencies-output-file=/tmp/clusters.html &> /dev/null
real 0m24.884s
user 0m24.099s
sys 0m0.785s
```
So these patches, on a given benchmark,
has decreased llvm-exegesis analysis time by 74.62%.
There surely is more room for further improvements.
D54514 may improve thins by -11.5% more (relative to this patch).
Parallelization may improve things further significantly, too.
Reviewers: courbet, MaskRay, RKSimon, gchatelet, john.brawn
Reviewed By: courbet, MaskRay
Subscribers: tschuett, llvm-commits
Differential Revision: https://reviews.llvm.org/D54415
llvm-svn: 347204
Summary:
Test data: 500kLOC of benchmark.yaml, 23Mb. (that is a subset of the actual uops benchmark i was trying to analyze!)
Old time: (D54381)
```
$ time ./bin/llvm-exegesis -mode=analysis -analysis-epsilon=100000 -benchmarks-file=/tmp/benchmarks.yaml -analysis-inconsistencies-output-file=/tmp/clusters.html &> /dev/null
real 0m10.487s
user 0m9.745s
sys 0m0.740s
```
New time:
```
$ time ./bin/llvm-exegesis -mode=analysis -analysis-epsilon=100000 -benchmarks-file=/tmp/benchmarks.yaml -analysis-inconsistencies-output-file=/tmp/clusters.html &> /dev/null
real 0m9.599s
user 0m8.824s
sys 0m0.772s
```
Not that much, around -9%. But that is not the good part yet, again.
Old:
* calls to allocation functions: 3347676
* temporary allocations: 277818
* bytes allocated in total (ignoring deallocations): 10.52 GB
New:
* calls to allocation functions: 2109712 (-36%)
* temporary allocations: 33112 (-88%)
* bytes allocated in total (ignoring deallocations): 4.43 GB (-58% *sic*)
Reviewers: courbet, MaskRay, RKSimon, gchatelet, john.brawn
Reviewed By: courbet, MaskRay
Subscribers: tschuett, llvm-commits
Differential Revision: https://reviews.llvm.org/D54382
llvm-svn: 347198
Summary:
Test data: 500kLOC of benchmark.yaml, 23Mb. (that is a subset of the actual uops benchmark i was trying to analyze!)
Old time:
```
$ time ./bin/llvm-exegesis -mode=analysis -analysis-epsilon=100000 -benchmarks-file=/tmp/benchmarks.yaml -analysis-inconsistencies-output-file=/tmp/clusters.html &> /dev/null
real 0m24.884s
user 0m24.099s
sys 0m0.785s
```
New time:
```
$ time ./bin/llvm-exegesis -mode=analysis -analysis-epsilon=100000 -benchmarks-file=/tmp/benchmarks.yaml -analysis-inconsistencies-output-file=/tmp/clusters.html &> /dev/null
real 0m10.469s
user 0m9.797s
sys 0m0.672s
```
So -60%. And that isn't the good bit yet.
Old:
* calls to allocation functions: 106560180 (yes, 107 *million* allocations.)
* bytes allocated in total (ignoring deallocations): 12.17 GB
New:
* calls to allocation functions: 3347676 (-96.86%) (just 3 mil)
* bytes allocated in total (ignoring deallocations): 10.52 GB (~2GB less)
---
Two points i want to raise:
* `std::unordered_set<>` should not have been used there in the first place.
It is banned by the https://llvm.org/docs/ProgrammersManual.html#other-set-like-container-options
* There is no tests, so i'm not fully sure this is correct.
Since it was unordered set, i guess there are zero restrictions on the order, and anything will be ok?
* I tried other containers suggested in https://llvm.org/docs/ProgrammersManual.html#set-like-containers-std-set-smallset-setvector-etc,
this `llvm::SetVector<>` seems to be best here.
Reviewers: courbet, MaskRay, RKSimon, gchatelet, john.brawn
Reviewed By: courbet
Subscribers: kristina, bobsayshilol, tschuett, llvm-commits
Differential Revision: https://reviews.llvm.org/D54381
llvm-svn: 347197
Summary:
According to `MaskRay`, use `auto` for type inference, according to coding standards.
Delete some comments, because these comments can be easily inferred from codes.
Reviewers: jhenderson, MaskRay
Reviewed By: jhenderson
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D54573
llvm-svn: 346946
Summary:
This adds support for the 'event section' specified in the exception
handling proposal. (This was named 'exception section' first, but later
renamed to 'event section' to take possibilities of other kinds of
events into consideration. But currently we only store exception info in
this section.)
The event section is added between the global section and the export
section. This is for ease of validation per request of the V8 team.
This patch:
- Creates the event symbol type, which is a weak symbol
- Makes 'throw' instruction take the event symbol '__cpp_exception'
- Adds relocation support for events
- Adds WasmObjectWriter / WasmObjectFile (Reader) support
- Adds obj2yaml / yaml2obj support
- Adds '.eventtype' printing support
Reviewers: dschuff, sbc100, aardappel
Subscribers: jgravelle-google, sunfish, llvm-commits
Differential Revision: https://reviews.llvm.org/D54096
llvm-svn: 346825