We are seeing extremely long time in building AMDGPUInstPrinter.cpp
when profile instrumentation is enabled: It takes more than 5 minutes
(compared to ~8 seconds in non-instrument build).
This caused by the huge statements in printInstruction functions. In
profile instrumentation build, we need have extra control flow to
differentiate each case statement. This in turn adds significant
compile time in block placement and branch folding.
Function printInstruction is not likely to benefit from PGO build
as it's rarely executed in a typical compilation. So here I disable
the profile instrumentation for this function.
Differential Revision: https://reviews.llvm.org/D111682
Rather than checking for loop nest preheaders upfront in IVUsers,
move this requirement into isSafeToExpand() from SCEVExpander.
Historically, LSR did not check whether SCEVs are safe to expand
and fully relied on IVUsers to validate this. Later, support for
non-expandable SCEVs was added via rigid formulas.
Checking this in isSafeToExpand() makes it more obvious what
exactly this check is guarding against, and avoids the awkward
loop nest scan.
This is a followup to https://reviews.llvm.org/D111493#3055286.
Differential Revision: https://reviews.llvm.org/D111681
This patch add a TableManager which reponsible for fixing edges that need entries to reference the target symbol and constructing such entries.
In the past, the PerGraphGOTAndPLTStubsBuilder pass was used to build GOT and PLT entry, and the PerGraphTLSInfoEntryBuilder pass was used to build TLSInfo entry. By generalizing the behavior of building entry, I added a TableManager which could be reused when built GOT, PLT and TLSInfo entries.
If this patch makes sense and can be accepted, I will apply the TableManager to other targets(MachO_x86_64, MachO_arm64, ELF_riscv), and delete the file PerGraphGOTAndPLTStubsBuilder.h
Reviewed By: lhames
Differential Revision: https://reviews.llvm.org/D110383
SimpleRemoteEPC notionally allowed subclasses to override the
createMemoryManager and createMemoryAccess methods to use custom objects, but
could not actually be subclassed in practice (The construction process in
SimpleRemoteEPC::Create could not be re-used).
Instead of subclassing, this commit adds a SimpleRemoteEPC::Setup class that
can be used by clients to set up the memory manager and memory access members.
A default-constructed Setup object results in no change from previous behavior
(EPCGeneric* memory manager and memory access objects used by default).
After 8fc7a907b9, this loop does
the same as a plain `std::replace`.
Also clarify the comment about what this function does.
Differential Revision: https://reviews.llvm.org/D111730
Adds initial parsing and sema for the 'adjust_args' clause.
Note that an AST clause is not created as it instead adds its expressions
to the OMPDeclareVariantAttr.
Differential Revision: https://reviews.llvm.org/D99905
If we have an instruction which produces poison only when flags are specified on the instruction, then we know that freezing the operands and dropping flags is equivalent to freezing the result. If we know those flags don't result in any undefined behavior being executed, then there's no point in preserving the flags as we gain no knowledge by having them.
This patch extends the existing propagation logic which sinks freeze to single potential non-poison operands to allow dropping of flags when we know the freeze is the sole use of the instruction with poison flags.
The main value is that we tend to sink freezes towards the phi in IV cycles where the incoming value to the phi is the freeze of an IV increment. This will in turn (in a future patch), let us fold the freeze through the phi into the loop preheader. Motivated by eliminating need for CanonicalizeFreezeInLoops for the clearly profitable cases from onephi.ll test case in the test directory.
Differential Revision: https://reviews.llvm.org/D111675
If another inlining session came after a ModuleInlinerWrapperPass, the
advisor alanysis would still be cached, but its Result would be cleared.
We need to clear both.
This addresses PR52118
Differential Revision: https://reviews.llvm.org/D111586
This patch continues unblocking optimizations that are blocked by pseudo probe instrumentation.
Not exactly like DbgIntrinsics, PseudoProbe intrinsic has other attributes (such as mayread, maywrite, mayhaveSideEffect) that can block optimizations. The issues fixed are:
- Flipped default param of getFirstNonPHIOrDbg API to skip pseudo probes
- Unblocked CSE by avoiding pseudo probe from clobbering memory SSA
- Unblocked induction variable simpliciation
- Allow empty loop deletion by treating probe intrinsic isDroppable
- Some refactoring.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D110847
This patch adds a new cost heuristic that allows peeling a single
iteration off read-only loops, if the loop contains a load that
1. is feeding an exit condition,
2. dominates the latch,
3. is not already known to be dereferenceable,
4. and has a loop invariant address.
If all non-latch exits are terminated with unreachable, such loads
in the loop are guaranteed to be dereferenceable after peeling,
enabling hoisting/CSE'ing them.
This enables vectorization of loops with certain runtime-checks, like
multiple calls to `std::vector::at` if the vector is passed as pointer.
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D108114
There is no code that uses this base class yet, hence the typo went
unnoticed when this class was added in D105871
Differential Revision: https://reviews.llvm.org/D110930
Rename vfredsum and vfwredsum to vfredusum and vfwredusum. Add aliases for vfredsum and vfwredsum.
Reviewed By: luismarques, HsiangKai, khchen, frasercrmck, kito-cheng, craig.topper
Differential Revision: https://reviews.llvm.org/D105690
Adds explicit narrowing casts to JITLinkMemoryManager.cpp.
Honors -slab-address option in llvm-jitlink.cpp, which was accidentally
dropped in the refactor.
This effectively reverts commit 6641d29b70.
This commit substantially refactors the JITLinkMemoryManager API to: (1) add
asynchronous versions of key operations, (2) give memory manager implementations
full control over link graph address layout, (3) enable more efficient tracking
of allocated memory, and (4) support "allocation actions" and finalize-lifetime
memory.
Together these changes provide a more usable API, and enable more powerful and
efficient memory manager implementations.
To support these changes the JITLinkMemoryManager::Allocation inner class has
been split into two new classes: InFlightAllocation, and FinalizedAllocation.
The allocate method returns an InFlightAllocation that tracks memory (both
working and executor memory) prior to finalization. The finalize method returns
a FinalizedAllocation object, and the InFlightAllocation is discarded. Breaking
Allocation into InFlightAllocation and FinalizedAllocation allows
InFlightAllocation subclassses to be written more naturally, and FinalizedAlloc
to be implemented and used efficiently (see (3) below).
In addition to the memory manager changes this commit also introduces a new
MemProt type to represent memory protections (MemProt replaces use of
sys::Memory::ProtectionFlags in JITLink), and a new MemDeallocPolicy type that
can be used to indicate when a section should be deallocated (see (4) below).
Plugin/pass writers who were using sys::Memory::ProtectionFlags will have to
switch to MemProt -- this should be straightworward. Clients with out-of-tree
memory managers will need to update their implementations. Clients using
in-tree memory managers should mostly be able to ignore it.
Major features:
(1) More asynchrony:
The allocate and deallocate methods are now asynchronous by default, with
synchronous convenience wrappers supplied. The asynchronous versions allow
clients (including JITLink) to request and deallocate memory without blocking.
(2) Improved control over graph address layout:
Instead of a SegmentRequestMap, JITLinkMemoryManager::allocate now takes a
reference to the LinkGraph to be allocated. The memory manager is responsible
for calculating the memory requirements for the graph, and laying out the graph
(setting working and executor memory addresses) within the allocated memory.
This gives memory managers full control over JIT'd memory layout. For clients
that don't need or want this degree of control the new "BasicLayout" utility can
be used to get a segment-based view of the graph, similar to the one provided by
SegmentRequestMap. Once segment addresses are assigned the BasicLayout::apply
method can be used to automatically lay out the graph.
(3) Efficient tracking of allocated memory.
The FinalizedAlloc type is a wrapper for an ExecutorAddr and requires only
64-bits to store in the controller. The meaning of the address held by the
FinalizedAlloc is left up to the memory manager implementation, but the
FinalizedAlloc type enforces a requirement that deallocate be called on any
non-default values prior to destruction. The deallocate method takes a
vector<FinalizedAlloc>, allowing for bulk deallocation of many allocations in a
single call.
Memory manager implementations will typically store the address of some
allocation metadata in the executor in the FinalizedAlloc, as holding this
metadata in the executor is often cheaper and may allow for clean deallocation
even in failure cases where the connection with the controller is lost.
(4) Support for "allocation actions" and finalize-lifetime memory.
Allocation actions are pairs (finalize_act, deallocate_act) of JITTargetAddress
triples (fn, arg_buffer_addr, arg_buffer_size), that can be attached to a
finalize request. At finalization time, after memory protections have been
applied, each of the "finalize_act" elements will be called in order (skipping
any elements whose fn value is zero) as
((char*(*)(const char *, size_t))fn)((const char *)arg_buffer_addr,
(size_t)arg_buffer_size);
At deallocation time the deallocate elements will be run in reverse order (again
skipping any elements where fn is zero).
The returned char * should be null to indicate success, or a non-null
heap-allocated string error message to indicate failure.
These actions allow finalization and deallocation to be extended to include
operations like registering and deregistering eh-frames, TLS sections,
initializer and deinitializers, and language metadata sections. Previously these
operations required separate callWrapper invocations. Compared to callWrapper
invocations, actions require no extra IPC/RPC, reducing costs and eliminating
a potential source of errors.
Finalize lifetime memory can be used to support finalize actions: Sections with
finalize lifetime should be destroyed by memory managers immediately after
finalization actions have been run. Finalize memory can be used to support
finalize actions (e.g. with extra-metadata, or synthesized finalize actions)
without incurring permanent memory overhead.
In SimpleRemoteEPC, calls to from callWrapperAsync to sendMessage may fail.
The handlers may or may not be sent failure messages by handleDisconnect,
depending on when that method is run. This patch adds a check for an un-failed
handler, and if it finds one sends it a failure message.
Returned out-of-band errors should be wrapped as llvm::Errors and passed to the
SendDeserializedResult function. Failure to do this results in an assertion when
we try to deserialize from the WrapperFunctionResult while it's in the
out-of-band error state.
TypeSwitch.h is used pervasively in MLIR and often has dozens of types switched
over. It uses "zero cost" variadic templates to implement the dispatching
mechanism... which isn't zero cost in debug builds, and which causes a massive
problem for actually debugging things that use it - you get dozens of nonsense
frames in the debugger for simple things like a visitor.
Fix this by marking the key method in TypeSwitch as nodebug + alwaysinline.
This resolves LLVM PR49301
Differential Revision: https://reviews.llvm.org/D111520
On the controller-side, handle `Hangup` messages from the executor. The executor passed `Error::success()` or a failure message as payload.
Hangups cause an immediate disconnect of the transport layer. The disconnect function may be called later again and so implementations should be prepared. `FDSimpleRemoteEPCTransport::disconnect()` already has a flag to check that:
cd1bd95d87/llvm/lib/ExecutionEngine/Orc/Shared/SimpleRemoteEPCUtils.cpp (L112)
Reviewed By: lhames
Differential Revision: https://reviews.llvm.org/D111527
This reverts commits f9aba9a5af and
035217ff51.
As explained in the original commit message, this didn't have the
intended effect of improving the common LLDB use case, but still
provided a marginal improvement for the places where LLDB creates a
scoped time with a string literal.
The reason for the revert is that this change pulls in the os/signpost.h
header in Signposts.h. The former transitively includes loader.h, which
contains a series of macro defines that conflict with MachO.h. There are
ways to work around that, but Adrian and I concluded that none of them
are worth the trade-off in complicating Signposts.h even further.
As a brief reminder, an "exit count" is the number of times the backedge executes before some event. It can be zero if we exit before the backedge is reached. A "trip count" is the number of times the loop header is entered if we branch into the loop. In general, TC = BTC + 1 and thus a zero trip count is ill defined
There is a cornercases which we don't handle well. Let's assume i8 for our examples to keep things simple. If BTC = 255, then the correct trip count is 256. However, 256 is not representable in i8.
In theory, code which needs to reason about trip counts is responsible for checking for this cornercase, and either bailing out, or handling it correctly. Historically, we don't have a great track record about actually doing so.
When reviewing D109676, I found myself asking a basic question. Was there any good reason to preserve the current wrap-to-zero behavior when converting from backedge taken counts to trip counts? After reviewing existing code, I could not find a single case which appears to correctly and precisely handle the overflow case.
This patch changes the default behavior to extend instead of wrap. That is, if the result might be 256, we return a value of i9 type to ensure we interpret the count correctly. I did leave the legacy behavior as an option since a) loop-flatten stops triggering if I extend due to weirdly specific pattern matching I didn't understand and b) we could reasonably use the mode if we'd externally established a lack of overflow.
I want to emphasize that this change is *not* NFC. There are two call sites (one in ScalarEvolution.cpp, one in LoopCacheAnalysis.cpp) which are switched to the extend semantics. The former appears imprecise (but correct) for a constant 255 BTC. The later appears incorrect, though I don't have a test case.
Differential Revision: https://reviews.llvm.org/D110587
armv9-a, armv9.1-a and armv9.2-a can be targeted using the -march option
both in ARM and AArch64.
- Armv9-A maps to Armv8.5-A.
- Armv9.1-A maps to Armv8.6-A.
- Armv9.2-A maps to Armv8.7-A.
- The SVE2 extension is enabled by default on these architectures.
- The cryptographic extensions are disabled by default on these
architectures.
The Armv9-A architecture is described in the Arm® Architecture Reference
Manual Supplement Armv9, for Armv9-A architecture profile
(https://developer.arm.com/documentation/ddi0608/latest).
Reviewed By: SjoerdMeijer
Differential Revision: https://reviews.llvm.org/D109517
Adds LLVMOrcCreateStaticLibrarySearchGeneratorForPath and
LLVMOrcCreateDynamicLibrarySearchGeneratorForPath functions to create generators
for static and dynamic libraries.
Reviewed By: lhames
Differential Revision: https://reviews.llvm.org/D108535
The Object library currently has three identical functions that translate a
Twine into a parser error. Until recently these functions have coexisted
peacefully, but since D110320 Clang with enabled modules is now diagnosing that
we have several definitions of `createError` in Object.
This patch just merges them all and puts them into Object's `Error.h` where the
error code for `parse_failed` is also defined which seems cleaner and unbreaks
the bots.
Reviewed By: jhenderson
Differential Revision: https://reviews.llvm.org/D111541
This patch adds further support for vectorisation of loops that involve
selecting an integer value based on a previous comparison. Consider the
following C++ loop:
int r = a;
for (int i = 0; i < n; i++) {
if (src[i] > 3) {
r = b;
}
src[i] += 2;
}
We should be able to vectorise this loop because all we are doing is
selecting between two states - 'a' and 'b' - both of which are loop
invariant. This just involves building a vector of values that contain
either 'a' or 'b', where the final reduced value will be 'b' if any lane
contains 'b'.
The IR generated by clang typically looks like this:
%phi = phi i32 [ %a, %entry ], [ %phi.update, %for.body ]
...
%pred = icmp ugt i32 %val, i32 3
%phi.update = select i1 %pred, i32 %b, i32 %phi
We already detect min/max patterns, which also involve a select + cmp.
However, with the min/max patterns we are selecting loaded values (and
hence loop variant) in the loop. In addition we only support certain
cmp predicates. This patch adds a new pattern matching function
(isSelectCmpPattern) and new RecurKind enums - SelectICmp & SelectFCmp.
We only support selecting values that are integer and loop invariant,
however we can support any kind of compare - integer or float.
Tests have been added here:
Transforms/LoopVectorize/AArch64/sve-select-cmp.ll
Transforms/LoopVectorize/select-cmp-predicated.ll
Transforms/LoopVectorize/select-cmp.ll
Differential Revision: https://reviews.llvm.org/D108136
The callWrapperAsync and callSPSWrapperAsync methods take a handler object
that is run on the return value of the call when it is ready. The new RunPolicy
parameters allow clients to control how these handlers are run. If no policy is
specified then the handler will be packaged as a GenericNamedTask and dispatched
using the ExecutorProcessControl's TaskDispatch member. Callers can use the
ExecutorProcessControl::RunInPlace policy to cause the handler to be run
directly instead, which may be preferrable for simple handlers, or they can
write their own policy object (e.g. to dispatch as some other kind of Task,
rather than GenericNamedTask).
ExecutorProcessControl objects will now have a TaskDispatcher member which
should be used to dispatch work (in particular, handling incoming packets in
the implementation of remote EPC implementations like SimpleRemoteEPC).
The GenericNamedTask template can be used to wrap function objects that are
callable as 'void()' (along with an optional name to describe the task).
The makeGenericNamedTask functions can be used to create GenericNamedTask
instances without having to name the function object type.
In a future patch ExecutionSession will be updated to use the
ExecutorProcessControl's dispatcher, instead of its DispatchTaskFunction.
The callee address is now the first parameter and the 'SendResult' function
the second. This change improves consistentency with the non-async functions
where the callee is the first address and the return value the second.
This adds the `--dump-blockinfo` flag to `llvm-bcanalyzer`, allowing a sufficiently motivated user to dump (parts of) the `BLOCKINFO_BLOCK` block. The default behavior is unchanged, and `--dump-blockinfo` only takes effect in the same context as other flags that control dump behavior (i.e., requires that `--dump` is also passed).
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D107536