Commit Graph

205 Commits

Author SHA1 Message Date
Xiang1 Zhang d4bdeca576 [X86] Support AMX fast register allocation
Differential Revision: https://reviews.llvm.org/D100026
2021-05-08 14:21:11 +08:00
Xiang1 Zhang bebafe01a7 Revert "[X86] Support AMX fast register allocation"
This reverts commit 77e2e5e07d.
2021-05-08 13:43:32 +08:00
Xiang1 Zhang 77e2e5e07d [X86] Support AMX fast register allocation 2021-05-08 13:27:21 +08:00
Benjamin Kramer df323ba445 Revert "[X86] Support AMX fast register allocation"
This reverts commit 3b8ec86fd5.

Revert "[X86] Refine AMX fast register allocation"

This reverts commit c3f95e9197.

This pass breaks using LLVM in a multi-threaded environment by
introducing global state.
2021-04-29 18:56:33 +02:00
Xiang1 Zhang 3b8ec86fd5 [X86] Support AMX fast register allocation
Differential Revision: https://reviews.llvm.org/D100026
2021-04-25 09:45:41 +08:00
Bing1 Yu 4f198b0c27 [X86] Pass to transform amx intrinsics to scalar operation.
This pass runs in all situations, but we skip it when the optimization
level is above O0 and the function doesn't have the optnone attribute.
With -O0, the def of the shape for the AMX intrinsics is near the AMX
intrinsics code. We are not able to find a point which post-dominates
all the shape definitions and dominates all AMX intrinsics. To decouple
the dependency on the shape, we transform the AMX intrinsics to scalar
operations, so that compilation doesn't fail. In the long term, we
should improve fast register allocation to allocate AMX registers.

Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D93594
2021-03-16 10:40:22 +08:00
Simon Pilgrim 3fd2fa1220 Revert rG8198d83965ba4b9db6922b44ef3041030b2bac39: "[X86] Pass to transform amx intrinsics to scalar operation."
This reverts commit 8198d83965ba4b9db6922b44ef3041030b2bac39 due to buildbot breakages.
2021-03-05 11:09:14 +00:00
Luo, Yuanke 8198d83965 [X86] Pass to transform amx intrinsics to scalar operation.
This pass runs in all situations, but we skip it when the optimization
level is above O0 and the function doesn't have the optnone attribute.
With -O0, the def of the shape for the AMX intrinsics is near the AMX
intrinsics code. We are not able to find a point which post-dominates
all the shape definitions and dominates all AMX intrinsics. To decouple
the dependency on the shape, we transform the AMX intrinsics to scalar
operations, so that compilation doesn't fail. In the long term, we
should improve fast register allocation to allocate AMX registers.

Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D93594
2021-03-05 16:02:02 +08:00
Luo, Yuanke 8f48ddd193 [X86][AMX] Lower tile copy instruction.
Since there is no tile copy instruction, we need to store the tile
register to the stack and load from the stack into another tile
register. We need an extra GR to hold the stride, and we need a stack
slot to hold the tile data. We run this pass after copy propagation, so
that we don't miss copy optimizations, and before prolog/epilog
insertion, so that we can allocate the stack slot.

Differential Revision: https://reviews.llvm.org/D97112
2021-02-23 07:49:42 +08:00
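For the tile copy lowering above, a rough sketch of the expansion (register choices, stack addressing, and the stride value are illustrative, not taken from the patch):

  movq       $64, %rax                # hold the stride in a spare GR
  tilestored %tmm0, (%rsp,%rax)       # spill the source tile to a stack slot
  tileloadd  (%rsp,%rax), %tmm1       # reload it into the destination tile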
Luo, Yuanke f80b29878b [X86] AMX programming model.
This patch implements the AMX programming model discussed on llvm-dev
(http://lists.llvm.org/pipermail/llvm-dev/2020-August/144302.html).
Thanks to Hal for the good suggestion on the RA. The fast RA is not in the patch yet.
This patch implements 7 components.

1. The C interface for end users.
2. The AMX intrinsics in LLVM IR.
3. Transform load/store <256 x i32> to AMX intrinsics or split the
   type into two <128 x i32>.
4. The lowering from AMX intrinsics to AMX pseudo instructions.
5. Insert pseudo ldtilecfg and build the def-use chain from ldtilecfg to
   the AMX instructions.
6. The register allocation for tile registers.
7. Morph AMX pseudo instructions into real AMX instructions.

Change-Id: I935e1080916ffcb72af54c2c83faa8b2e97d5cb0

Differential Revision: https://reviews.llvm.org/D87981
2020-12-10 17:01:54 +08:00
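As a sketch of the code shape this programming model ultimately produces after step 7 (tile numbers, registers, and operand ordering are illustrative):

  ldtilecfg  (%rdi)                   # program the tile shapes from a 64-byte config
  tileloadd  (%rsi,%r8), %tmm0        # load tile A
  tileloadd  (%rdx,%r8), %tmm1        # load tile B
  tdpbssd    %tmm1, %tmm0, %tmm2      # dot-product accumulate into tile C
  tilestored %tmm2, (%rcx,%r8)        # store tile C
  tilerelease                         # return the tile state to init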
Roman Lebedev e1dd212c87 [X86] Remove disabled miscompiling X86CondBrFolding pass
As briefly discussed on IRC with @craig.topper,
the pass has been disabled essentially since its original introduction (Nov 2018)
due to known correctness issues (miscompilations),
and there hasn't been much work done to fix that.

While I won't promise that I will "fix" the pass,
I have looked at it previously, and I'm sure I won't try to fix it
if that requires actually fixing this existing code.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D84775
2020-07-28 23:35:04 +03:00
Zola Bridges dfabffb195 [x86][lvi][seses] Use SESES at O0 for LVI mitigation
Use SESES as the fallback at O0 where the optimized LVI pass isn't desired due
to its effect on build times at O0.

I updated the LVI tests since this changes the code gen for the tests touched in the parent revision.

This is a follow up to the comments I made here: https://reviews.llvm.org/D80964

Hopefully we can continue the discussion here.

Also updated SESES to handle LFENCE instructions properly instead of adding
redundant LFENCEs. In particular, 1) no longer add an LFENCE if the current
instruction being processed is an LFENCE, and 2) no longer add an LFENCE if the
instruction right before the instruction being processed is an LFENCE.

Reviewed By: sconstab

Differential Revision: https://reviews.llvm.org/D82037
2020-07-07 11:05:09 -07:00
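A minimal sketch of the redundancy being avoided above (instructions illustrative): an LFENCE is only inserted when one is not already adjacent.

  lfence                              # fence already present (or just inserted)
  movl  (%rdi), %eax                  # no additional LFENCE is inserted before this load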
Scott Constable 7e06cf0011 [X86] Add an Unoptimized Load Value Injection (LVI) Load Hardening Pass
@nikic raised an issue on D75936 that the added complexity to the O0 pipeline was causing noticeable slowdowns for `-O0` builds. This patch addresses the issue by adding a pass with equal security properties, but without any optimizations (and more importantly, without the need for expensive analysis dependencies).

Reviewers: nikic, craig.topper, mattdr

Reviewed By: craig.topper, mattdr

Differential Revision: https://reviews.llvm.org/D80964
2020-06-10 15:31:47 -07:00
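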
Scott Constable e97a3e5d9d [X86] Add a Pass that builds a Condensed CFG for Load Value Injection (LVI) Gadgets
Adds a new data structure, ImmutableGraph, and uses RDF to find LVI gadgets and add them to a MachineGadgetGraph.

More specifically, a new X86 machine pass finds Load Value Injection (LVI) gadgets consisting of a load from memory (i.e., SOURCE), and any operation that may transmit the value loaded from memory over a covert channel, or use the value loaded from memory to determine a branch/call target (i.e., SINK).

Also adds a new target feature to X86: +lvi-load-hardening

The feature can be added via the clang CLI using -mlvi-hardening.

Differential Revision: https://reviews.llvm.org/D75936
2020-05-11 13:08:35 -07:00
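A minimal sketch of the SOURCE/SINK shape the gadget analysis above looks for (registers illustrative):

  movq  (%rdi), %rax                  # SOURCE: value loaded from memory
  movq  (%rax), %rcx                  # SINK: loaded value used as an address (transmitter)
  callq *%rax                         # SINK: loaded value determines a call target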
Zola Bridges bf95cf4a68 [x86][seses] Introduce SESES pass for LVI
This is an implementation of Speculative Execution Side Effect
Suppression which is intended as a last resort mitigation against Load
Value Injection, LVI, a newly disclosed speculative execution side
channel vulnerability.

One pager:
https://software.intel.com/security-software-guidance/software-guidance/load-value-injection

Deep dive:
https://software.intel.com/security-software-guidance/insights/deep-dive-load-value-injection

The mitigation consists of a compiler pass that inserts an LFENCE before
each memory read instruction, memory write instruction, and the first
branch instruction in a group of terminators at the end of a basic
block. The goal is to prevent speculative execution, potentially based
on misspeculated conditions and/or containing secret data, from leaking
that data via side channels embedded in such instructions.

This is something of a last-resort mitigation: it is expected to have
extreme performance implications, and it may not be a complete mitigation
because it relies on enumerating the relevant side channels.

In addition to the full version of the mitigation, this patch
implements three flags to turn off part of the mitigation. These flags
are disabled by default. The flags are not intended to result in a
secure variant of the mitigation. The flags are intended to be used by
users who would like to experiment with improving the performance of
the mitigation. I ran benchmarks with each of these flags enabled in
order to find if there was any room for further optimization of LFENCE
placement with respect to LVI.

Performance Testing Results

When applying this mitigation to BoringSSL, we see the following
results. These are a summary/aggregation of the performance changes when
this mitigation is applied versus when no mitigation is applied.

Fully Mitigated vs Baseline
  Geometric mean: 0.071 (Note: This can be read as the ops/s of the mitigated
                  program was 7.1% of the ops/s of the unmitigated program.)
  Minimum:        0.041
  Quartile 1:     0.060
  Median:         0.063
  Quartile 3:     0.077
  Maximum:        0.230

Reviewed By: george.burgess.iv

Differential Revision: https://reviews.llvm.org/D75939
2020-05-11 09:34:37 -07:00
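A sketch of what the full SESES instrumentation described above looks like (instructions and labels illustrative): a fence before each memory read, each memory write, and the first branch in the block's terminator group.

  lfence
  movl  (%rdi), %eax                  # memory read
  lfence
  movl  %eax, (%rsi)                  # memory write
  lfence
  jne   .LBB0_3                       # first branch among the terminators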
Craig Topper 52a6d47ada [X86] Add initialize function for X86FixupSetCC so that it will show up in print-after-all. 2020-04-28 23:31:34 -07:00
Simon Pilgrim 18bf42a86c X86.h - remove unused forward declarations. NFC. 2020-04-19 14:28:52 +01:00
Craig Topper 1d42c0db9a Revert "[X86] Add a Pass that builds a Condensed CFG for Load Value Injection (LVI) Gadgets"
This reverts commit c74dd640fd.

Reverting to address coding standard issues raised in post-commit
review.
2020-04-03 16:56:08 -07:00
Scott Constable c74dd640fd [X86] Add a Pass that builds a Condensed CFG for Load Value Injection (LVI) Gadgets
Adds a new data structure, ImmutableGraph, and uses RDF to find LVI gadgets and add them to a MachineGadgetGraph.

More specifically, a new X86 machine pass finds Load Value Injection (LVI) gadgets consisting of a load from memory (i.e., SOURCE), and any operation that may transmit the value loaded from memory over a covert channel, or use the value loaded from memory to determine a branch/call target (i.e., SINK).

Also adds a new target feature to X86: +lvi-load-hardening

The feature can be added via the clang CLI using -mlvi-hardening.

Differential Revision: https://reviews.llvm.org/D75936
2020-04-03 13:02:04 -07:00
Scott Constable f95a67d8b8 [X86] Add RET-hardening Support to mitigate Load Value Injection (LVI)
Adding a pass that replaces every ret instruction with the sequence:

pop <scratch-reg>
lfence
jmp *<scratch-reg>

where <scratch-reg> is some available scratch register, according to the
calling convention of the function being mitigated.

Differential Revision: https://reviews.llvm.org/D75935
2020-04-03 12:08:34 -07:00
Scott Constable 71e8021d82 [X86][NFC] Generalize the naming of "Retpoline Thunks" and related code to "Indirect Thunks"
There are applications for indirect call/branch thunks other than retpoline for Spectre v2, e.g.,

https://software.intel.com/security-software-guidance/software-guidance/load-value-injection

Therefore it makes sense to refactor X86RetpolineThunks as a more general capability.

Differential Revision: https://reviews.llvm.org/D76810
2020-04-02 21:55:13 -07:00
Craig Topper 9f7d4150b9 [X86] Move combineLoopMAddPattern and combineLoopSADPattern to an IR pass before SelectionDAG.
These transforms rely on a vector reduction flag on the SDNode
set by SelectionDAGBuilder. This flag exists because SelectionDAG
can't see across basic blocks so SelectionDAGBuilder is looking
across and saving the info. X86 is the only target that uses this
flag currently. By removing the X86 code we can remove the flag
and the SelectionDAGBuilder code.

This patch adds a dedicated IR pass for X86 that looks across the
blocks and transforms the IR into a form that the X86 SelectionDAG
can finish.

An advantage of this new approach is that we can enhance it to
shrink the phi nodes and final reduction tree based on the zeroes
that we need to concatenate to bring the partially reduced
reduction back up to the original width.

Differential Revision: https://reviews.llvm.org/D76649
2020-03-26 14:10:20 -07:00
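For context, the multiply-accumulate reductions this pass recognizes are the ones that can eventually be selected to x86 code along these lines (a sketch, not the pass's actual output):

  pmaddwd %xmm1, %xmm0                # 16-bit multiplies, adjacent pairs summed into 32-bit lanes
  paddd   %xmm0, %xmm2                # accumulate into the partial reduction vector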
Reid Kleckner 5ff5ddd0ad [Win64] Insert int3 into trailing empty BBs
Otherwise, the Win64 unwinder considers direct branches to such empty
trailing BBs to be a branch out of the function. It treats such a branch
as a tail call, which can only be part of an epilogue. If the unwinder
misclassifies such a branch as part of the epilogue, it will fail to
unwind the stack further. This can lead to bad stack traces, or failure
to handle exceptions properly. This is described in
https://llvm.org/PR45064#c4, and by the comment at the top of the
X86AvoidTrailingCallPass.cpp file.

It should be safe to insert int3 for such blocks. An empty trailing BB
that reaches this pass is pretty much guaranteed to be unreachable.  If
a program executed such a block, it would fall off the end of the
function.

Most of the complexity in this patch comes from threading through the
"EHFuncletEntry" boolean on the MIRParser and registering the pass so we
can stop and start codegen around it. I used an MIR test because we
should teach LLVM to optimize away these branches as a follow-up.

Reviewed By: hans

Differential Revision: https://reviews.llvm.org/D76531
2020-03-23 08:50:37 -07:00
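A sketch of the result described above (labels illustrative): the otherwise-empty trailing block gets an int3, so the direct branch to it no longer looks like a tail call to the Win64 unwinder.

  testl %eax, %eax
  jne   .LBB0_2                       # direct branch to the trailing block
  retq
.LBB0_2:                              # previously empty, unreachable in practice
  int3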
Liu, Chen3 8fdafb7dce Insert wait instruction after X87 instructions which could raise
floating-point exceptions.

This patch also modifies some mayRaiseFPException flags which were set in D68854.

Differential Revision: https://reviews.llvm.org/D72750
2020-01-16 12:12:51 +08:00
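A sketch of the insertion (operands illustrative): a wait after an x87 instruction that may raise a floating-point exception forces any pending exception to be reported at that point.

  flds  (%rdi)                        # x87 load
  fadds (%rsi)                        # may raise a floating-point exception
  wait                                # inserted: report the pending exception here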
Amy Huang 9e978bb01c Add support for lowering 32-bit/64-bit pointers
Summary:
This follows a previous patch that changes the X86 datalayout to represent
mixed size pointers (32-bit sext, 32-bit zext, and 64-bit) with address spaces
(https://reviews.llvm.org/D64931)

This patch implements the address space cast lowering to the corresponding
sign extension, zero extension, or truncate instructions.

Related to https://bugs.llvm.org/show_bug.cgi?id=42359

Reviewers: rnk, craig.topper, RKSimon

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D69639
2019-12-04 11:39:03 -08:00
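A sketch of what these address space casts lower to (register choices illustrative): a 32-bit signed pointer widens via sign extension, a 32-bit unsigned pointer via zero extension, and narrowing keeps only the low half.

  movslq %edi, %rax                   # 32-bit sext pointer -> 64-bit
  movl   %edi, %eax                   # 32-bit zext pointer -> 64-bit (upper bits cleared)
  movl   %edi, %eax                   # 64-bit -> 32-bit: truncate to the low 32 bits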
Reid Kleckner bf02399a85 [Windows] Replace TrapUnreachable with an int3 insertion pass
This is an alternative to D66980, which was reverted. Instead of
inserting a pseudo instruction that optionally expands to nothing, add a
pass that inserts int3 when appropriate after basic block layout.

Reviewers: hans

Differential Revision: https://reviews.llvm.org/D67201

llvm-svn: 371466
2019-09-09 23:04:25 +00:00
Pengfei Wang 7630e24492 [X86] Making X86OptimizeLEAs pass public. NFC
Reviewers: wxiao3, LuoYuanke, andrew.w.kaylor, craig.topper, annita.zhang, liutianle, pengfei, xiangzhangllvm, RKSimon, spatel, andreadb

Reviewed By: RKSimon

Subscribers: andreadb, hiraditya, llvm-commits

Tags: #llvm

Patch by Gen Pei (gpei)

Differential Revision: https://reviews.llvm.org/D65933

llvm-svn: 369612
2019-08-22 02:29:27 +00:00
Simon Pilgrim 3a8c698771 [X86] Alphabetize pass initialization definitions. NFCI.
llvm-svn: 369126
2019-08-16 16:41:38 +00:00
Tom Stellard f335672218 X86: Clean up pass initialization
Summary:
- Remove redundant initializations from pass constructors that were
  already being initialized by LLVMInitializeX86Target().

- Add initialization function for the FPS pass.

Reviewers: craig.topper

Reviewed By: craig.topper

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D63218

llvm-svn: 363221
2019-06-13 02:09:32 +00:00
Francis Visoiu Mistrih ab051a378c [X86] Preserve operand flag when expanding TCRETURNri
The expansion of TCRETURNri(64) would not keep operand flags like
undef/renamable/etc. which can result in machine verifier issues.

Also add plumbing to be able to use `-run-pass=x86-pseudo`.

llvm-svn: 357808
2019-04-05 20:18:21 +00:00
Vlad Tsyrklevich 2e1479e2f2 Delete x86_64 ShadowCallStack support
Summary:
ShadowCallStack on x86_64 suffered from the same racy security issues as
Return Flow Guard and had performance overhead as high as 13% depending
on the benchmark. x86_64 ShadowCallStack was always an experimental
feature and never shipped a runtime required to support it, as such
there are no expected downstream users.

Reviewers: pcc

Reviewed By: pcc

Subscribers: mgorny, javed.absar, hiraditya, jdoerfert, cfe-commits, #sanitizers, llvm-commits

Tags: #clang, #sanitizers, #llvm

Differential Revision: https://reviews.llvm.org/D59034

llvm-svn: 355624
2019-03-07 18:56:36 +00:00
Chandler Carruth 2946cd7010 Update the file headers across all of the LLVM projects in the monorepo
to reflect the new license.

We understand that people may be surprised that we're moving the header
entirely to discuss the new license. We checked this carefully with the
Foundation's lawyer and we believe this is the correct approach.

Essentially, all code in the project is now made available by the LLVM
project under our new license, so you will see that the license headers
include that license only. Some of our contributors have contributed
code under our old license, and accordingly, we have retained a copy of
our old license notice in the top-level files in each project and
repository.

llvm-svn: 351636
2019-01-19 08:50:56 +00:00
Craig Topper ba3ab78291 [X86] Initialize and Register X86CondBrFoldingPass
Allow X86CondBrFoldingPass to be run with the --run-pass option; this makes it possible to test a wrong assertion in the analyzeCompare function for SUB32ri when its operand is not an immediate.

Patch by Jianping Chen

Differential Revision: https://reviews.llvm.org/D55412

llvm-svn: 348620
2018-12-07 18:10:34 +00:00
Mircea Trofin f1a49e8525 Revert "Revert r347596 "Support for inserting profile-directed cache prefetches""
Summary:
This reverts commit d8517b96dfbd42e6a8db33c50d1fa1e58e63fbb9.

Fix: correct  the use of DenseMap.

Reviewers: davidxl, hans, wmi

Reviewed By: wmi

Subscribers: mgorny, eraman, llvm-commits

Differential Revision: https://reviews.llvm.org/D55088

llvm-svn: 347938
2018-11-30 01:01:52 +00:00
Hans Wennborg 6e3be9d12e Revert r347596 "Support for inserting profile-directed cache prefetches"
It causes asserts building BoringSSL. See https://crbug.com/91009#c3 for
repro.

This also reverts the follow-ups:
Revert r347724 "Do not insert prefetches with unsupported memory operands."
Revert r347606 "[X86] Add dependency from X86 to ProfileData after rL347596"
Revert r347607 "Add new passes to X86 pipeline tests"

llvm-svn: 347864
2018-11-29 13:58:02 +00:00
Mircea Trofin cfbc1788d6 Support for inserting profile-directed cache prefetches
Summary:
Support for profile-driven cache prefetching (X86)

This change is part of a larger system, consisting of a cache prefetches recommender, create_llvm_prof (https://github.com/google/autofdo), and LLVM.

A proof of concept recommender is DynamoRIO's cache miss analyzer. It processes memory access traces obtained from a running binary and identifies patterns in cache misses. Based on them, it produces a csv file with recommendations. The expectation is that, by leveraging such recommendations, we can reduce the amount of clock cycles spent waiting for data from memory. A microbenchmark based on the DynamoRIO analyzer is available as a proof of concept: https://goo.gl/6TM2Xp.

The recommender makes prefetch recommendations in terms of:

* the binary offset of an instruction with a memory operand;
* a delta;
* and a type (nta, t0, t1, t2)

meaning: a prefetch of that type should be inserted right before the instruction at that binary offset, and the prefetch should be for an address delta away from the memory address the instruction will access.

For example:

0x400ab2,64,nta

and assuming the instruction at 0x400ab2 is:

movzbl (%rbx,%rdx,1),%edx

means that the recommender determined it would be beneficial for a prefetchnta instruction to be inserted right before this instruction, as such:

prefetchnta 0x40(%rbx,%rdx,1)
movzbl (%rbx, %rdx, 1), %edx

The workflow for prefetch cache instrumentation is as follows (the proof of concept script details these steps as well):

1. build binary, making sure -gmlt -fdebug-info-for-profiling is passed. The latter option will enable the X86DiscriminateMemOps pass, which ensures instructions with memory operands are uniquely identifiable (this causes ~2% size increase in total binary size due to the additional debug information).

2. collect memory traces, run analysis to obtain recommendations (see above-referenced DynamoRIO demo as a proof of concept).

3. use create_llvm_prof to convert recommendations to reference insertion locations in terms of debug info locations.

4. rebuild binary, using the exact same set of arguments used initially, to which -mllvm -prefetch-hints-file=<file> needs to be added, using the afdo file obtained at step 3.

Note that if sample profiling feedback-driven optimization is also desired, that happens before step 1 above. In this case, the sample profile afdo file that was used to produce the binary at step 1 must also be included in step 4.

The data needed by the compiler in order to identify prefetch insertion points is very similar to what is needed for sample profiles. For this reason, and given that the overall approach (memory tracing-based cache recommendation mechanisms) is under active development, we use the afdo format as a syntax for capturing this information. We avoid confusing semantics with sample profile afdo data by feeding the two types of information to the compiler through separate files and compiler flags. Should the approach prove successful, we can investigate improvements to this encoding mechanism.

Reviewers: davidxl, wmi, craig.topper

Reviewed By: davidxl, wmi, craig.topper

Subscribers: davide, danielcdh, mgorny, aprantl, eraman, JDevlieghere, llvm-commits

Differential Revision: https://reviews.llvm.org/D54052

llvm-svn: 347596
2018-11-26 21:36:18 +00:00
Matthias Braun a9f900561e X86: Consistently declare pass initializers in X86.h; NFC
This avoids declaring them twice: in X86TargetMachine.cpp and the file
implementing the pass.

llvm-svn: 345801
2018-11-01 00:38:01 +00:00
Rong Xu 3d2efdfdea Recommit r343993: [X86] condition branches folding for three-way conditional codes
Fix the memory issue exposed by sanitizer.

llvm-svn: 344085
2018-10-09 22:03:40 +00:00
Rong Xu 47fd015163 [X86] Revert r343993 condition branches folding for three-way conditional codes
Some buildbots failed.

llvm-svn: 343998
2018-10-08 22:08:43 +00:00
Rong Xu 67b1b328f7 [X86] condition branches folding for three-way conditional codes
This patch implements a pass that optimizes condition branches on x86 by
taking advantage of the three-way conditional code generated by compare
instructions.

Currently, it tries to hoist EQ and NE conditional branches to a dominant
conditional branch where the same EQ/NE condition code is
computed. An example:
bb_0:
  cmp %0, 19
  jg bb_1
  jmp bb_2
bb_1:
  cmp %0, 40
  jg bb_3
  jmp bb_4
bb_4:
  cmp %0, 20
  je bb_5
  jmp bb_6
Here we could combine the two compares in bb_0 and bb_4 and have the
following code:

bb_0:
  cmp %0, 20
  jg bb_1
  jl bb_2
  jmp bb_5
bb_1:
  cmp %0, 40
  jg bb_3
  jmp bb_6

For the case of %0 == 20 (bb_5), we eliminate two jumps, and the control height
for bb_6 is also reduced. bb_4 is gone after the optimization.

This optimization is motivated by the branch pattern generated by the switch
lowering: we always have a pivot-1 compare for the inner nodes and we do a pivot
compare against the leaf (like the above pattern).

This pass currently is enabled on Intel's Sandybridge and later arches. Some
reviewers pointed out that on some arches (like AMD Jaguar), this pass may
increase branch density to the point where it hurts the performance of the
branch predictor.

Differential Revision: https://reviews.llvm.org/D46662

llvm-svn: 343993
2018-10-08 18:52:39 +00:00
Chandler Carruth 90358e1ef1 [SLH] Introduce a new pass to do Speculative Load Hardening to mitigate
Spectre variant #1 for x86.

There is a lengthy, detailed RFC thread on llvm-dev which discusses the
high level issues. High level discussion is probably best there.

I've split the design document out of this patch and will land it
separately once I update it to reflect the latest edits and updates to
the Google doc used in the RFC thread.

This patch is really just an initial step. It isn't quite ready for
prime time and is only exposed via debugging flags. It has three major
limitations currently:
1) It only supports x86-64, and only certain ABIs. Many assumptions are
   currently hard-coded and need to be factored out of the code here.
2) It doesn't include any options for more fine-grained control, either
   of which control flow edges are significant or which loads are
   important to be hardened.
3) The code is still quite rough and the testing lighter than I'd like.

However, this is enough for people to begin using. I have had numerous
requests from people to be able to experiment with this patch to
understand the trade-offs it presents and how to use it. We would also
like to encourage work to similar effect in other toolchains.

The ARM folks are actively developing a system based on this for
AArch64. We hope to merge this with their efforts when both are far
enough along. But we also don't want to block making this available on
that effort.

Many thanks to the *numerous* people who helped along the way here. For
this patch in particular, both Eric and Craig did a ton of review to
even have confidence in it as an early, rough cut at this functionality.

Differential Revision: https://reviews.llvm.org/D44824

llvm-svn: 336990
2018-07-13 11:13:58 +00:00
Chandler Carruth 19618fc639 [x86] Introduce a pass to begin more systematically fixing PR36028 and similar issues.
The key idea is to lower COPY nodes populating EFLAGS by scanning the
uses of EFLAGS and introducing dedicated code to preserve the necessary
state in a GPR. In the vast majority of cases, these uses are cmovCC and
jCC instructions. For such cases, we can very easily save and restore
the necessary information by simply inserting a setCC into a GPR where
the original flags are live, and then testing that GPR directly to feed
the cmov or conditional branch.

However, things are a bit more tricky if arithmetic is using the flags.
This patch handles the vast majority of cases that seem to come up in
practice: adc, adcx, adox, rcl, and rcr; all without taking advantage of
partially preserved EFLAGS as LLVM doesn't currently model that at all.

There are a large number of operations that technically observe EFLAGS
currently but shouldn't in this case -- they typically are using DF.
Currently, they will not be handled by this approach. However, I have
never seen this issue come up in practice. It is already pretty rare to
have these patterns come up in practical code with LLVM. I had to resort
to writing MIR tests to cover most of the logic in this pass already.
I suspect even with its current amount of coverage of arithmetic users
of EFLAGS it will be a significant improvement over the current use of
pushf/popf. It will also produce substantially faster code in most of
the common patterns.

This patch also removes all of the old lowering for EFLAGS copies, and
the hack that forced us to use a frame pointer when EFLAGS copies were
found anywhere in a function so that the dynamic stack adjustment wasn't
a problem. None of this is needed as we now lower all of these copies
directly in MI and without requiring stack adjustments.

Lots of thanks to Reid who came up with several aspects of this
approach, and Craig who helped me work out a couple of things tripping
me up while working on this.

Differential Revision: https://reviews.llvm.org/D45146

llvm-svn: 329657
2018-04-10 01:41:17 +00:00
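A sketch of the save/restore idiom described above (registers illustrative): the condition is captured with a setCC while EFLAGS is live and rebuilt later with a test, so no pushf/popf is needed.

  cmpq    %rsi, %rdi
  setne   %al                         # capture the needed condition into a GPR
  addq    %r8, %r9                    # intervening code that clobbers EFLAGS
  testb   %al, %al                    # rebuild the condition from the GPR
  cmovneq %r10, %r11                  # feed the cmov from the restored flags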
Vlad Tsyrklevich e3446017ed Add the ShadowCallStack pass
Summary:
The ShadowCallStack pass instruments functions marked with the
shadowcallstack attribute. The instrumented prolog saves the return
address to [gs:offset] where offset is stored and updated in [gs:0].
The instrumented epilog loads/updates the return address from [gs:0]
and checks that it matches the return address on the stack before
returning.

Reviewers: pcc, vitalybuka

Reviewed By: pcc

Subscribers: cryptoad, eugenis, craig.topper, mgorny, llvm-commits, kcc

Differential Revision: https://reviews.llvm.org/D44802

llvm-svn: 329139
2018-04-04 01:21:16 +00:00
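A rough sketch of the instrumentation described above (register choices and the exact bookkeeping order are illustrative; see the patch for the real sequence):

  # prologue (after the call has pushed the return address)
  movq  %gs:0, %r10                   # current shadow-stack offset
  addq  $8, %r10
  movq  %r10, %gs:0                   # bump the offset
  movq  (%rsp), %r11                  # return address on the native stack
  movq  %r11, %gs:(%r10)              # save it on the shadow call stack
  # epilogue
  movq  %gs:0, %r10
  movq  %gs:(%r10), %r11              # reload the saved return address
  subq  $8, %r10
  movq  %r10, %gs:0                   # pop the shadow-stack entry
  cmpq  %r11, (%rsp)                  # must match the native return address
  jne   .Lscs_trap
  retq
.Lscs_trap:
  ud2                                 # mismatch: abort rather than return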
Lama Saba 927468309f [X86] Reduce Store Forward Block issues in HW - Recommit after fixing Bug 36346
If a load follows a store and reloads data that the store has written to memory, Intel microarchitectures can in many cases forward the data directly from the store to the load. This "store forwarding" saves cycles by enabling the load to obtain the data directly instead of accessing it from cache or memory.
A "store forward block" occurs in cases where a store cannot be forwarded to the load. The most typical case of a store forward block on Intel Core microarchitecture is that a small store cannot be forwarded to a large load.
The estimated penalty for a store forward block is ~13 cycles.

This pass tries to recognize and handle cases where a "store forward block" is created by the compiler when lowering memcpy calls to a sequence of a load and a store.

The pass currently only handles cases where the memcpy is lowered to XMM/YMM registers; it tries to break the memcpy into smaller copies. Breaking the memcpy should be possible since there is no atomicity guarantee for loads and stores to XMM/YMM.

Differential revision: https://reviews.llvm.org/D41330

Change-Id: Ib48836ccdf6005989f7d4466fa2035b7b04415d9
llvm-svn: 328973
2018-04-02 13:48:28 +00:00
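As a rough illustration of the splitting described above (sizes and registers illustrative): a 32-byte memcpy that would otherwise be one YMM load/store is broken into smaller copies so each piece can line up with the earlier, narrower stores.

  vmovups 0(%rsi),  %xmm0
  vmovups %xmm0,  0(%rdi)
  vmovups 16(%rsi), %xmm1
  vmovups %xmm1, 16(%rdi)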
Richard Smith ade53736b0 Revert r325128 ("[X86] Reduce Store Forward Block issues in HW").
This is causing miscompiles in some situations. See the llvm-commits thread for the commit for details.

llvm-svn: 325852
2018-02-23 01:43:46 +00:00
Lama Saba fe1016c485 [X86] Reduce Store Forward Block issues in HW - Recommit after fixing Bug 36346
If a load follows a store and reloads data that the store has written to memory, Intel microarchitectures can in many cases forward the data directly from the store to the load. This "store forwarding" saves cycles by enabling the load to obtain the data directly instead of accessing it from cache or memory.
A "store forward block" occurs in cases where a store cannot be forwarded to the load. The most typical case of a store forward block on Intel Core microarchitecture is that a small store cannot be forwarded to a large load.
The estimated penalty for a store forward block is ~13 cycles.

This pass tries to recognize and handle cases where a "store forward block" is created by the compiler when lowering memcpy calls to a sequence of a load and a store.

The pass currently only handles cases where the memcpy is lowered to XMM/YMM registers; it tries to break the memcpy into smaller copies. Breaking the memcpy should be possible since there is no atomicity guarantee for loads and stores to XMM/YMM.

Change-Id: Ic41aa9ade6512e0478db66e07e2fde41b4fb35f9
llvm-svn: 325128
2018-02-14 14:58:53 +00:00
Hans Wennborg 7e19dfc45f Revert r324835 "[X86] Reduce Store Forward Block issues in HW"
It asserts building Chromium; see PR36346.

(This also reverts the follow-up r324836.)

> If a load follows a store and reloads data that the store has written to memory, Intel microarchitectures can in many cases forward the data directly from the store to the load. This "store forwarding" saves cycles by enabling the load to obtain the data directly instead of accessing it from cache or memory.
> A "store forward block" occurs in cases where a store cannot be forwarded to the load. The most typical case of a store forward block on Intel Core microarchitecture is that a small store cannot be forwarded to a large load.
> The estimated penalty for a store forward block is ~13 cycles.
>
> This pass tries to recognize and handle cases where a "store forward block" is created by the compiler when lowering memcpy calls to a sequence of a load and a store.
>
> The pass currently only handles cases where the memcpy is lowered to XMM/YMM registers; it tries to break the memcpy into smaller copies. Breaking the memcpy should be possible since there is no atomicity guarantee for loads and stores to XMM/YMM.

llvm-svn: 324887
2018-02-12 12:43:39 +00:00
Lama Saba c2ba6c387e [X86] Reduce Store Forward Block issues in HW
If a load follows a store and reloads data that the store has written to memory, Intel microarchitectures can in many cases forward the data directly from the store to the load. This "store forwarding" saves cycles by enabling the load to obtain the data directly instead of accessing it from cache or memory.
A "store forward block" occurs in cases where a store cannot be forwarded to the load. The most typical case of a store forward block on Intel Core microarchitecture is that a small store cannot be forwarded to a large load.
The estimated penalty for a store forward block is ~13 cycles.

This pass tries to recognize and handle cases where a "store forward block" is created by the compiler when lowering memcpy calls to a sequence of a load and a store.

The pass currently only handles cases where the memcpy is lowered to XMM/YMM registers; it tries to break the memcpy into smaller copies. Breaking the memcpy should be possible since there is no atomicity guarantee for loads and stores to XMM/YMM.

Change-Id: I620b6dc91583ad9a1444591e3ddc00dd25d81748
llvm-svn: 324835
2018-02-11 09:34:12 +00:00
Chandler Carruth 0dcee4fe7a [x86] Make the retpoline thunk insertion a machine function pass.
Summary:
This removes the need for a machine module pass using some deeply
questionable hacks. This should address PR36123 which is a case where in
full LTO the memory usage of a machine module pass actually ended up
being significant.

We should revert this on trunk as soon as we understand and fix the
memory usage issue, but we should include this in any backports of
retpolines themselves.

Reviewers: echristo, MatzeB

Subscribers: sanjoy, mcrosier, mehdi_amini, hiraditya, llvm-commits

Differential Revision: https://reviews.llvm.org/D42726

llvm-svn: 323915
2018-01-31 20:56:37 +00:00
Chandler Carruth c58f2166ab Introduce the "retpoline" x86 mitigation technique for variant #2 of the speculative execution vulnerabilities disclosed today, specifically identified by CVE-2017-5715, "Branch Target Injection", which is one of the two halves of Spectre.
Summary:
First, we need to explain the core of the vulnerability. Note that this
is a very incomplete description, please see the Project Zero blog post
for details:
https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html

The basis for branch target injection is to direct speculative execution
of the processor to some "gadget" of executable code by poisoning the
prediction of indirect branches with the address of that gadget. The
gadget in turn contains an operation that provides a side channel for
reading data. Most commonly, this will look like a load of secret data
followed by a branch on the loaded value and then a load of some
predictable cache line. The attacker then uses timing of the processor's
cache to determine which direction the branch took *in the speculative
execution*, and in turn what one bit of the loaded value was. Due to the
nature of these timing side channels and the branch predictor on Intel
processors, this allows an attacker to leak data only accessible to
a privileged domain (like the kernel) back into an unprivileged domain.

The goal is simple: avoid generating code which contains an indirect
branch that could have its prediction poisoned by an attacker. In many
cases, the compiler can simply use directed conditional branches and
a small search tree. LLVM already has support for lowering switches in
this way and the first step of this patch is to disable jump-table
lowering of switches and introduce a pass to rewrite explicit indirectbr
sequences into a switch over integers.

However, there is no fully general alternative to indirect calls. We
introduce a new construct we call a "retpoline" to implement indirect
calls in a non-speculatable way. It can be thought of loosely as
a trampoline for indirect calls which uses the RET instruction on x86.
Further, we arrange for a specific call->ret sequence which ensures the
processor predicts the return to go to a controlled, known location. The
retpoline then "smashes" the return address pushed onto the stack by the
call with the desired target of the original indirect call. The result
is a predicted return to the next instruction after a call (which can be
used to trap speculative execution within an infinite loop) and an
actual indirect branch to an arbitrary address.

On 64-bit x86 ABIs, this is especially easily done in the compiler by
using a guaranteed scratch register to pass the target into this device.
For 32-bit ABIs there isn't a guaranteed scratch register and so several
different retpoline variants are introduced to use a scratch register if
one is available in the calling convention and to otherwise use direct
stack push/pop sequences to pass the target address.

This "retpoline" mitigation is fully described in the following blog
post: https://support.google.com/faqs/answer/7625886

We also support a target feature that disables emission of the retpoline
thunk by the compiler to allow for custom thunks if users want them.
These are particularly useful in environments like kernels that
routinely do hot-patching on boot and want to hot-patch their thunk to
different code sequences. They can write this custom thunk and use
`-mretpoline-external-thunk` *in addition* to `-mretpoline`. In this
case, on x86-64 the thunk names must be:
```
  __llvm_external_retpoline_r11
```
or on 32-bit:
```
  __llvm_external_retpoline_eax
  __llvm_external_retpoline_ecx
  __llvm_external_retpoline_edx
  __llvm_external_retpoline_push
```
And the target of the retpoline is passed in the named register, or in
the case of the `push` suffix on the top of the stack via a `pushl`
instruction.

There is one other important source of indirect branches in x86 ELF
binaries: the PLT. These patches also include support for LLD to
generate PLT entries that perform a retpoline-style indirection.

The only other indirect branches remaining that we are aware of are from
precompiled runtimes (such as crt0.o and similar). The ones we have
found are not really attackable, and so we have not focused on them
here, but eventually these runtimes should also be replicated for
retpoline-ed configurations for completeness.

For kernels or other freestanding or fully static executables, the
compiler switch `-mretpoline` is sufficient to fully mitigate this
particular attack. For dynamic executables, you must compile *all*
libraries with `-mretpoline` and additionally link the dynamic
executable and all shared libraries with LLD and pass `-z retpolineplt`
(or use similar functionality from some other linker). We strongly
recommend also using `-z now` as non-lazy binding allows the
retpoline-mitigated PLT to be substantially smaller.

When manually applying transformations similar to `-mretpoline` to the
Linux kernel we observed very small performance hits to applications
running typical workloads, and relatively minor hits (approximately 2%)
even for extremely syscall-heavy applications. This is largely due to
the small number of indirect branches that occur in performance
sensitive paths of the kernel.

When using these patches on statically linked applications, especially
C++ applications, you should expect to see a much more dramatic
performance hit. For microbenchmarks that are switch, indirect-, or
virtual-call heavy we have seen overheads ranging from 10% to 50%.

However, real-world workloads exhibit substantially lower performance
impact. Notably, techniques such as PGO and ThinLTO dramatically reduce
the impact of hot indirect calls (by speculatively promoting them to
direct calls) and allow optimized search trees to be used to lower
switches. If you need to deploy these techniques in C++ applications, we
*strongly* recommend that you ensure all hot call targets are statically
linked (avoiding PLT indirection) and use both PGO and ThinLTO. Well
tuned servers using all of these techniques saw 5% - 10% overhead from
the use of retpoline.

We will add detailed documentation covering these components in
subsequent patches, but wanted to make the core functionality available
as soon as possible. Happy for more code review, but we'd really like to
get these patches landed and backported ASAP for obvious reasons. We're
planning to backport this to both 6.0 and 5.0 release streams and get
a 5.0 release with just this cherry picked ASAP for distros and vendors.

This patch is the work of a number of people over the past month: Eric, Reid,
Rui, and myself. I'm mailing it out as a single commit due to the time
sensitive nature of landing this and the need to backport it. Huge thanks to
everyone who helped out here, and everyone at Intel who helped out in
discussions about how to craft this. Also, credit goes to Paul Turner (at
Google, but not an LLVM contributor) for much of the underlying retpoline
design.

Reviewers: echristo, rnk, ruiu, craig.topper, DavidKreitzer

Subscribers: sanjoy, emaste, mcrosier, mgorny, mehdi_amini, hiraditya, llvm-commits

Differential Revision: https://reviews.llvm.org/D41723

llvm-svn: 323155
2018-01-22 22:05:25 +00:00
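To make the call->ret construction above concrete, a sketch of the well-known retpoline thunk shape (shown for the r11 variant; label names and exact layout are illustrative and the in-tree thunks may differ in detail):

__llvm_external_retpoline_r11:
  callq .Lset_up_target
.Lcapture_speculation:
  pause                               # trap speculative execution in a benign loop
  lfence
  jmp   .Lcapture_speculation
.Lset_up_target:
  movq  %r11, (%rsp)                  # smash the return address with the real target
  retq                                # predicted into the loop, actually branches to %r11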