At minimum handle the s64 insert type, which are emitted in real cases
during legalization.
We really need TableGen to emit something to emit something like the
inverse of composeSubRegIndices do determine the subreg index to use.
llvm-svn: 373938
Allows targets to introduce regbankselectable
pseudo-instructions. Currently the closet feature to this is an
intrinsic. However this requires creating a public intrinsic
declaration. This litters the public intrinsic namespace with
operations we don't necessarily want to expose to IR producers, and
would rather leave as private to the backend.
Use a new instruction bit. A previous attempt tried to keep using enum
value ranges, but it turned into a mess.
llvm-svn: 373937
This ensures that frame-based unwinding will continue to work when
calling a noreturn function; there is not much use having the caller's
frame pointer saved if you don't also have the caller's program counter.
Patch by James Clarke.
Differential Revision: https://reviews.llvm.org/D68542
llvm-svn: 373907
Earlier in the year intrinsics for lrint, llrint, lround and llround were
added to llvm. The constrained versions are now implemented here.
Reviewed by: andrew.w.kaylor, craig.topper, cameron.mcinally
Approved by: craig.topper
Differential Revision: https://reviews.llvm.org/D64746
llvm-svn: 373900
The GFX10-DENORM-STRICT checks were only passing by accident. Fix them
to make the test more robust in the face of scheduling or register
allocation changes.
llvm-svn: 373893
Move the erasing and iterator updating inside to match the
other slow LEA function.
I've adapted code from optTwoAddrLEA and basically rebuilt the
implementation here. We do lose the kill flags now just like
optTwoAddrLEA. This runs late enough in the pipeline that
shouldn't really be a problem.
llvm-svn: 373877
If a fp scalar is loaded and then used as both a scalar and a vector broadcast, perform the load as a broadcast and then extract the scalar for 'free' from the 0th element.
This involved switching the order of the X86ISD::BROADCAST combines so we only convert to X86ISD::BROADCAST_LOAD once all other canonicalizations have been attempted.
Adds a DAGCombinerInfo::recursivelyDeleteUnusedNodes wrapper.
Fixes PR43217
Differential Revision: https://reviews.llvm.org/D68544
llvm-svn: 373871
Summary:
The default legalization for v16i64->v16i8 tries to create a multiple stage truncate concatenating after each stage and truncating again. But avx512 implements truncates with multiple uops. So it should be better to truncate all the way to the desired element size and then concatenate the pieces using unpckl instructions. This minimizes the number of 2 uop truncates. The unpcks are all single uop instructions.
I tried to handle this by just custom splitting the v16i64->v16i8 shuffle. And hoped that the DAG combiner would leave the two halves in the state needed to make D68374 do the job for each half. This worked for the first half, but the second half got messed up. So I've implemented custom handling for v8i64->v8i8 when v8i64 needs to be split to produce the VTRUNCs directly.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68428
llvm-svn: 373864
Summary: The VSELECT splitting code tries to split a setcc input as well. But on avx512 where mask registers are well supported it should be better to just split the mask and use a single compare.
Reviewers: RKSimon, spatel, efriedma
Reviewed By: spatel
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68359
llvm-svn: 373863
Summary: It ensures that the bswap is generated even when a part of the subtree already matches a bswap transform.
Reviewers: craig.topper, efriedma, RKSimon, lebedev.ri
Subscribers: llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68250
llvm-svn: 373850
We can make use of the Zeroable mask to indicate which elements we can safely set to zero instead of creating a target shuffle mask on the fly.
This allows us to remove createTargetShuffleMask.
This is part of the work to fix PR43024 and allow us to use SimplifyDemandedElts to simplify shuffle chains - we need to get to a point where the target shuffle masks isn't adjusted by its source inputs in setTargetShuffleZeroElements but instead we cache them in a parallel Zeroable mask.
llvm-svn: 373846
As discussed on PR42025, with more complex boolean math we can end up with many truncations/extensions of the comparison results through each bitop.
This patch handles the cases introduced in combineBitcastvxi1 by pushing the sign extension through the AND/OR/XOR ops so its just the original SETCC ops that gets extended.
Differential Revision: https://reviews.llvm.org/D68226
llvm-svn: 373834
Added some tests testing urem and srem operations with a constant divisor.
Patch by TG908 (Tim Gymnich)
Differential Revision: https://reviews.llvm.org/D68421
llvm-svn: 373830
This is an omission in rL371441. Loads which happened to be unordered weren't being added to the PendingLoad set, and thus weren't be ordered w/respect to side effects which followed before the end of the block.
Included test case is how I spotted this. We had an atomic load being folded into a using instruction after a fence that load was supposed to be ordered with. I'm sure it showed up a bunch of other ways as well.
Spotted via manual inspecting of assembly differences in a corpus w/and w/o the new experimental mode. Finding this with testing would have been "unpleasant".
llvm-svn: 373814
This reverts r371177 (git commit f879c68755)
It caused PR43566 by removing empty, address-taken MachineBasicBlocks.
Such blocks may have references from blockaddress or other operands, and
need more consideration to be removed.
See the PR for a test case to use when relanding.
llvm-svn: 373805
Outlining from noreturn functions doesn't do the correct thing right now. The
outliner should respect that the caller is marked noreturn. In the event that
we have a noreturn function, and the outlined code is in tail position, the
outliner will not see that the outlined function should be tail called. As a
result, you end up with a regular call containing a return.
Fixing this requires that we check that all candidates live inside noreturn
functions. So, for the sake of correctness, don't outline from noreturn
functions right now.
Add machine-outliner-noreturn.mir to test this.
llvm-svn: 373791
InstrEmitter's virtual register handling assumes that clones are emitted
after the cloned node. Make sure this assumption actually holds.
Fixes a "Node emitted out of order - early" assertion on the testcase.
This is probably a very rare case to actually hit in practice; even
without the explicit edge, the scheduler will usually end up scheduling
the nodes in the expected order due to other constraints.
Differential Revision: https://reviews.llvm.org/D68068
llvm-svn: 373782
The immediate form of VPCMP can represent these completely. The
vpcmpgt/eq are just shorter encodings.
This patch removes the isel patterns and just swaps the opcodes
and removes the immediate in MCInstLower. This matches where we do
some other encodings tricks.
Removes over 10K bytes from the isel table.
Differential Revision: https://reviews.llvm.org/D68446
llvm-svn: 373766
We already do this for ISD::TRUNCATE, but we can do the same for X86ISD::VTRUNC
Differential Revision: https://reviews.llvm.org/D68432
llvm-svn: 373765
A set of function attributes is required in any function that uses constrained
floating point intrinsics. None of our tests use these attributes.
This patch fixes this.
These tests have been tested against the IR verifier changes in D68233.
Reviewed by: andrew.w.kaylor, cameron.mcinally, uweigand
Approved by: andrew.w.kaylor
Differential Revision: https://reviews.llvm.org/D67925
llvm-svn: 373761
Darwin platforms need the frame register to always point at a valid record even
if it's not updated in a leaf function. Backtraces are more important than one
extra GPR.
llvm-svn: 373738
We would like to split the SP adjustment to reduce the instructions in
prologue and epilogue as the following case. In this way, the offset of
the callee saved register could fit in a single store.
add sp,sp,-2032
sw ra,2028(sp)
sw s0,2024(sp)
sw s1,2020(sp)
sw s3,2012(sp)
sw s4,2008(sp)
add sp,sp,-64
Differential Revision: https://reviews.llvm.org/D68011
llvm-svn: 373688
As discussed on llvm-dev and:
https://bugs.llvm.org/show_bug.cgi?id=43542
...we have transforms that assume shift operations are legal and transforms to
use them are profitable, but that may not hold for simple targets.
In this case, the MSP430 target custom lowers shifts by repeating (many)
simpler/fixed ops. That can be avoided by keeping this code as setcc/select.
Differential Revision: https://reviews.llvm.org/D68397
llvm-svn: 373666
The IR was using a fixed 8 byte alignment, but the MIR portion was using native alignment. Since the test doesn't appear to be deliberately testing overalignment, just make the IR match the MIR.
llvm-svn: 373658
Summary:
This is follow up patch of https://reviews.llvm.org/D67595.
Adjust naming and the Commutable operands for additional patterns
to make it easier to read.
The testcase update also show that we can save some unecessary fmr as
well.
Reviewers: #powerpc, steven.zhang, hfinkel, nemanjai
Reviewed By: #powerpc, nemanjai
Subscribers: wuzish, hiraditya, kbarton, MaskRay, shchenz, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68112
llvm-svn: 373652
This patch recognizes the shuffle pattern we get from a
v8i64->v8i8 truncate when v8i64 isn't a legal type.
With VLX we can use two VTRUNCs, unpckldq, and a insert_subvector.
Diffrential Revision: https://reviews.llvm.org/D68374
llvm-svn: 373645
Register indexing 64-bit elements is possible on the SALU, but not the
VALU. Handle splitting this into two 32-bit indexes. Extend waterfall
loop handling to allow moving a range of instructions.
llvm-svn: 373638
We can still do a waterfall loop over the index if using a VGPR to
index an SGPR. The result will still be a VGPR, but we can avoid the
wide copy of the source register to a VGPR.
llvm-svn: 373637
The Hexagon code assumes there's no existing terminator when inserting its
trip count condition check.
This causes swp-stages5.ll to break. The generated code looks good to me,
it is likely a permutation. I have disabled the new codegen path to keep
everything green and will investigate along with the other 3-4 tests
that have different codegen.
Fixes expensive-checks build.
llvm-svn: 373629
During studying support for bitfield, I found an issue for
an example like the one in test offset-reloc-middle-chain.ll.
struct t1 { int c; };
struct s1 { struct t1 b; };
struct r1 { struct s1 a; };
#define _(x) __builtin_preserve_access_index(x)
void test1(void *p1, void *p2, void *p3);
void test(struct r1 *arg) {
struct s1 *ps = _(&arg->a);
struct t1 *pt = _(&arg->a.b);
int *pi = _(&arg->a.b.c);
test1(ps, pt, pi);
}
The IR looks like:
%0 = llvm.preserve.struct.access(base, ...)
%1 = llvm.preserve.struct.access(%0, ...)
%2 = llvm.preserve.struct.access(%1, ...)
using %0, %1 and %2
In this case, we need to generate three relocatiions
corresponding to chains: (%0), (%0, %1) and (%0, %1, %2).
After collecting all the chains, the current implementation
process each chain (in a map) with code generation sequentially.
For example, after (%0) is processed, the code may look like:
%0 = base + special_global_variable
// llvm.preserve.struct.access(base, ...) is delisted
// from the instruction stream.
%1 = llvm.preserve.struct.access(%0, ...)
%2 = llvm.preserve.struct.access(%1, ...)
using %0, %1 and %2
When processing chain (%0, %1), the current implementation
tries to visit intrinsic llvm.preserve.struct.access(base, ...)
to get some of its properties and this caused segfault.
This patch fixed the issue by remembering all necessary
information (kind, metadata, access_index, base) during
analysis phase, so in code generation phase there is
no need to examine the intrinsic call instructions.
This also simplifies the code.
Differential Revision: https://reviews.llvm.org/D68389
llvm-svn: 373621
Adds support to AArch64FrameLowering to allocate fixed-stack SVE objects.
The focus of this patch is purely to allow the stack frame to
allocate/deallocate space for scalable SVE objects. More dynamic
allocation (at compile-time, i.e. determining placement of SVE objects
on the stack), or resolving frame-index references that include
scalable-sized offsets, are left for subsequent patches.
SVE objects are allocated in the stack frame as a separate region below
the callee-save area, and above the alignment gap. This is done so that
the SVE objects can be accessed directly from the FP at (runtime)
VL-based offsets to benefit from using the VL-scaled addressing modes.
The layout looks as follows:
+-------------+
| stack arg |
+-------------+
| Callee Saves|
| X29, X30 | (if available)
|-------------| <- FP (if available)
| : |
| SVE area |
| : |
+-------------+
|/////////////| alignment gap.
| : |
| Stack objs |
| : |
+-------------+ <- SP after call and frame-setup
SVE and non-SVE stack objects are distinguished using different
StackIDs. The offsets for objects with TargetStackID::SVEVector should be
interpreted as purely scalable offsets within their respective SVE region.
Reviewers: thegameg, rovka, t.p.northover, efriedma, rengolin, greened
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D61437
llvm-svn: 373585
This improves broadcast load folding of i64 elements on 32-bit
targets where i64 isn't legal.
Previously we had to represent these as vXf64 vbroadcast_loads and
a bitcast to vXi64. But we didn't have any isel patterns
looking for that.
This also allows us to remove or simplify some isel patterns that
were looking for bitcasted vbroadcast_loads.
llvm-svn: 373566
When SIFixSGPRCopies attempts to fix an illegal copy from vector to
scalar register it calls moveToVALU(). A copy from an agpr to sgpr
becomes a copy from agpr to agpr, which may result in the illegal
register class at a use of this copy.
Solution is to copy it always into a vgpr. This may result in a
subsequent copy into an agpr if that is what really needed, however
should not happen too often and likely will be folded later.
The opposite situation may not happen because an sgpr is always
illegal where agpr is legal, so such user instructions may not
exist.
Differential Revision: https://reviews.llvm.org/D68358
llvm-svn: 373544
If the vselect result type needs to be split, it will try to
also try to split the condition if it happens to be a setcc.
With avx512 where k-registers are legal, its probably better
to just use a kshift to split the mask register.
llvm-svn: 373536
Store rlwinm Rx, Ry, 32, 0, 31 as rlwinm Rx, Ry, 0, 0, 31 and store
rldicl Rx, Ry, 64, 0 as rldicl Rx, Ry, 0, 0. Otherwise SH field is overflow and
fails assertion in assembly printing stage.
Differential Revision: https://reviews.llvm.org/D66991
llvm-svn: 373519
The previous code tried to do a trick where we would extract the subvector from the location we were inserting. Then xor that with the new value. Take the xored value and clear out the bits above the subvector size. Then shift that xored subvector to the insert location. And finally xor that with the original vector. Since the old subvector was used in both xors, this would leave just the new subvector at the inserted location. Since the surrounding bits had been zeroed no other bits of the original vector would be modified.
Unfortunately, if the old subvector came from undef we might aggressively propagate the undef. Then we end up with the XORs not cancelling because they aren't using the same value for the two uses of the old subvector. @bkramer gave me a case that demonstrated this, but we haven't reduced it enough to make it easily readable to see what's happening.
This patch uses a safer, but more costly approach. It isolate the bits above the insertion and bits below the insert point and ORs those together leaving 0 for the insertion location. Then widens the subvector with 0s in the upper bits, shifts it into position with 0s in the lower bits. Then we do another OR.
Differential Revision: https://reviews.llvm.org/D68311
llvm-svn: 373495
Summary:
64-bit WebAssembly (wasm64) is not specified and not supported in the
WebAssembly backend. We do have support for it in clang, however, and
we would like to keep that support because we expect wasm64 to be
specified and supported in the future. For now add an error when
trying to use wasm64 from the backend to minimize user confusion from
unexplained crashes.
Reviewers: aheejin, dschuff, sunfish
Subscribers: sbc100, jgravelle-google, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68254
llvm-svn: 373493
Summary:
Extend cachepolicy operand in the new VMEM buffer intrinsics
to supply information whether the buffer data is swizzled.
Also, propagate this information to MIR.
Intrinsics updated:
int_amdgcn_raw_buffer_load
int_amdgcn_raw_buffer_load_format
int_amdgcn_raw_buffer_store
int_amdgcn_raw_buffer_store_format
int_amdgcn_raw_tbuffer_load
int_amdgcn_raw_tbuffer_store
int_amdgcn_struct_buffer_load
int_amdgcn_struct_buffer_load_format
int_amdgcn_struct_buffer_store
int_amdgcn_struct_buffer_store_format
int_amdgcn_struct_tbuffer_load
int_amdgcn_struct_tbuffer_store
Furthermore, disable merging of VMEM buffer instructions
in SI Load/Store optimizer, if the "swizzled" bit on the instruction
is on.
The default value of the bit is 0, meaning that data in buffer
is linear and buffer instructions can be merged.
There is no difference in the generated code with this commit.
However, in the future it will be expected that front-ends
use buffer intrinsics with correct "swizzled" bit set.
Reviewers: arsenm, nhaehnle, tpr
Reviewed By: nhaehnle
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, arphaman, jfb, Petar.Avramovic, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68200
llvm-svn: 373491
This was reverted in r373454 due to breaking the expensive-checks bot.
This version addresses that by omitting the addSuccessorWithProb() call
when omitting the range check.
> Switch lowering: omit range check for bit tests when default is unreachable (PR43129)
>
> This is modeled after the same functionality for jump tables, which was
> added in r357067.
>
> Differential revision: https://reviews.llvm.org/D68131
llvm-svn: 373477
Summary:
This extends the PeelingModuloScheduleExpander to generate prolog and epilog code,
and correctly stitch uses through the prolog, kernel, epilog DAG.
The key concept in this patch is to ensure that all transforms are *local*; only a
function of a block and its immediate predecessor and successor. By defining the problem in this way
we can inductively rewrite the entire DAG using only local knowledge that is easy to
reason about.
For example, we assume that all prologs and epilogs are near-perfect clones of the
steady-state kernel. This means that if a block has an instruction that is predicated out,
we can redirect all users of that instruction to that equivalent instruction in our
immediate predecessor. As all blocks are clones, every instruction must have an equivalent in
every other block.
Similarly we can make the assumption by construction that if a value defined in a block is used
outside that block, the only possible user is its immediate successors. We maintain this
even for values that are used outside the loop by creating a limited form of LCSSA.
This code isn't small, but it isn't complex.
Enabled a bunch of testing from Hexagon. There are a couple of tests not enabled yet;
I'm about 80% sure there isn't buggy codegen but the tests are checking for patterns
that we don't produce. Those still need a bit more investigation. In the meantime we
(Google) are happy with the code produced by this on our downstream SMS implementation,
and believe it generates correct code.
Subscribers: mgorny, hiraditya, jsji, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68205
llvm-svn: 373462
Identity shuffles, of the form (0, 1, 2, 3, ...) are perfectly OK under MVE
(they essentially just become bitcasts). We were not catching that in the
existing set of what we considered legal though. On NEON, they would be covered
by vext's, but that is not generally available in MVE.
This uses ShuffleVectorInst::isIdentityMask which is a little odd to use here
but does what we want and prevents us from just rewriting what is the same
function.
Differential Revision: https://reviews.llvm.org/D68241
llvm-svn: 373446
This is modeled after the same functionality for jump tables, which was
added in r357067.
Differential revision: https://reviews.llvm.org/D68131
llvm-svn: 373431
These patterns use zmm registers for 128/256-bit compares when
the VLX instructions aren't available. Previously we only
supported registers, but as PR36191 notes we can fold broadcast
loads, but not regular loads.
llvm-svn: 373423
In principle this should behave as any other constant. However
eliminateFrameIndex currently assumes a VALU use and uses a vector
shift. Work around this by selecting to VGPR for now until
eliminateFrameIndex is fixed.
llvm-svn: 373415
Account and report agprs separately on gfx908. Other targets
do not change the reporting.
Differential Revision: https://reviews.llvm.org/D68307
llvm-svn: 373411
The gather/scatter instructions can implicitly sign extend the indices. If we're operating on 32-bit data, an v16i64 index can force a v16i32 gather to be split in two since the index needs 2 registers. If we can shrink the index to the i32 we can avoid the split. It should always be safe to shrink the index regardless of the number of elements. We have gather/scatter instructions that can use v2i32 index stored in a v4i32 register with v2i64 data size.
I've limited this to before legalize types to avoid creating a v2i32 after type legalization. We could check for it, but we'd also need testing. I'm also only handling build_vectors with no bitcasts to be sure the truncate will constant fold.
Differential Revision: https://reviews.llvm.org/D68247
llvm-svn: 373408
This seems to be causing some performance regresions that I'm
trying to investigate.
One thing that stands out is that this transform can increase
the live range of the operands of the earlier logic op. This
can be bad for register allocation. If there are two logic
op inputs we should really combine the one that is closest, but
SelectionDAG doesn't have a good way to do that. Maybe we need
to do this as a basic block transform in Machine IR.
llvm-svn: 373401
There are 1024 bit register classes defined for AGPRs. Additionally
OpenCL defines vectors up to 16 x i64, and this helps those tests
legalize.
llvm-svn: 373350
Summary:
This adds the ISD opcode and a DAG combine to create it. There are
probably some places where we can directly create it, but I'll
leave that for future work.
This updates all of the isel patterns to look for this new node.
I had to add a few additional isel patterns for aligned extloads
which we should probably fix with a DAG combine or something. This
does mean that the broadcast load folding for avx512 can no
longer match a broadcasted aligned extload.
There's still some work to do here for combining a broadcast of
a broadcast_load. We also need to improve extractelement or
demanded vector elements of a broadcast_load. I'll try to get
those done before I submit this patch.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68198
llvm-svn: 373349
Summary:
This patch implements Machine PostDominator Tree verification and ensures that the verification doesn't fail the in-tree tests.
MPDT verification can be enabled using `verify-machine-dom-info` -- the same flag used by Machine Dominator Tree verification.
Flipping the flag revealed that MachineSink falsely claimed to preserve CFG and MDT/MPDT. This patch fixes that.
Reviewers: arsenm, hliao, rampitec, vpykhtin, grosser
Reviewed By: hliao
Subscribers: wdng, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68235
llvm-svn: 373341
Previously the match was ambiguous and VMAXPS/PD and VMAXCPS/PD
were mapped to the same VEX instruction. But we should keep
the commutableness when change the opcode.
llvm-svn: 373303
Summary:
In CFGSort, we try to make EH pads have higher priorities as soon as
they are ready to be sorted, to prevent creation of unwind destination
mismatches in CFGStackify. We did that by making priority queues'
comparison function prefer EH pads, but it was possible for an EH pad
to be popped from `Preferred` queue and then not sorted immediately and
enter `Ready` queue instead in a certain condition. This patch makes
sure that special condition does not consider EH pads as its candidates.
Reviewers: dschuff
Subscribers: sbc100, jgravelle-google, hiraditya, sunfish, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68229
llvm-svn: 373302
Summary:
Fixing unwind mismatches for exception handling can result in splicing
existing BBs and moving some of instructions to new BBs. In this case
some of stackified def registers in the original BB can be used in the
split BB. For example, we have this BB and suppose %r0 is a stackified
register.
```
bb.1:
%r0 = call @foo
... use %r0 ...
```
After fixing unwind mismatches in CFGStackify, `bb.1` can be split and
some instructions can be moved to a newly created BB:
```
bb.1:
%r0 = call @foo
bb.split (new):
... use %r0 ...
```
In this case we should make %r0 un-stackified, because its use is now in
another BB.
When spliting a BB, this CL unstackifies all def registers that have
uses in the new split BB.
Reviewers: dschuff
Subscribers: sbc100, jgravelle-google, hiraditya, sunfish, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68218
llvm-svn: 373301
SelectionDAG has a bunch of machinery to defer this to selection time
for some reason. Just directly emit a copy during IRTranslator. The
x86 usage does somewhat questionably check hasFP, which could depend
on the whole function being at minimum translated.
This does lose the convergent bit if the callsite had it, which may be
a problem. We also lose that in general for intrinsics, which may also
be a problem.
llvm-svn: 373294
This is sort of papering over the fact that we don't run a combiner
anywhere, but avoiding creating 2 instructions in the first place is
easy.
llvm-svn: 373293
Also add a test case for an index that could be shrunk, but
would create a narrow type. We can go ahead and do it we just
need to be before type legalization.
Similar test cases for scatter as well.
llvm-svn: 373290