This implementation relies on storing data in registers for sizes up to 128B.
Then depending on whether `dst` is less (resp. greater) than `src` we move data forward (resp. backward) by chunks of 32B.
We first make sure one of the pointers is aligned to increase performance on large move sizes.
Differential Revision: https://reviews.llvm.org/D114637