Summary:
Previously, we stored just the char pointer in the entry; the string it points to is a
temporary object and will already be destructed when we want to dump/access it.
A quick fix is to store a copy of the string in the entry, without changing the
upstream char*.
An alternative is to change every profilingTitle into std::string; this,
however, would need a comprehensive overhaul of the code up to the
c10d::work layer above workNCCL, RecordFunction, etc.
We chose the first option for this change.
Resolves #119808
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119837
Approved by: https://github.com/zdevito, https://github.com/wconstab
Summary:
When, during an `ExternKernel.realize_input` call, the underlying `ExternKernel.convert_to_reinterpret_view` fails, we currently fall back to `cls.copy_input` here:
31e59766e7/torch/_inductor/ir.py (L3805-L3816)
This creates a `TensorBox(StorageBox(...))` wrapped output, which causes a problem for this assertion:
31e59766e7/torch/_inductor/ir.py (L3479)
Here we add special-case handling to unwrap `x` recursively.
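A minimal sketch of the unwrapping pattern (the `TensorBox`/`StorageBox` names are the real inductor IR wrappers, but `unwrap_boxes` below is an illustrative helper, not the exact code added in this PR):
```python
from torch._inductor.ir import StorageBox, TensorBox

def unwrap_boxes(x):
    # Peel off TensorBox/StorageBox wrappers until we reach the underlying IR node,
    # so downstream assertions that expect an unwrapped value keep holding.
    while isinstance(x, (TensorBox, StorageBox)):
        x = x.data
    return x
```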
Test Plan:
This local repro:
```
import torch

@torch.compile()
def f(a, b, mat1, mat2):
    bias = torch.bmm(a + 3.14, b).permute(0, 2, 1).reshape(3992, -1)
    return torch.addmm(bias, mat1, mat2)

f(
    torch.randn(3992, 20, 40).cuda(),
    torch.randn(3992, 40, 192).cuda(),
    torch.empty(3992, 1024).cuda(),
    torch.empty(1024, 3840).cuda(),
)
```
with this line:
690f54b0f5/torch/_inductor/fx_passes/post_grad.py (L650)
changed to `if cond(*args, **kwargs):`, fails before and succeeds after this PR.
Differential Revision: D53743146
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119867
Approved by: https://github.com/xw285cornell
As described in [this talk](https://www.youtube.com/watch?v=I95KmF6KSIA) and [this repo](https://github.com/osalpekar/llm-target-determinator), we are experimenting with using CodeLlama-powered information retrieval for target determination.
The idea is that we create embeddings for PyTorch test functions, and store this index in S3. Then when a new PR comes in, we create embedding(s) for that PR, compare them to the index of test embeddings, and run only the most relevant tests.
This PR creates a workflow that does the indexing part (creating embeddings for the test functions and storing them in S3). All the logic for running the indexer is in [osalpekar/llm-target-determinator](https://github.com/osalpekar/llm-target-determinator). This workflow just checks out the relevant repos, installs the dependencies, runs the torchrun command to trigger indexing, and uploads the artifacts to S3.
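As a rough illustration of the retrieval step described above (the real logic lives in osalpekar/llm-target-determinator; `pr_embedding`, `test_embeddings`, and `test_names` stand in for the artifacts this workflow produces):
```python
import numpy as np

def most_relevant_tests(pr_embedding, test_embeddings, test_names, k=10):
    # Rank test functions by cosine similarity between the PR embedding and
    # each test-function embedding, then keep the top-k candidates.
    pr = pr_embedding / np.linalg.norm(pr_embedding)
    tests = test_embeddings / np.linalg.norm(test_embeddings, axis=1, keepdims=True)
    scores = tests @ pr
    return [test_names[i] for i in np.argsort(-scores)[:k]]
```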
Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118824
Approved by: https://github.com/izaitsevfb, https://github.com/huydhn
Summary:
Fix for a case where the --run-path option fails to exit if the script exits with a non-error status code.
When there is an error exit code, run-path correctly detects the error and fails when calling spawn.join(). For the non-error case, however, the current behavior is to check the return value of the operation; the fix is to return None so that our MP code detects the exit.
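A hypothetical sketch of the pattern (not the exact torch.distributed.run source): the wrapper that executes the user script deliberately returns None, so the elastic multiprocessing layer treats a clean finish as an exit rather than inspecting the script's return value.
```python
import runpy
import sys

def run_script_path(script_path, *script_args):
    # Execute the user script as __main__; intentionally return None so the
    # spawn/join logic detects a normal exit instead of a return value.
    sys.argv = [script_path, *script_args]
    runpy.run_path(script_path, run_name="__main__")
    return None
```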
Test Plan:
cat /tmp/script.py
~~~
import sys

def main():
    exit_code = 1
    if len(sys.argv) > 1:
        exit_code = int(sys.argv[1])
    sys.exit(exit_code)

if __name__ == "__main__":
    main()
~~~
Case of exit code 0 (prior behavior: never exits):
torchrun --run-path /tmp/script.py 0
~~~
[2024-02-12 09:20:57,523] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-02-12 09:20:58,980] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
(conda:pytorch) ➜ workspace echo $?
0
~~~
Existing behavior for non-zero exit code still works:
torchrun --run-path /tmp/script.py
~~~
(conda:pytorch) ➜ workspace torchrun --run-path /tmp/script.py
[2024-02-12 09:16:20,667] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-02-12 09:16:22,197] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 64668) of fn: run_script_path (start_method: spawn)
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] Traceback (most recent call last):
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] File "/Users/kurman/workspace/pytorch/torch/distributed/elastic/multiprocessing/api.py", line 441, in _poll
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] self._pc.join(-1)
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] File "/Users/kurman/workspace/pytorch/torch/multiprocessing/spawn.py", line 177, in join
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] raise ProcessExitedException(
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 1
Traceback (most recent call last):
File "/Users/kurman/miniconda3/envs/pytorch/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kurman/workspace/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/Users/kurman/workspace/pytorch/torch/distributed/run.py", line 812, in main
run(args)
File "/Users/kurman/workspace/pytorch/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/Users/kurman/workspace/pytorch/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kurman/workspace/pytorch/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_script_path FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-02-12_09:16:25
host : kurman-mbp.dhcp.thefacebook.com
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 64668)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
~~~
Differential Revision: D53653874
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119697
Approved by: https://github.com/wconstab
Meta registration wrongly assumes 4D inputs, while the underlying op allows 3D inputs for the `mha_varlen_fwd()` case.
Testing: I added `detach()`es so the NJT test `test_sdpa_compile()` won't fail for a view-related reason. It should pass now with this fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119812
Approved by: https://github.com/drisspg
# Summary
Initially reported in https://github.com/pytorch/pytorch/issues/119320
I found that by updating this function the NaN values went away. I then created a Godbolt link to try to highlight the difference between the two versions:
https://godbolt.org/z/3sKqEqn4M
However, they appear to always produce the same value as the nvcc version is varied, except that for some versions -inf is chosen and for others the correct subnormal is chosen... I am having a hard time finding an isolated test case for this, but will keep working on it.
### Update:
I added printf statements to the version and indeed some values/*addr contain -0.0f, which is why this update fixes the reported issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119577
Approved by: https://github.com/yifuwang
Summary:
This PR is a refactor of semi-structured sparsity support.
**deprecation**:
Before, `torch.sparse.to_sparse_semi_structured` had a kwarg param
`transposed=False`, which has been removed. This kwarg was unused, and
passing it now throws a deprecation warning.
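For context, a minimal usage sketch of the entry point whose kwarg is being deprecated (assumes a CUDA GPU with 2:4 sparse kernel support; shapes are illustrative):
```python
import torch
from torch.sparse import to_sparse_semi_structured

# Dense tensor with a 2:4 sparsity pattern: two zeros in every group of four.
A = torch.Tensor([0, 0, 1, 1]).tile((128, 32)).half().cuda()

A_sparse = to_sparse_semi_structured(A)  # supported, kwarg-free call
# to_sparse_semi_structured(A, transposed=False) now emits a deprecation warning
```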
Namely, I've taken the subclassing implementation that xFormers has
created and brought it over to PyTorch, as part of our plan to upstream
runtime 2:4 sparsity.
I've also copied over all the op support that Daniel implemented that
did not depend on the fast sparsification routines into
`_sparse_semi_structured_ops.py`.
With this subclass, all of our internal tests pass, as well as those in
xFormers.
The main change is that we now define a base subclass,
`SparseSemiStructuredTensor`, that is inherited from for each of the
specific backends.
We can also now arbitrarily override the sparse dispatch table with
`_load_dispatch_table()`, the idea being that this is still general enough
that users don't need to modify PyTorch source code to get their model
working.
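A hypothetical illustration of that dispatch-override pattern (schematic only, not the actual `SparseSemiStructuredTensor` code):
```python
import torch

class MySparseSubclass(torch.Tensor):
    _SPARSE_DISPATCH = {}  # aten overload -> python handler

    @classmethod
    def _load_dispatch_table(cls, custom_table=None):
        # Base table of supported ops; users can pass extra handlers without
        # having to patch pytorch itself.
        cls._SPARSE_DISPATCH = {
            # Simplified handler: real implementations construct a new wrapped instance.
            torch.ops.aten.detach.default: lambda func, types, args, kwargs: args[0],
        }
        if custom_table is not None:
            cls._SPARSE_DISPATCH.update(custom_table)

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        if func not in cls._SPARSE_DISPATCH:
            raise NotImplementedError(f"{func} is not supported")
        return cls._SPARSE_DISPATCH[func](func, types, args, kwargs or {})
```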
This also adds in padding support and stores alg_id and fuse_transpose
as flags on the tensor, instead of hardcoding them.
There still remain two components in xFormers that will need to be
ported over eventually:
- the autograd functions (`Sparsify24`, `Sparsify24_like`)
- fast sparsification routines that they rely on
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117302
Approved by: https://github.com/alexsamardzic, https://github.com/HDCharles
Otherwise, at least on MacOS, builds are littered with:
```
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/DeviceAccelerator.h:6:
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/MTIAHooksInterface.h:23:11: warning: '~MTIAHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
virtual ~MTIAHooksInterface() = default;
^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/CUDAHooksInterface.h:65:11: warning: '~CUDAHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
virtual ~CUDAHooksInterface() = default;
^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:15:11: note: overridden virtual function is here
virtual ~AcceleratorHooksInterface() = default;
^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/MPSHooksInterface.h:21:11: warning: '~MPSHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
virtual ~MPSHooksInterface() = default;
^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:15:11: note: overridden virtual function is here
virtual ~AcceleratorHooksInterface() = default;
^
```
Likely introduced by https://github.com/pytorch/pytorch/pull/119329
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119656
Approved by: https://github.com/Skylion007
Improve the performance of inductor when searching large graphs for potential fusions.
Also adds some direct unit tests of find_independent_subset_greedy() to ensure that the rewrite didn't break behavior.
Fixes #98467
Previously find_independent_subset_greedy() was recursive and the example from the issue would cause it to blow out the stack. This changes it to be iterative and also caches some of the computed dependencies (it can't cache all of them because the caller is allowed to change the graph during the iteration).
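A schematic sketch of the recursion-to-iteration rewrite (illustrative only; the real function operates on fx nodes and has to drop cached dependencies when the caller mutates the graph):
```python
from collections import deque

def find_independent_subsets(nodes, depends_on):
    """Greedily yield subsets of mutually independent nodes using an explicit
    worklist instead of recursion. `depends_on(n)` returns the set of nodes
    that `n` transitively depends on; results are cached per node."""
    remaining = deque(nodes)
    dep_cache = {}

    def deps(n):
        if n not in dep_cache:
            dep_cache[n] = depends_on(n)
        return dep_cache[n]

    while remaining:
        subset = [remaining.popleft()]
        survivors = deque()
        for cand in remaining:
            # cand joins the subset only if it is independent of everything chosen so far.
            if any(cand in deps(chosen) or chosen in deps(cand) for chosen in subset):
                survivors.append(cand)
            else:
                subset.append(cand)
        remaining = survivors
        yield subset
```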
Fusion is still slow - but at least finishes.
After this change the example given in #98467 has the following backend timings (on one particular CPU):
eager timing: 3m:23s
aot_eager timing: 4m:12s
inductor timing: 22m:24s
Possible future work to improve this further:
1. In dynamo limit the amount of inlining allowed before falling back to a graph break. This test ends up tracing through 483k bytecodes generating the graph.
2. In inductor have a limit so we don't exhaustively search the graph for fusion possibilities.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118324
Approved by: https://github.com/oulgen
By just calling `std_mps` and `mean` in sequence
Move the `var_mean` decomp to `ReduceOps.mm`, as it should be faster to skip dispatching to Python, which one can validate by running the following script:
```python
from timeit import default_timer

import torch
from torch.utils.benchmark import Measurement, Timer


def bench_var_mean(
    m, n, k,
    dtype=torch.float32,
    device: str = "cpu",
) -> Measurement:
    setup = f"""
x = torch.rand({m}, {n}, {k}, dtype={dtype}, device="{device}")
"""
    t = Timer(
        stmt="torch.var_mean(x, dim=1)", setup=setup, language="python", timer=default_timer
    )
    return t.blocked_autorange()


for x in [100, 1000]:
    rc = bench_var_mean(1000, x, 100, device="mps")
    print(f"{x:5} : {rc.mean*1e6:.2f} usec")
```
which before the change reports 681 and 1268 usec, and after, 668 and 684 usec (which probably means that the GPU is not saturated, but the overhead of switching between the native and interpreted runtimes is shorter).
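A rough Python-level picture of the decomposition (the real code lives in ReduceOps.mm and dispatches to the native MPS reduction kernels):
```python
import torch

def var_mean_two_pass(x, dim):
    # Two separate native reductions over the same input, rather than a single
    # fused var_mean kernel.
    return torch.var(x, dim=dim), torch.mean(x, dim=dim)
```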
Fixes https://github.com/pytorch/pytorch/issues/119663
TODOs:
- Refactor the codebase and implement a proper composite function (which should be faster)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119777
Approved by: https://github.com/albanD
Tune the grid and block sizes for ROCm. Add a contig kernel separate from aligned+contig.
Verified new performance using pytorch/benchmarks/operator_benchmark.
`python -m pt.cat_test --device=cuda --tag-filter all`
On MI200 this improved performance by 4% on average, and on MI300 by 14%.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118685
Approved by: https://github.com/malfet
Summary:
auto& entry = entries_.at(*id % max_entries_);
entry = entries_.at(*id % max_entries_);
The second line above has the unintended consequence of invoking copy assignment
on the entry object, since the reference itself cannot be re-assigned.
What could also cause the crash is that the entry reference can become invalid if entries_ is
resized by other threads, and this could result in a 'copy to a garbage
location'. The fix is to use a pointer, which can be re-assigned after
re-acquiring the lock.
Tests: python test/distributed/test_c10d_nccl.py NCCLTraceTest
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119748
Approved by: https://github.com/wconstab, https://github.com/fegin
Match FxGraphDrawer compat constructor signature to avoid the following failure when `pydot` is not installed:
```
File "/pytorch/torch/_functorch/partitioners.py", line 933, in draw_graph
g = graph_drawer.FxGraphDrawer(
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
TypeError: __init__() got an unexpected keyword argument 'dot_graph_shape'
```
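A hypothetical sketch of the compat-stub pattern (the actual fix mirrors the real `FxGraphDrawer.__init__` signature; the `**kwargs` catch-all here is just for illustration):
```python
class FxGraphDrawer:
    """Fallback stub used when `import pydot` fails; it accepts the same keyword
    arguments as the real drawer so newer callers (e.g. passing dot_graph_shape)
    get a clear error instead of a TypeError."""

    def __init__(self, graph_module, name, ignore_getattr=False, **kwargs):
        raise RuntimeError(
            "FxGraphDrawer requires the pydot package to be installed; "
            f"cannot draw graph {name}"
        )
```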
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119767
Approved by: https://github.com/eellison