Commit Graph

69523 Commits

Author SHA1 Message Date
laith sakka edd9ddf73f Propagate allow_non_graph_fake between get_fake_values_from_nodes and get_fake_values (#119731)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119731
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #119314, #119435
2024-02-14 15:26:17 +00:00
cyy 87c6cd2f00 [1/N] Replace std::tie with structural binding (#119774)
This PR replaces some std::tie calls with structured bindings from C++17. This not only makes the code more compact, but also brings some performance gain.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119774
Approved by: https://github.com/albanD, https://github.com/malfet
2024-02-14 09:25:04 +00:00
Shuqiang Zhang a45c627f27 [c10d][flight recorder] store a copy of string in entry (#119837)
Summary:
Previously, we stored only the char pointer in the entry; the string is a
temporary object and will already be destructed by the time we want to dump/access it.

A quick fix is to store a copy of the string, without changing the
upstream char*.

An alternative would be to change every profilingTitle into std::string; this,
however, would require a comprehensive overhaul of the code up to the
c10d::work layer above workNCCL, RecordFunction, etc.

We chose the first option for this change.

Resolves #119808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119837
Approved by: https://github.com/zdevito, https://github.com/wconstab
2024-02-14 09:13:56 +00:00
Adnan Akhundov 4a50572c92 [inductor] Recursively unwrap_storage_for_input when convert_to_reinterpret_view fails (#119867)
Summary:
When, during `ExternKernel.realize_input` call, underlying `ExternKernel.convert_to_reinterpret_view` fails, we currently fall back to `cls.copy_input` here:

31e59766e7/torch/_inductor/ir.py (L3805-L3816)

This creates a `TensorBox(StorageBox(...))` wrapped output, which causes a problem for this assertion:

31e59766e7/torch/_inductor/ir.py (L3479)

Here we add special-case handling to unwrap `x` recursively.
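
For intuition, here is a minimal sketch of the recursive unwrapping, using hypothetical stand-in classes rather than the real `torch/_inductor/ir.py` types:

```python
from dataclasses import dataclass
from typing import Any

# Simplified stand-ins for inductor's wrapper IR nodes (illustrative only,
# not the actual TensorBox/StorageBox classes from torch/_inductor/ir.py).
@dataclass
class StorageBox:
    data: Any

@dataclass
class TensorBox:
    data: Any

def unwrap_storage_for_input(x: Any) -> Any:
    # Peel wrapper layers until the underlying buffer is reached, so a
    # TensorBox(StorageBox(buf)) output unwraps the same way a bare buf does.
    while isinstance(x, (TensorBox, StorageBox)):
        x = x.data
    return x

assert unwrap_storage_for_input(TensorBox(StorageBox("buf"))) == "buf"
```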

Test Plan:
This local repro:

```
@torch.compile()
def f(a, b, mat1, mat2):
    bias = torch.bmm(a + 3.14, b).permute(0, 2, 1).reshape(3992, -1)
    return torch.addmm(bias, mat1, mat2)
f(
    torch.randn(3992, 20, 40).cuda(),
    torch.randn(3992, 40, 192).cuda(),
    torch.empty(3992, 1024).cuda(),
    torch.empty(1024, 3840).cuda(),
)
```

with this line:

690f54b0f5/torch/_inductor/fx_passes/post_grad.py (L650)

changed to `if cond(*args, **kwargs):`, the repro fails before and succeeds after this PR.

Differential Revision: D53743146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119867
Approved by: https://github.com/xw285cornell
2024-02-14 07:50:34 +00:00
Michael Lazos 9f44274373 Add tests to verify disabled optimizers (#118919)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118919
Approved by: https://github.com/janeyx99
2024-02-14 07:45:16 +00:00
Omkar Salpekar ca55468416 Target Determinator Indexer Workflow (#118824)
As described in [this talk](https://www.youtube.com/watch?v=I95KmF6KSIA) and [this repo](https://github.com/osalpekar/llm-target-determinator),  we are experimenting with using CodeLlama-powered information retrieval for target determination.

The idea is that we create embeddings for PyTorch test functions, and store this index in S3. Then when a new PR comes in, we create embedding(s) for that PR, compare them to the index of test embeddings, and run only the most relevant tests.
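
As a toy sketch of that retrieval step (names and dimensions here are illustrative, not taken from the llm-target-determinator code):

```python
import torch
import torch.nn.functional as F

def top_k_tests(pr_emb: torch.Tensor, test_embs: torch.Tensor, k: int = 5) -> torch.Tensor:
    # Rank the indexed test-function embeddings by cosine similarity to the PR embedding.
    sims = F.cosine_similarity(pr_emb.unsqueeze(0), test_embs)
    return sims.topk(min(k, test_embs.size(0))).indices

pr_emb = torch.randn(768)           # embedding of the incoming PR
test_embs = torch.randn(100, 768)   # index of 100 test-function embeddings
print(top_k_tests(pr_emb, test_embs))  # indices of the most relevant tests
```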

This PR creates a workflow that does the indexing part (creating embeddings for functions and storing them in S3). All the logic for running the indexer is in [osalpekar/llm-target-determinator](https://github.com/osalpekar/llm-target-determinator). This workflow just checks out the relevant repos, installs the dependencies, runs the torchrun command to trigger indexing, and uploads the artifacts to S3.
Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118824
Approved by: https://github.com/izaitsevfb, https://github.com/huydhn
2024-02-14 06:21:18 +00:00
PyTorch UpdateBot caf9d9d7c1 [executorch hash update] update the pinned executorch hash (#119733)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119733
Approved by: https://github.com/pytorchbot
2024-02-14 06:15:25 +00:00
Yanbo Liang 54a30f6d4e [Dynamo] Update trace_rules.py and re-enable skipped tests (#119860)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119860
Approved by: https://github.com/angelayi
2024-02-14 05:22:55 +00:00
Oguz Ulgen 8ba2675488 Fix for-loop divisibility parsing (#119859)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119859
Approved by: https://github.com/aakhundov
ghstack dependencies: #119834, #119835, #119836, #119838
2024-02-14 05:09:59 +00:00
Oguz Ulgen 1f0e4ac146 Add support for while-loops in ttir analysis (#119838)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119838
Approved by: https://github.com/aakhundov
ghstack dependencies: #119834, #119835, #119836
2024-02-14 05:09:59 +00:00
Oguz Ulgen 5ffac768f6 Add support for labels to ttir analysis (#119836)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119836
Approved by: https://github.com/aakhundov
ghstack dependencies: #119834, #119835
2024-02-14 05:09:59 +00:00
Oguz Ulgen 3f09c5ee66 Add TTIR verification (#119835)
Make sure the generated TTIR is valid before attempting to analyze it. Incorrectly written Triton code can produce broken TTIR. Minor discussion at https://github.com/openai/triton/issues/3120
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119835
Approved by: https://github.com/aakhundov
ghstack dependencies: #119834
2024-02-14 05:09:59 +00:00
Oguz Ulgen b257ff80da Add test scf.for with multi return (#119834)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119834
Approved by: https://github.com/aakhundov
2024-02-14 05:09:59 +00:00
Huy Do 72bbbab70a Add the missing test_dynamo_list_index from #119151 (D53392287) (#119854)
D53392287 somehow botched the export, and the exported PR https://github.com/pytorch/pytorch/pull/119151 didn't contain the added test. The discrepancy shows up in the diff-train patch-up diff D53694548.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119854
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-02-14 04:10:02 +00:00
Bert Maher 563f1b9fef [inductor] Use torch.cuda.clock_rate instead of triton.testing.nvsmi (#118662)
`triton.testing.nvsmi` invokes `nvidia-smi` as a subprocess, and Meta
prod usually doesn't make nvidia-smi available.  Might as well just use
something that's native to torch.

Differential Revision: [D53235814](https://our.internmc.facebook.com/intern/diff/D53235814/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118662
Approved by: https://github.com/jansel
2024-02-14 03:23:49 +00:00
Animesh Jain 80379ef0aa [dynamo-must-fix] Use ID_MATCH for UserDefinedClass (#119853)
Fixes https://github.com/pytorch/pytorch/issues/119715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119853
Approved by: https://github.com/jansel
2024-02-14 03:14:42 +00:00
Kurman Karabukaev 4240304da4 [TorchElastic] Handle SystemExit with code == 0 (#119697)
Summary:
Fix for a case where the --run-path option fails to exit if the script exits with a non-error status code.
When there is an error exit code, run-path correctly detects the error and fails when calling spawn.join(). However, for the non-error case, the current behavior is to check the return value of the operation; the fix is to return None so that our MP code detects a clean exit.
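
A minimal sketch of that idea (a hypothetical wrapper, not the actual torch.distributed source):

```python
import runpy

def run_script_path(script_path: str) -> None:
    # Hypothetical sketch: SystemExit with code 0 (or None) is a clean exit,
    # so return None rather than letting the MP layer treat it as a result.
    try:
        runpy.run_path(script_path, run_name="__main__")
    except SystemExit as e:
        if e.code in (None, 0):
            return None
        raise
```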

Test Plan:
cat /tmp/script.py
~~~
import sys
def main():
    exit_code = 1
    if len(sys.argv) > 1:
        exit_code = int(sys.argv[1])
    sys.exit(exit_code)

if __name__=="__main__":
    main()
~~~

Case of exit code 0 (prior behavior: never exits):
torchrun --run-path /tmp/script.py 0

~~~
[2024-02-12 09:20:57,523] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-02-12 09:20:58,980] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
(conda:pytorch) ➜  workspace echo $?
0
~~~

Existing behavior for non-zero exit code still works:
torchrun --run-path /tmp/script.py
~~~
(conda:pytorch) ➜  workspace torchrun --run-path /tmp/script.py
[2024-02-12 09:16:20,667] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-02-12 09:16:22,197] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 64668) of fn: run_script_path (start_method: spawn)
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] Traceback (most recent call last):
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR]   File "/Users/kurman/workspace/pytorch/torch/distributed/elastic/multiprocessing/api.py", line 441, in _poll
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR]     self._pc.join(-1)
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR]   File "/Users/kurman/workspace/pytorch/torch/multiprocessing/spawn.py", line 177, in join
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR]     raise ProcessExitedException(
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 1
Traceback (most recent call last):
  File "/Users/kurman/miniconda3/envs/pytorch/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kurman/workspace/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/Users/kurman/workspace/pytorch/torch/distributed/run.py", line 812, in main
    run(args)
  File "/Users/kurman/workspace/pytorch/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/Users/kurman/workspace/pytorch/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kurman/workspace/pytorch/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_script_path FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-02-12_09:16:25
  host      : kurman-mbp.dhcp.thefacebook.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 64668)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
~~~

Differential Revision: D53653874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119697
Approved by: https://github.com/wconstab
2024-02-14 03:09:09 +00:00
Aaron Meurer 5ce305270b Add a decomposition for isin() (#115390)
Co-authored-by: Peter Bell <peterbell10@live.co.uk>
Co-authored-by: Mario Lezcano Casado <3291265+lezcano@users.noreply.github.com>
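For reference, a decomposition for `isin()` can be expressed in terms of broadcasting primitives roughly like this (an illustrative sketch; the actual decomp may differ):

```python
import torch

def isin_decomp(elements: torch.Tensor, test_elements: torch.Tensor) -> torch.Tensor:
    # Compare every element against every test element, then reduce with any().
    return (elements.unsqueeze(-1) == test_elements.reshape(-1)).any(-1)

a = torch.tensor([1, 2, 3, 4])
b = torch.tensor([2, 4])
assert torch.equal(isin_decomp(a, b), torch.isin(a, b))
```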
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115390
Approved by: https://github.com/peterbell10
2024-02-14 03:03:42 +00:00
Jason Ansel 75a6d6aef7 [inductor] Support storage resizing (#119749)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119749
Approved by: https://github.com/yf225
ghstack dependencies: #119647, #119671
2024-02-14 03:03:38 +00:00
Joel Schlosser 31e59766e7 Fix meta registration for _flash_attention_forward() (#119812)
Meta registration wrongly assumes 4D inputs, while the underlying op allows 3D inputs for the `mha_varlen_fwd()` case.
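For background, a meta registration only computes output shapes/dtypes on the "meta" device without touching data; a toy sketch of one that, like this fix, accepts both 3D and 4D inputs:

```python
import torch

def toy_attention_meta(q: torch.Tensor) -> torch.Tensor:
    # Toy meta function (not the real _flash_attention_forward registration):
    # only shape/dtype logic runs; no data memory is allocated.
    assert q.dim() in (3, 4), "3D (varlen) and 4D inputs are both valid"
    return q.new_empty(q.shape)

q = torch.empty(8, 16, 64, device="meta")  # 3D varlen-style input
print(toy_attention_meta(q).shape)
```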
Testing: I added `detach()`es so the NJT test `test_sdpa_compile()` won't fail for a view-related reason. It should pass now with this fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119812
Approved by: https://github.com/drisspg
2024-02-14 02:38:53 +00:00
Huy Do 179ecab7e7 Do full checkout in lint workflow to rebuild new Docker images (#119858)
From https://github.com/pytorch/pytorch/pull/119575, using `fetch-depth: 1` didn't work for `calculate-docker-image` when rebuilding a new one.  Specifically, doing a full checkout is needed for `git rev-parse HEAD~:.ci/docker` to get the Docker tag.

This shows up as a trunk failure after the recent Docker image update 507db17675
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119858
Approved by: https://github.com/PaliC, https://github.com/clee2000, https://github.com/malfet
2024-02-14 02:37:54 +00:00
Taras Tsugrii 690f54b0f5 [dynamo][nit] Cleanup analyze_kernel_mutations nits. (#119703)
Using `extend` is more efficient; the other changes are stylistic.
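The `extend` point, illustrated on a generic example (not the patched code):

```python
mutations, new_muts = [], ["arg0", "arg1", "arg2"]

# Preferred: one bulk operation.
mutations.extend(new_muts)

# Equivalent but slower: a Python-level loop of repeated append calls.
slow = []
for m in new_muts:
    slow.append(m)

assert mutations == slow
```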
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119703
Approved by: https://github.com/Skylion007
2024-02-14 02:04:13 +00:00
Brian Hirsh f9f0c67445 beef up non-overlapping checks for detecting false aliasing of graph inputs (#119826)
This extra check is needed for some more complicated parameter sizes/strides for an internal model

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119826
Approved by: https://github.com/albanD
2024-02-14 01:46:30 +00:00
drisspg c9459e7f55 Update atomicMaxFloat (#119577)
# Summary

Initially reported in https://github.com/pytorch/pytorch/issues/119320

I found that by updating this function the NaN values went away. I then created a godbolt link to try to highlight the difference between the two versions:
https://godbolt.org/z/3sKqEqn4M

However, they appear to always produce the same value as the nvcc version is varied, except that for some versions -inf is chosen and for others the correct subnormal is chosen... I am having a hard time finding an isolated test case for this but will keep working.

### Update:
I added printf statements to the old version, and indeed some values/`*addr` contain -0.0f. Hence why this update fixes the reported issue.
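
The signed-zero subtlety can be seen without CUDA; a Python illustration of the underlying float facts (not the kernel code):

```python
import struct

print(-0.0 == 0.0)                    # True: the two zeros compare equal...
print(struct.pack(">f", -0.0).hex())  # 80000000: ...but -0.0 carries the sign bit
print(struct.pack(">f", 0.0).hex())   # 00000000

# Reinterpreted as a signed int (the usual atomicMax-on-float trick),
# -0.0f looks like INT_MIN, so an integer max can pick the wrong float.
print(struct.unpack(">i", struct.pack(">f", -0.0))[0])  # -2147483648
```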

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119577
Approved by: https://github.com/yifuwang
2024-02-14 01:17:16 +00:00
suo 8e029dc616 [export] fix tuple return with symints (#119829)
as title.

Differential Revision: [D53726648](https://our.internmc.facebook.com/intern/diff/D53726648/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119829
Approved by: https://github.com/zhxchen17, https://github.com/khabinov
2024-02-14 01:16:38 +00:00
PyTorch MergeBot 4a5b2cd6cb Revert "Windows Dynamo Error Removal CI Check (#115969)"
This reverts commit 45e7af5818.

Reverted https://github.com/pytorch/pytorch/pull/115969 on behalf of https://github.com/PaliC due to this pr ended up breaking some of our periodic tests ([comment](https://github.com/pytorch/pytorch/pull/115969#issuecomment-1942934386))
2024-02-14 01:11:46 +00:00
Jesse Cai 16369816a2 [sparse] semi-structured sparse refactor (#117302)
Summary:

This PR is a refactor of semi-structured sparsity support.

**deprecation**:

Before, `torch.sparse.to_sparse_semi_structured` had a kwarg param
`transposed=False`, which has been removed. This kwarg was unused and
now throws a deprecation warning.

Namely, I've taken the subclassing implementation that xFormers has
created and brought it over to PyTorch, as part of our plan to upstream
runtime 2:4 sparsity.

I've also copied over all the op support that Daniel implemented that
did not depend on the fast sparsification routines into
`_sparse_semi_structured_ops.py`.

With this subclass, all of our internal tests pass, as well as those in
xFormers.

The main change is that we now define a base subclass,
`SparseSemiStructuredTensor`, that each of the specific backends
inherits from.

We can also now arbitrarily override the sparse dispatch table with
`_load_dispatch_table()`, the idea being that this is still general enough
that users don't need to modify PyTorch source code to get their model
working.
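
A minimal sketch of that overridable-dispatch-table pattern (generic Python, not the actual SparseSemiStructuredTensor code):

```python
class SparseBase:
    _dispatch_table: dict = {}

    @classmethod
    def _load_dispatch_table(cls, overrides=None):
        # Users can extend or override entries without touching library source.
        table = dict(cls._dispatch_table)
        table.update(overrides or {})
        cls._dispatch_table = table

    @classmethod
    def dispatch(cls, op, *args):
        handler = cls._dispatch_table.get(op)
        if handler is None:
            raise NotImplementedError(f"{op} not supported")
        return handler(*args)

class CutlassBackend(SparseBase):
    # Each backend subclass ships its own default table.
    _dispatch_table = {"mm": lambda a, b: f"cutlass_mm({a}, {b})"}

CutlassBackend._load_dispatch_table({"addmm": lambda *args: "custom_addmm"})
print(CutlassBackend.dispatch("mm", "A", "B"))  # cutlass_mm(A, B)
```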

This also adds in padding support and stores alg_id and fuse_transpose
as flags on the tensor, instead of hardcoding them.

There still remain two components in xFormers that will need to be
ported over eventually:
- the autograd functions  (`Sparsify24`, `Sparsify24_like`)
- fast sparsification routines that they rely on

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117302
Approved by: https://github.com/alexsamardzic, https://github.com/HDCharles
2024-02-14 01:10:40 +00:00
Nikita Shulga 2536c5186e [BE] Properly mark destructor overrides (Take 2) (#119656)
Otherwise, at least on macOS, builds are littered with:
```
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/DeviceAccelerator.h:6:
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/MTIAHooksInterface.h:23:11: warning: '~MTIAHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
  virtual ~MTIAHooksInterface() = default;
          ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/CUDAHooksInterface.h:65:11: warning: '~CUDAHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
  virtual ~CUDAHooksInterface() = default;
          ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:15:11: note: overridden virtual function is here
  virtual ~AcceleratorHooksInterface() = default;
          ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/MPSHooksInterface.h:21:11: warning: '~MPSHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
  virtual ~MPSHooksInterface() = default;
          ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:15:11: note: overridden virtual function is here
  virtual ~AcceleratorHooksInterface() = default;
          ^
```

 Likely introduced by https://github.com/pytorch/pytorch/pull/119329

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119656
Approved by: https://github.com/Skylion007
2024-02-14 01:05:58 +00:00
cyy cb0886ecf2 [DeviceIndex][4/N] Use DeviceIndex in more places (#119741)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119741
Approved by: https://github.com/aaronenyeshi, https://github.com/ezyang
2024-02-14 00:29:10 +00:00
suo b2e779868f make internal lintrunner mypy clean (#119840)
as title

Differential Revision: [D53732505](https://our.internmc.facebook.com/intern/diff/D53732505/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119840
Approved by: https://github.com/ezyang
2024-02-14 00:25:42 +00:00
angelayi 507db17675 Update HF pin (#119717)
Sometime between now and the previous pin update, HF introduced a
ModelOutputs type, which was not pytree serializable, causing
aot_compile to fail on new HF models
(https://fb.workplace.com/groups/1075192433118967/permalink/1377977852840422/).
With https://github.com/huggingface/transformers/pull/27871, we
can now pytree serialize HF ModelOutputs types.
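
For context, registering a custom container as a pytree looks roughly like this (a sketch assuming the `torch.utils._pytree.register_pytree_node` API; `MyOutputs` is a made-up stand-in for an HF ModelOutput type):

```python
import torch.utils._pytree as pytree

class MyOutputs:  # stand-in for an HF ModelOutput-style container
    def __init__(self, logits, loss=None):
        self.logits, self.loss = logits, loss

pytree.register_pytree_node(
    MyOutputs,
    lambda o: ([o.logits, o.loss], None),        # flatten: children + context
    lambda children, ctx: MyOutputs(*children),  # unflatten
)

leaves, spec = pytree.tree_flatten(MyOutputs("logits", "loss"))
print(leaves)  # ['logits', 'loss']
```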
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119717
Approved by: https://github.com/desertfire
2024-02-14 00:17:16 +00:00
Ozan Aydin b51e0246b7 sccache version update (#119554)
Fixes #37928

`sccache` is updated to a newer version (`v0.7.4`) to fix `multiple input files` non-cacheable calls for CUDA builds.

This should make `Cache hits (CUDA)`  work as expected and improve the speed dramatically.

---

Additional information:

- Modified the `install_sccache.bat` check structure due to the GitHub Actions error `Process completed with exit code 255.`
    - The error occurs when the freshly downloaded `sccache` is called with the `--show-stats` or `--start-server` arguments within the script
    - Now the script checks for the file's existence and kills/deletes the executable before the download

- Removed `sccache-cl` since it is no longer needed with newer versions of `sccache`

---

`win-vs2019-cpu-py3 / build` - `16m 27s`

![image](https://github.com/pytorch/pytorch/assets/148207261/b5628e6c-64bb-4293-9d07-480f56df44f1)

`win-vs2019-cuda11.8-py3 / build` - `17m 4s` **(previously ~45 mins - 1h30mins)**

![image](https://github.com/pytorch/pytorch/assets/148207261/e4ab01cb-0f56-41e8-984f-110e643b9c09)

Now `Cache hits (CUDA)` hits all `304` objects and the `Non-cacheable reasons` error is fixed.

![image](https://github.com/pytorch/pytorch/assets/148207261/c8c25d2e-3fc1-4edb-8982-99c1f490cb54)

---

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119554
Approved by: https://github.com/malfet
2024-02-13 23:50:40 +00:00
Edward Z. Yang be35fc9ea7 Size oblivious test for slice optimization (#119625)
Fixes https://github.com/pytorch/pytorch/issues/119623

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119625
Approved by: https://github.com/albanD
2024-02-13 23:47:52 +00:00
Andrew Gu d81d5f52d5 [FSDP2][ez] Replaced `groupby` with `all` for same-dtype check (#119825)
The `groupby` logic to check if all all-gather inputs have the same dtype is not so readable. Let us use `all` instead.
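
The two equivalent checks, side by side (illustrative):

```python
import torch
from itertools import groupby

dtypes = [t.dtype for t in (torch.randn(2), torch.randn(3), torch.randn(4))]

# Before: collapse runs of equal values and count the groups.
same_via_groupby = len(list(groupby(dtypes))) <= 1
# After: directly ask whether every dtype matches the first.
same_via_all = all(d == dtypes[0] for d in dtypes)

assert same_via_groupby and same_via_all
```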

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119825
Approved by: https://github.com/Skylion007
ghstack dependencies: #119550, #118136, #118223, #118755
2024-02-13 23:28:53 +00:00
Jason Ansel cf117e37d5 Refactor THPStorage_resize_ (#119671)
Moving code around to allow it to be reused in the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119671
Approved by: https://github.com/yf225
ghstack dependencies: #119647
2024-02-13 23:28:47 +00:00
albanD ca777fbbb7 Add Accelerator device and shell hooks (#119329)
This adds a concept of Accelerator that points to one of our devices. See DeviceAccelerator.h in this PR for details https://github.com/pytorch/pytorch/pull/119329/files#diff-83cc748bed5df1a453c272cc5ecc7e572d4eb694c5125384d8fbd17a0b5f50c8
It also adds scaffolding for a shared C++ API to allow generic feature implementations. This PR in particular updates the autograd engine to use this generic API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119329
Approved by: https://github.com/ezyang, https://github.com/huydhn
2024-02-13 23:15:24 +00:00
Aaron Orenstein e9b78f2db0 Rewrite group_batch_fusion.find_independent_subset_greedy() to be iterative. (#118324)
Improve the performance of inductor when searching large graphs for potential fusions.
Also adds some direct unit tests of find_independent_subset_greedy() to ensure that the rewrite didn't break behavior.

Fixes #98467

Previously find_independent_subset_greedy() was recursive and the example from the issue would cause it to blow out the stack. This changes it to be iterative and also caches some of the computed dependencies (it can't cache all of them because the caller is allowed to change the graph during the iteration).
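
A generic sketch of such a recursion-to-iteration rewrite (illustrative, not the actual group_batch_fusion code): replace the call stack with an explicit worklist so deep graphs cannot overflow the Python stack.

```python
def iter_dfs(start, children):
    # Explicit stack instead of recursion: traversal depth is bounded by heap
    # memory, not the recursion limit, so huge graphs no longer blow the stack.
    seen, stack, order = set(), [start], []
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        stack.extend(children.get(node, ()))
    return order

graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(iter_dfs("a", graph))  # ['a', 'c', 'd', 'b']
```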

Fusion is still slow - but at least finishes.

After this change the example given in #98467 has the following backend timings (on one particular CPU):
eager timing: 3m:23s
aot_eager timing: 4m:12s
inductor timing: 22m:24s

Possible future work to improve this further:
1. In dynamo limit the amount of inlining allowed before falling back to a graph break. This test ends up tracing through 483k bytecodes generating the graph.
2. In inductor have a limit so we don't exhaustively search the graph for fusion possibilities.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118324
Approved by: https://github.com/oulgen
2024-02-13 22:54:53 +00:00
Jeff Daily ba1eb0e27f [ROCm] upgrade CI to 6.0 (#119495)
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119495
Approved by: https://github.com/huydhn
2024-02-13 22:39:03 +00:00
blorange-amd df9b44436a [ROCm] Enable float16/complex32 fft tests on ROCm (#117296)
This PR enables float16/complex32 FFT tests on ROCm.
Sample results are attached here:
[test_spectral_ops_results.log](https://github.com/pytorch/pytorch/files/13908533/test_spectral_ops_results.log)

test_decomp::TestDecompCUDA::test_comprehensive_fft*
test_decomp::TestDecompCUDA::test_quick_fft*
test_jit_fuser_te::TestNNCOpInfoCUDA::test_nnc_correctness_fft*
test_meta::TestMetaCUDA::test_dispatch_meta_inplace_fft*
test_meta::TestMetaCUDA::test_dispatch_meta_outplace_fft*
test_meta::TestMetaCUDA::test_dispatch_symbolic_meta_inplace_fft*
test_meta::TestMetaCUDA::test_dispatch_symbolic_meta_outplace_fft*
test_meta::TestMetaCUDA::test_meta_inplace_fft*
test_meta::TestMetaCUDA::test_meta_outplace_fft*
test_ops::TestCommonCUDA::test_complex_half_reference_testing_fft*
test_ops::TestCommonCUDA::test_python_ref__refs_fft*
test_ops::TestCommonCUDA::test_python_ref_executor__refs_fft*
test_ops::TestCommonCUDA::test_python_ref_meta__refs*
test_ops::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft*
test_schema_check::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft*
test_spectral_ops::TestFFTCUDA::test_empty_fft__refs_fft*
test_spectral_ops::TestFFTCUDA::test_empty_fft_fft*
test_spectral_ops::TestFFTCUDA::test_fft_half_and_chalf_not_power_of_two_error__refs_fft*
test_spectral_ops::TestFFTCUDA::test_fft_half_and_chalf_not_power_of_two_error_fft*
test_spectral_ops::TestFFTCUDA::test_fft_round_trip_cuda*
test_spectral_ops::TestFFTCUDA::test_fft_type_promotion_cuda*
test_spectral_ops::TestFFTCUDA::test_fftn_round_trip_cuda*
test_spectral_ops::TestFFTCUDA::test_hfftn_cuda_float16
test_spectral_ops::TestFFTCUDA::test_ihfftn_cuda_float16
test_utils::TestDeviceUtilsCUDA::test_device_mode_ops_fft

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117296
Approved by: https://github.com/pruthvistony, https://github.com/malfet
2024-02-13 22:35:32 +00:00
Nikita Shulga 63d64c8995 [MPS] Enable more bfloat16 ops (#119738)
Introduce a convenience inlinable `mps::supportedFloatingType` function
that returns true if the type is Float, Half, or BFloat16.

Tested by running LLM inference using bfloat16.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119738
Approved by: https://github.com/Skylion007
2024-02-13 22:11:00 +00:00
Nikita Shulga eb9a3383c2 [MPS] Add naive std_mean implementation (#119777)
By just calling `std_mps` and `mean` in sequence
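
In Python terms, the naive composition is simply the following (an algebraic sketch; the PR implements it natively for MPS):

```python
import torch

def naive_std_mean(x: torch.Tensor, dim: int):
    # Two separate reductions: one for std, one for mean.
    return torch.std(x, dim=dim), torch.mean(x, dim=dim)

x = torch.randn(4, 8)
s, m = naive_std_mean(x, dim=1)
ref_s, ref_m = torch.std_mean(x, dim=1)
torch.testing.assert_close(s, ref_s)
torch.testing.assert_close(m, ref_m)
```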

Move the `var_mean` decomp to `ReduceOps.mm`, as it should be faster to skip dispatching to Python, which one can validate by running the following script:
```python
from timeit import default_timer

import torch
from torch.utils.benchmark import Measurement, Timer

def bench_var_mean(
    m, n, k,
    dtype = torch.float32,
    device:str = "cpu",
) -> Measurement:
    setup = f"""
     x = torch.rand({m}, {n}, {k}, dtype={dtype}, device="{device}")
    """

    t = Timer(
        stmt="torch.var_mean(x, dim=1)", setup=setup, language="python", timer=default_timer
    )
    return t.blocked_autorange()

for x in [100, 1000]:
    rc = bench_var_mean(1000, x, 100, device="mps")
    print(f"{x:5} : {rc.mean*1e6:.2f} usec")
```
which before the change reports 681 and 1268 usec, and after it 668 and 684 usec (which probably means that the GPU is not saturated, but the overhead from switching between the native and interpreted runtimes is shorter).

Fixes https://github.com/pytorch/pytorch/issues/119663

TODOs:
 - Refactor the codebase and implement a proper composite function (which should be faster)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119777
Approved by: https://github.com/albanD
2024-02-13 21:51:29 +00:00
Jeff Daily ee5b59dd4b [ROCm] CatArrayBatchedCopy performance improvement (#118685)
Tune the grid and block sizes for ROCm.  Add a contig kernel separate from aligned+contig.

Verified new performance using pytorch/benchmarks/operator_benchmark.

`python -m pt.cat_test --device=cuda --tag-filter all`

On MI200 this improved performance on average 4%, and on MI300 14%.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118685
Approved by: https://github.com/malfet
2024-02-13 21:51:20 +00:00
Edward Z. Yang 6665b96ebb Rewrite maybe_reduce more carefully for unbacked SymInt (#119562)
Fixes https://github.com/pytorch/pytorch/issues/119476

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119562
Approved by: https://github.com/albanD
ghstack dependencies: #119559
2024-02-13 21:40:06 +00:00
Ke Wen 28f299a870 [c10d] Fix compilation of NCCL_EXP path (#119805)
Fixes issue pointed out in https://github.com/pytorch/pytorch/pull/119421#issuecomment-1941694621

When refactoring ProcessGroupNCCL, some code in the NCCL_EXP path wasn't done cleanly.

Cc: @kunalb @H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119805
Approved by: https://github.com/H-Huang
2024-02-13 21:26:59 +00:00
Aaron Gokaslan f9200c8608 [BE][Ez]: FURB129: remove unneeded readlines() (#119796)
Applies a refurb rule to remove any `readlines()` call in a for-loop iteration, as it just creates a temporary list in memory.
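
The flagged pattern, illustrated on an in-memory file (a generic example, not one of the patched call sites):

```python
import io

f = io.StringIO("a\nb\nc\n")
for line in f:               # preferred: streams lines lazily
    print(line.strip())

f.seek(0)
for line in f.readlines():   # FURB129: builds a full list just to iterate it
    print(line.strip())
```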

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119796
Approved by: https://github.com/ezyang
2024-02-13 21:21:22 +00:00
Guilherme Leobas 3319dbcd23 Update vmap guard to avoid recompilations (#119061)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119061
Approved by: https://github.com/zou3519
2024-02-13 20:50:23 +00:00
Shuqiang Zhang abadbbc4b0 [c10d][flight recorder] remove unintended assignment of entry (#119748)
Summary:
auto& entry = entries_.at(*id % max_entries_);
entry = entries_.at(*id % max_entries_);

The second line above has the unintended consequence of invoking the copy
assignment of entry objects, as the reference itself cannot be re-assigned.

Also, what could cause the crash is that the entry reference could become invalid if entries_ is
resized by other threads, and this could result in a 'copy to a garbage
location'. The fix is to use a pointer, which can be re-assigned after
re-acquiring the lock.

Tests: python test/distributed/test_c10d_nccl.py NCCLTraceTest

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119748
Approved by: https://github.com/wconstab, https://github.com/fegin
2024-02-13 20:18:58 +00:00
Catherine Lee 34638c82a6 [mergebot] No unique behavior for facebook bot re pending jobs (#119735)
If the FB bot says merge without -f, follow the normal behavior and wait for pending checks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119735
Approved by: https://github.com/izaitsevfb, https://github.com/huydhn
2024-02-13 20:07:24 +00:00
vfdev 8ec3d8e35f Fixed FxGraphDrawer compat constructor (#119767)
Match FxGraphDrawer compat constructor signature to avoid the following failure when `pydot` is not installed:
```
  File "/pytorch/torch/_functorch/partitioners.py", line 933, in draw_graph
    g = graph_drawer.FxGraphDrawer(
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
TypeError: __init__() got an unexpected keyword argument 'dot_graph_shape'
```
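
A sketch of the shim idea (illustrative; not the actual PyTorch code): the no-pydot fallback class should accept the same keyword arguments as the real drawer, even if it ignores them.

```python
class FxGraphDrawerCompat:
    # Hypothetical fallback used when pydot is absent: mirror the real
    # constructor's keyword arguments so callers passing dot_graph_shape
    # don't crash with a TypeError.
    def __init__(self, graph, name, dot_graph_shape=None, **kwargs):
        self.graph, self.name = graph, name  # extras accepted but unused
```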
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119767
Approved by: https://github.com/eellison
2024-02-13 19:36:01 +00:00
andrewor14 8ec8d78ef2 [quant][pt2e][be] Rename eval_utils -> export_utils (#119725)
It's not really eval_utils anymore, since we added some training-related
utils. Instead, it should hold util functions that are related to general
export use cases.

Differential Revision: [D53711494](https://our.internmc.facebook.com/intern/diff/D53711494)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119725
Approved by: https://github.com/tugsbayasgalan
2024-02-13 19:10:06 +00:00