Summary:
Previously, we stored just the char pointer in the entry; the string it points to is a
temporary object and will already be destructed when we want to dump/access it.
A quick fix is to store a copy of the string in the entry, without changing the
upstream char*.
An alternative is to change every profilingTitle into std::string; this,
however, would need a comprehensive overhaul of the code up to the
c10d::work layer above workNCCL, RecordFunction, etc.
We chose the first option for this change.
Resolves #119808
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119837
Approved by: https://github.com/zdevito, https://github.com/wconstab
Summary:
When, during an `ExternKernel.realize_input` call, the underlying `ExternKernel.convert_to_reinterpret_view` fails, we currently fall back to `cls.copy_input` here:
31e59766e7/torch/_inductor/ir.py (L3805-L3816)
This creates a `TensorBox(StorageBox(...))` wrapped output, which causes a problem for this assertion:
31e59766e7/torch/_inductor/ir.py (L3479)
Here we add special-case handling to unwrap `x` recursively.
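A minimal sketch of the unwrapping pattern (the `TensorBox`/`StorageBox` names are the real inductor IR wrappers, but `unwrap_boxes` below is an illustrative helper, not the exact code added in this PR):
```python
from torch._inductor.ir import StorageBox, TensorBox

def unwrap_boxes(x):
    # Peel off TensorBox/StorageBox wrappers until we reach the underlying IR node,
    # so downstream assertions that expect an unwrapped value keep holding.
    while isinstance(x, (TensorBox, StorageBox)):
        x = x.data
    return x
```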
Test Plan:
This local repro:
```
import torch

@torch.compile()
def f(a, b, mat1, mat2):
    bias = torch.bmm(a + 3.14, b).permute(0, 2, 1).reshape(3992, -1)
    return torch.addmm(bias, mat1, mat2)

f(
    torch.randn(3992, 20, 40).cuda(),
    torch.randn(3992, 40, 192).cuda(),
    torch.empty(3992, 1024).cuda(),
    torch.empty(1024, 3840).cuda(),
)
```
with this line:
690f54b0f5/torch/_inductor/fx_passes/post_grad.py (L650)
changed to `if cond(*args, **kwargs):`, fails before and succeeds after this PR.
Differential Revision: D53743146
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119867
Approved by: https://github.com/xw285cornell
As described in [this talk](https://www.youtube.com/watch?v=I95KmF6KSIA) and [this repo](https://github.com/osalpekar/llm-target-determinator), we are experimenting with using CodeLlama-powered information retrieval for target determination.
The idea is that we create embeddings for PyTorch test functions, and store this index in S3. Then when a new PR comes in, we create embedding(s) for that PR, compare them to the index of test embeddings, and run only the most relevant tests.
This PR creates a workflow that does the indexing part (creating embeddings for the test functions and storing them in S3). All the logic for running the indexer is in [osalpekar/llm-target-determinator](https://github.com/osalpekar/llm-target-determinator). This workflow just checks out the relevant repos, installs the dependencies, runs the torchrun command to trigger indexing, and uploads the artifacts to S3.
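As a rough illustration of the retrieval step described above (the real logic lives in osalpekar/llm-target-determinator; `pr_embedding`, `test_embeddings`, and `test_names` stand in for the artifacts this workflow produces):
```python
import numpy as np

def most_relevant_tests(pr_embedding, test_embeddings, test_names, k=10):
    # Rank test functions by cosine similarity between the PR embedding and
    # each test-function embedding, then keep the top-k candidates.
    pr = pr_embedding / np.linalg.norm(pr_embedding)
    tests = test_embeddings / np.linalg.norm(test_embeddings, axis=1, keepdims=True)
    scores = tests @ pr
    return [test_names[i] for i in np.argsort(-scores)[:k]]
```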
Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118824
Approved by: https://github.com/izaitsevfb, https://github.com/huydhn
Summary:
Fix for a case where the --run-path option fails to exit if the script exits with a non-error status code.
When there is an error exit code, run-path correctly detects the error and fails when calling spawn.join(). For the non-error case, however, the current behavior is to check the return value of the operation; the fix is to return None so that our MP code detects the exit.
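A hypothetical sketch of the pattern (not the exact torch.distributed.run source): the wrapper that executes the user script deliberately returns None, so the elastic multiprocessing layer treats a clean finish as an exit rather than inspecting the script's return value.
```python
import runpy
import sys

def run_script_path(script_path, *script_args):
    # Execute the user script as __main__; intentionally return None so the
    # spawn/join logic detects a normal exit instead of a return value.
    sys.argv = [script_path, *script_args]
    runpy.run_path(script_path, run_name="__main__")
    return None
```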
Test Plan:
cat /tmp/script.py
~~~
import sys

def main():
    exit_code = 1
    if len(sys.argv) > 1:
        exit_code = int(sys.argv[1])
    sys.exit(exit_code)

if __name__ == "__main__":
    main()
~~~
Case of exit code 0 (prior behavior: never exits):
torchrun --run-path /tmp/script.py 0
~~~
[2024-02-12 09:20:57,523] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-02-12 09:20:58,980] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
(conda:pytorch) ➜ workspace echo $?
0
~~~
Existing behavior for non-zero exit code still works:
torchrun --run-path /tmp/script.py
~~~
(conda:pytorch) ➜ workspace torchrun --run-path /tmp/script.py
[2024-02-12 09:16:20,667] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-02-12 09:16:22,197] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 64668) of fn: run_script_path (start_method: spawn)
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] Traceback (most recent call last):
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] File "/Users/kurman/workspace/pytorch/torch/distributed/elastic/multiprocessing/api.py", line 441, in _poll
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] self._pc.join(-1)
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] File "/Users/kurman/workspace/pytorch/torch/multiprocessing/spawn.py", line 177, in join
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] raise ProcessExitedException(
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 1
Traceback (most recent call last):
File "/Users/kurman/miniconda3/envs/pytorch/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kurman/workspace/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/Users/kurman/workspace/pytorch/torch/distributed/run.py", line 812, in main
run(args)
File "/Users/kurman/workspace/pytorch/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/Users/kurman/workspace/pytorch/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kurman/workspace/pytorch/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_script_path FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-02-12_09:16:25
host : kurman-mbp.dhcp.thefacebook.com
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 64668)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
~~~
Differential Revision: D53653874
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119697
Approved by: https://github.com/wconstab
Meta registration wrongly assumes 4D inputs, while the underlying op allows 3D inputs for the `mha_varlen_fwd()` case.
Testing: I added `detach()`es so the NJT test `test_sdpa_compile()` won't fail for a view-related reason. It should pass now with this fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119812
Approved by: https://github.com/drisspg
# Summary
Initially reported in https://github.com/pytorch/pytorch/issues/119320
I found that by updating this function the NaN values went away. I then created a Godbolt link to try to highlight the difference between the two versions:
https://godbolt.org/z/3sKqEqn4M
However, they appear to always produce the same value as the nvcc version is varied, except that for some versions -inf is chosen and for others the correct subnormal is chosen... I am having a hard time finding an isolated test case for this, but will keep working on it.
### Update:
I added printf statements to the version and indeed some values/*addr contain -0.0f, which is why this update fixes the reported issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119577
Approved by: https://github.com/yifuwang
Summary:
This PR is a refactor of semi-structured sparsity support.
**deprecation**:
Before, `torch.sparse.to_sparse_semi_structured` had a kwarg param
`transposed=False`, which has been removed. This kwarg was unused, and
passing it now throws a deprecation warning.
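For context, a minimal usage sketch of the entry point whose kwarg is being deprecated (assumes a CUDA GPU with 2:4 sparse kernel support; shapes are illustrative):
```python
import torch
from torch.sparse import to_sparse_semi_structured

# Dense tensor with a 2:4 sparsity pattern: two zeros in every group of four.
A = torch.Tensor([0, 0, 1, 1]).tile((128, 32)).half().cuda()

A_sparse = to_sparse_semi_structured(A)  # supported, kwarg-free call
# to_sparse_semi_structured(A, transposed=False) now emits a deprecation warning
```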
Namely, I've taken the subclassing implementation that xFormers has
created and brought it over to PyTorch, as part of our plan to upstream
runtime 2:4 sparsity.
I've also copied over all the op support that Daniel implemented that
did not depend on the fast sparsification routines into
`_sparse_semi_structured_ops.py`.
With this subclass, all of our internal tests pass, as well as those in
xFormers.
The main change is that we now define a base subclass,
`SparseSemiStructuredTensor`, that is inherited from for each of the
specific backends.
We can also now arbitrarily override the sparse dispatch table with
`_load_dispatch_table()`, the idea being that this is still general enough
that users don't need to modify PyTorch source code to get their model
working.
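A hypothetical illustration of that dispatch-override pattern (schematic only, not the actual `SparseSemiStructuredTensor` code):
```python
import torch

class MySparseSubclass(torch.Tensor):
    _SPARSE_DISPATCH = {}  # aten overload -> python handler

    @classmethod
    def _load_dispatch_table(cls, custom_table=None):
        # Base table of supported ops; users can pass extra handlers without
        # having to patch pytorch itself.
        cls._SPARSE_DISPATCH = {
            # Simplified handler: real implementations construct a new wrapped instance.
            torch.ops.aten.detach.default: lambda func, types, args, kwargs: args[0],
        }
        if custom_table is not None:
            cls._SPARSE_DISPATCH.update(custom_table)

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        if func not in cls._SPARSE_DISPATCH:
            raise NotImplementedError(f"{func} is not supported")
        return cls._SPARSE_DISPATCH[func](func, types, args, kwargs or {})
```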
This also adds in padding support and stores alg_id and fuse_transpose
as flags on the tensor, instead of hardcoding them.
There still remain two components in xFormers that will need to be
ported over eventually:
- the autograd functions (`Sparsify24`, `Sparsify24_like`)
- fast sparsification routines that they rely on
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117302
Approved by: https://github.com/alexsamardzic, https://github.com/HDCharles
Otherwise, at least on MacOS, builds are littered with:
```
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/DeviceAccelerator.h:6:
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/MTIAHooksInterface.h:23:11: warning: '~MTIAHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
virtual ~MTIAHooksInterface() = default;
^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/CUDAHooksInterface.h:65:11: warning: '~CUDAHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
virtual ~CUDAHooksInterface() = default;
^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:15:11: note: overridden virtual function is here
virtual ~AcceleratorHooksInterface() = default;
^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/MPSHooksInterface.h:21:11: warning: '~MPSHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
virtual ~MPSHooksInterface() = default;
^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:15:11: note: overridden virtual function is here
virtual ~AcceleratorHooksInterface() = default;
^
```
Likely introduced by https://github.com/pytorch/pytorch/pull/119329
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119656
Approved by: https://github.com/Skylion007
Improve the performance of inductor when searching large graphs for potential fusions.
Also adds some direct unit tests of find_independent_subset_greedy() to ensure that the rewrite didn't break behavior.
Fixes #98467
Previously find_independent_subset_greedy() was recursive and the example from the issue would cause it to blow out the stack. This changes it to be iterative and also caches some of the computed dependencies (it can't cache all of them because the caller is allowed to change the graph during the iteration).
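A schematic sketch of the recursion-to-iteration rewrite (illustrative only; the real function operates on fx nodes and has to drop cached dependencies when the caller mutates the graph):
```python
from collections import deque

def find_independent_subsets(nodes, depends_on):
    """Greedily yield subsets of mutually independent nodes using an explicit
    worklist instead of recursion. `depends_on(n)` returns the set of nodes
    that `n` transitively depends on; results are cached per node."""
    remaining = deque(nodes)
    dep_cache = {}

    def deps(n):
        if n not in dep_cache:
            dep_cache[n] = depends_on(n)
        return dep_cache[n]

    while remaining:
        subset = [remaining.popleft()]
        survivors = deque()
        for cand in remaining:
            # cand joins the subset only if it is independent of everything chosen so far.
            if any(cand in deps(chosen) or chosen in deps(cand) for chosen in subset):
                survivors.append(cand)
            else:
                subset.append(cand)
        remaining = survivors
        yield subset
```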
Fusion is still slow - but at least finishes.
After this change the example given in #98467 has the following backend timings (on one particular CPU):
eager timing: 3m:23s
aot_eager timing: 4m:12s
inductor timing: 22m:24s
Possible future work to improve this further:
1. In dynamo limit the amount of inlining allowed before falling back to a graph break. This test ends up tracing through 483k bytecodes generating the graph.
2. In inductor have a limit so we don't exhaustively search the graph for fusion possibilities.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118324
Approved by: https://github.com/oulgen
By just calling `std_mps` and `mean` in sequence
Move the `var_mean` decomp to `ReduceOps.mm`, as it should be faster to skip dispatching to Python, which one can validate by running the following script:
```python
from timeit import default_timer

import torch
from torch.utils.benchmark import Measurement, Timer


def bench_var_mean(
    m, n, k,
    dtype=torch.float32,
    device: str = "cpu",
) -> Measurement:
    setup = f"""
x = torch.rand({m}, {n}, {k}, dtype={dtype}, device="{device}")
"""
    t = Timer(
        stmt="torch.var_mean(x, dim=1)", setup=setup, language="python", timer=default_timer
    )
    return t.blocked_autorange()


for x in [100, 1000]:
    rc = bench_var_mean(1000, x, 100, device="mps")
    print(f"{x:5} : {rc.mean*1e6:.2f} usec")
```
which before the change reports 681 and 1268 usec, and after, 668 and 684 usec (which probably means that the GPU is not saturated, but the overhead of switching between the native and interpreted runtimes is shorter).
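A rough Python-level picture of the decomposition (the real code lives in ReduceOps.mm and dispatches to the native MPS reduction kernels):
```python
import torch

def var_mean_two_pass(x, dim):
    # Two separate native reductions over the same input, rather than a single
    # fused var_mean kernel.
    return torch.var(x, dim=dim), torch.mean(x, dim=dim)
```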
Fixes https://github.com/pytorch/pytorch/issues/119663
TODOs:
- Refactor the codebase and implement a proper composite function (which should be faster)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119777
Approved by: https://github.com/albanD
Tune the grid and block sizes for ROCm. Add a contig kernel separate from aligned+contig.
Verified new performance using pytorch/benchmarks/operator_benchmark.
`python -m pt.cat_test --device=cuda --tag-filter all`
On MI200 this improved performance by 4% on average, and on MI300 by 14%.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118685
Approved by: https://github.com/malfet
Summary:
auto& entry = entries_.at(*id % max_entries_);
entry = entries_.at(*id % max_entries_);
The second line above has the unintended consequence of invoking copy assignment
on the entry object, since the reference itself cannot be re-assigned.
What could also cause the crash is that the entry reference can become invalid if entries_ is
resized by other threads, and this could result in a 'copy to a garbage
location'. The fix is to use a pointer, which can be re-assigned after
re-acquiring the lock.
Tests: python test/distributed/test_c10d_nccl.py NCCLTraceTest
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119748
Approved by: https://github.com/wconstab, https://github.com/fegin
Match FxGraphDrawer compat constructor signature to avoid the following failure when `pydot` is not installed:
```
File "/pytorch/torch/_functorch/partitioners.py", line 933, in draw_graph
g = graph_drawer.FxGraphDrawer(
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
TypeError: __init__() got an unexpected keyword argument 'dot_graph_shape'
```
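A hypothetical sketch of the compat-stub pattern (the actual fix mirrors the real `FxGraphDrawer.__init__` signature; the `**kwargs` catch-all here is just for illustration):
```python
class FxGraphDrawer:
    """Fallback stub used when `import pydot` fails; it accepts the same keyword
    arguments as the real drawer so newer callers (e.g. passing dot_graph_shape)
    get a clear error instead of a TypeError."""

    def __init__(self, graph_module, name, ignore_getattr=False, **kwargs):
        raise RuntimeError(
            "FxGraphDrawer requires the pydot package to be installed; "
            f"cannot draw graph {name}"
        )
```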
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119767
Approved by: https://github.com/eellison