>> .claude/skills/pt2-bug-basher
PT2 Bug Basher
Debug test failures and runtime errors in the PyTorch 2 compiler stack (Dynamo, Inductor, AOTAutograd, FX graphs).
Workflow Summary
- Environment check -- Ask the user which conda environment to use. Verify it is active by checking `$CONDA_DEFAULT_ENV`. Then run `python -c "import torch; print(torch.__version__)"` to confirm torch is importable and report the version. If the environment is not active or torch cannot be imported, stop and ask the user to activate the correct environment before proceeding.
- Reproduce -- Get a consistent reproduction of the failure
- Minimize -- Reduce the repro to the smallest possible standalone case. Strip away unrelated model logic, use minimal tensor shapes, and isolate the specific op or pattern that triggers the bug.
- Add a unit test -- Do this BEFORE diving into code search or root cause investigation. Add a failing test to the codebase that captures the bug. Place it in a specific, topic-appropriate test file (e.g., `test/dynamo/test_repros.py`, `test/inductor/test_torchinductor.py`, `test/export/test_export.py`). Avoid `test/dynamo/test_misc.py`, which is already oversized; find a more specific test file that matches the area of the bug. Use `torch.testing._internal.common_utils.TestCase` and `run_tests`. The test must fail before the fix and pass after. Having the test first keeps you grounded: you know exactly what "fixed" looks like before you start exploring the codebase.
- Validate on main -- Use `EnterWorktree` to create a worktree checked out at `main`. Copy the new test file into the worktree and run the test there to confirm it fails on main. If the test passes on main, stop: the test may not be capturing the right bug, or the bug may already be fixed. Exit the worktree with `ExitWorktree` (action: remove) and return to the working branch before continuing.
- Gather logs -- Run with appropriate `TORCH_LOGS` settings
- Classify -- Use the Error Triage table to identify the category
- Inspect artifacts -- Check FX graphs, IR, and generated code via `TORCH_COMPILE_DEBUG=1`
- Identify root cause -- Trace from the error back through the compilation pipeline
- Fix -- Apply the fix
- Verify -- Run the new unit test AND nearby related existing tests (e.g., if you changed how `is_exporting` works, also run the existing `test_is_exporting` export test). Use `pytest -k` to quickly run related tests by name. The task is not complete until all pass.
- Self-review -- Use the `/pr-review` skill to review your own changes before presenting them. Fix any issues it flags.
- Celebrate -- Summarize the changes: explain the root cause, what was changed and why, and which tests were added/verified. Then tell the user the bug is squashed. Include a fun, varied motivational message or easter egg to keep spirits high (e.g., a pun, a quote, an ASCII art bug getting squashed). Keep it short and different each time.
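The environment check in the first step could be scripted roughly as follows. This is a minimal sketch: `check_environment` is a hypothetical helper name, and it assumes the expected environment name has already been obtained from the user.

```python
import os


def check_environment(expected_env):
    """Return (ok, message) for the environment check step.

    Hypothetical helper: compares $CONDA_DEFAULT_ENV with the environment
    the user named, then tries to import torch and report its version.
    """
    active = os.environ.get("CONDA_DEFAULT_ENV")
    if active != expected_env:
        return False, f"conda env is {active!r}, expected {expected_env!r}"
    try:
        import torch  # imported lazily so a missing torch is reported, not raised
    except ImportError as exc:
        return False, f"torch cannot be imported: {exc}"
    return True, f"torch {torch.__version__}"
```

If `ok` is false, the workflow says to stop and ask the user to activate the correct environment rather than continue.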
Investigation Strategy
Prefer direct tools over meta_codesearch
Use Grep, Glob, and Read directly for code exploration. Do not spawn meta_codesearch agents — they are slow and expensive. The Architectural Knowledge and Key Source Files sections below should give you enough context to know where to look. A targeted Grep for a function name is always faster.
Know which compilation mode you're in
Before reading implementation code, determine the compilation mode. These share code but diverge in important ways:
- `torch.compile` -- Dynamo + Inductor. `tx.export=False`, no `_compiling_state_context()`.
- `torch.export` (strict) -- `tx.export=True`, `_compiling_state_context()` active.
- `torch.export` (non-strict, the default) -- Uses Dynamo via `fullgraph_capture` but `tx.export` may differ from strict. `_compiling_state_context()` active. Check `torch._export.config.use_new_tracer_experimental`, which changes which code path is used.
Distinguish trace-time vs runtime
Many PT2 bugs come from confusing these two:
- Trace-time: Inside Dynamo's symbolic interpreter. Dynamo intercepts function calls and may constant-fold them (e.g., `is_exporting()` → `ConstantVariable(True)`).
- Runtime: Real tensors, real Python calls, module-level flags like `torch.compiler._is_exporting_flag`.
When debugging, add temporary print() statements directly in the source file rather than monkey-patching from outside — dispatch chains make monkey-patching unreliable.
Gathering Information
Pick the right diagnostic tool based on the error category:
- Quick overview: `TORCH_LOGS="+dynamo,graph_breaks,recompiles" python your_script.py`
- Full debug artifacts: `TORCH_COMPILE_DEBUG=1 python your_script.py`, which creates `torch_compile_debug/` with FX graphs, Inductor IR, and generated code
- Generated code only: `TORCH_LOGS="output_code" python your_script.py`
- Structured tracing: `TORCH_TRACE=/path/to/trace python your_script.py` then `tlparse /path/to/trace`
- Single-threaded (for pdb): `TORCHINDUCTOR_COMPILE_THREADS=1 python your_script.py`
Error Triage
Classify the failure using the error message and traceback:
| Error Pattern | Category | Jump To |
|---|---|---|
| `Unsupported: ...` or graph break in logs | Graph break | Graph Breaks |
| `BackendCompilerFailed` | Inductor/backend crash | Backend Failures |
| `RecompileError` or `cache_size_limit` | Recompilation | Recompilation |
| Accuracy mismatch / wrong numerical output | Accuracy | Accuracy |
| `InternalTorchDynamoError` | Dynamo bug | Internal Errors |
| Segfault or CUDA IMA | Runtime crash | Runtime Crashes |
| Triton assertion / index out of bounds | Triton kernel bug | Triton Failures |
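As a rough illustration, the table can be turned into a first-pass classifier over the captured error text. `classify_failure` and the exact regular expressions are assumptions made for this sketch (in particular the segfault and IMA wording); accuracy mismatches are left out because they rarely have a fixed error string.

```python
import re

# (pattern, category) pairs following the triage table; the first match wins.
# The patterns are illustrative guesses at typical message text.
_TRIAGE_RULES = [
    (r"Unsupported:|graph break", "Graph break"),
    (r"BackendCompilerFailed", "Inductor/backend crash"),
    (r"RecompileError|cache_size_limit", "Recompilation"),
    (r"InternalTorchDynamoError", "Dynamo bug"),
    (r"Segmentation fault|illegal memory access", "Runtime crash"),
    (r"Triton.*assert|index out of bounds", "Triton kernel bug"),
]


def classify_failure(log_text):
    """Return the triage category for an error message or log excerpt."""
    for pattern, category in _TRIAGE_RULES:
        if re.search(pattern, log_text, flags=re.IGNORECASE):
            return category
    return "Unclassified"
```

A result of `"Unclassified"` is a cue to read the traceback by hand, not evidence that the failure is outside the table.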
Debugging by Category
Graph Breaks
Graph breaks split the compiled graph into smaller subgraphs, often causing performance regressions or unexpected behavior.
Diagnosis:
TORCH_LOGS="graph_breaks" python your_script.py
Key files:
- `torch/_dynamo/exc.py` -- `Unsupported` exception class
- `torch/_dynamo/variables/` -- where most graph break decisions happen
Common causes:
- Unsupported Python constructs (data-dependent control flow, unsupported builtins)
- Tensor operations that can't be traced (in-place ops on inputs, unsupported dtypes)
- Calls to non-traceable functions
Fix approach:
- Read the graph break message to identify the unsupported operation
- Check if there's a decomposition or supported alternative
- If the operation genuinely can't be traced, consider `torch._dynamo.allow_in_graph` or restructuring user code
Backend Compiler Failures
BackendCompilerFailed means Inductor (or another backend) crashed during compilation.
Diagnosis:
TORCHDYNAMO_REPRO_AFTER=aot TORCHDYNAMO_REPRO_LEVEL=2 python your_script.py
This generates minifier_launcher.py that isolates the minimal failing graph.
Key files:
- `torch/_dynamo/repro/after_aot.py` -- repro/minifier for post-AOT failures
- `torch/_inductor/` -- the backend itself
Fix approach:
- Run the minifier to get a minimal reproduction
- Inspect the FX graph (`TORCH_COMPILE_DEBUG=1`) to understand what ops are involved
- Check if it's a lowering issue (`torch/_inductor/lowering.py`), scheduling issue, or codegen issue
- Look at the generated output code if the error is in codegen
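For a quick look at which ops reach the backend without generating the full debug directory, a throwaway custom backend can record the graph's call targets. This is a lightweight alternative to `TORCH_COMPILE_DEBUG=1`, not a replacement: it sees the Dynamo-level FX graph, before AOTAutograd and Inductor lowering. `inspecting_backend` and `seen_targets` are names invented for this sketch.

```python
import torch

seen_targets = []


def inspecting_backend(gm, example_inputs):
    """Record every call in the captured FX graph, then run it unmodified."""
    for node in gm.graph.nodes:
        if node.op in ("call_function", "call_method", "call_module"):
            seen_targets.append(str(node.target))
    return gm.forward  # no compilation: execute the captured graph as-is


def fn(x):
    return torch.sin(x) + 1


out = torch.compile(fn, backend=inspecting_backend)(torch.ones(2))
print(seen_targets)
```

If the failure disappears with this backend, the bug is more likely in AOTAutograd or Inductor than in Dynamo's graph capture.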
Recompilation Issues
Excessive recompilation happens when guards are too specific, causing cache misses.
Diagnosis:
TORCH_LOGS="recompiles,recompiles_verbose,guards" python your_script.py
Key config:
- `torch._dynamo.config.recompile_limit` (default: 8)
- `torch._dynamo.config.fail_on_recompile_limit_hit` -- set to `True` to get a hard error
Common causes:
- Changing tensor shapes without marking them dynamic
- Python scalar values that change between calls
- Global state mutations between calls
Fix approach:
- Read the recompilation reason from logs
- Identify the failing guard
- Either mark the relevant dimension as dynamic with `torch._dynamo.mark_dynamic()` or fix the source of guard instability
Accuracy Issues
The compiled model produces different numerical results than eager mode.
Diagnosis:
TORCHDYNAMO_REPRO_AFTER=aot TORCHDYNAMO_REPRO_LEVEL=4 python your_script.py
This compares compiled vs. eager with an fp64 reference and dumps a repro if accuracy fails.
Key utilities:
- `torch/_dynamo/debug_utils.py` -- `same_two_models()`, `backend_accuracy_fails()`, `cast_to_fp64()`
- `torch._dynamo.config.repro_tolerance` (default: 1e-3)
Fix approach:
- Get the minimal failing graph from the minifier
- Compare eager vs. compiled output at fp64 precision
- Binary search through ops to find the diverging operation
- Check for known numerical issues (reduction order, fused kernels, dtype promotions)
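The idea behind the fp64 reference is that eager fp32 is itself inexact, so both eager and compiled outputs are judged against a higher-precision run. The sketch below is a much simplified stand-in for that comparison, not the logic of `same_two_models()`; `errors_vs_fp64` is a hypothetical helper, and `backend="eager"` is used only so the example runs without a compiler toolchain.

```python
import torch


def errors_vs_fp64(fn, compiled_fn, *inputs):
    """Return (eager_err, compiled_err): max abs error against an fp64 run."""
    ref = fn(*(t.double() for t in inputs))  # higher-precision reference
    eager_err = (fn(*inputs).double() - ref).abs().max().item()
    compiled_err = (compiled_fn(*inputs).double() - ref).abs().max().item()
    return eager_err, compiled_err


def fn(x):
    return torch.softmax(x, dim=-1).sum(dim=-1)


x = torch.randn(8, 16)
eager_err, compiled_err = errors_vs_fp64(fn, torch.compile(fn, backend="eager"), x)
print(f"eager vs fp64: {eager_err:.2e}, compiled vs fp64: {compiled_err:.2e}")
```

A compiled error that is comparable to the eager error usually points to benign reordering; a compiled error that is orders of magnitude larger is the case worth bisecting.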
Internal Dynamo Errors
InternalTorchDynamoError indicates a bug in Dynamo itself.
Diagnosis:
TORCHDYNAMO_VERBOSE=1 python your_script.py
# or equivalently:
TORCH_LOGS="+dynamo" python your_script.py
Key files:
- `torch/_dynamo/symbolic_convert.py` -- bytecode interpreter
- `torch/_dynamo/variables/` -- variable tracking system
- `torch/_dynamo/guards.py` -- guard generation
Fix approach:
- Get the full stack trace with `TORCHDYNAMO_VERBOSE=1`
- Identify which bytecode instruction or variable type caused the crash
- Create a minimal repro (the error message often includes a minifier path)
- Debug with `TORCHINDUCTOR_COMPILE_THREADS=1` and pdb if needed
Runtime Crashes
Segfaults and CUDA illegal memory access errors during execution of compiled code.
Diagnosis (make crash deterministic):
PYTORCH_NO_CUDA_MEMORY_CACHING=1 CUDA_LAUNCH_BLOCKING=1 python your_script.py
For CUDA IMA, add NaN checks:
TORCHINDUCTOR_NAN_ASSERTS=1 python your_script.py
For Inductor-level sync debugging:
torch._inductor.config.triton.debug_sync_kernel = True # sync after every kernel
torch._inductor.config.triton.debug_sync_graph = True # sync before/after graph
Fix approach:
- Make the crash deterministic with `PYTORCH_NO_CUDA_MEMORY_CACHING=1 CUDA_LAUNCH_BLOCKING=1`
- Check if it's an input mismatch (shapes, devices, dtypes)
- Inspect the generated kernel code with `TORCH_LOGS="output_code"`
- Use `TORCHINDUCTOR_NAN_ASSERTS=1` to find the first kernel producing bad values
- Check for dynamic shapes issues (historically a common source of IMA)
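Because these variables generally need to be in place before CUDA is initialized, one option is to rerun the repro in a child process with the environment set explicitly. The sketch below uses only the standard library; `run_deterministic` is a hypothetical helper, and the child command simply echoes one variable to show the mechanism.

```python
import os
import subprocess
import sys


def run_deterministic(argv):
    """Run argv in a child process with the crash-determinism variables set."""
    env = dict(
        os.environ,
        PYTORCH_NO_CUDA_MEMORY_CACHING="1",
        CUDA_LAUNCH_BLOCKING="1",
    )
    return subprocess.run(argv, env=env, capture_output=True, text=True)


# Stand-in for `python your_script.py`: the child prints what it sees.
result = run_deterministic(
    [sys.executable, "-c", "import os; print(os.environ['CUDA_LAUNCH_BLOCKING'])"]
)
print(result.stdout.strip(), result.returncode)
```

In a real session, replace the `-c` command with the path to the repro script and inspect `result.stderr` for the now-synchronous traceback.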
Triton Kernel Failures
Triton assertion failures or index-out-of-bounds in generated kernels.
Diagnosis:
TORCH_LOGS="output_code,schedule" python your_script.py
Key files:
- `torch/_inductor/codegen/triton.py` -- Triton codegen
- `torch/_inductor/scheduler.py` -- kernel fusion decisions
Fix approach:
- Get the generated Triton kernel from `output_code` logs
- Check index computations for off-by-one or wrong stride calculations
- Look at the IR (`TORCH_COMPILE_DEBUG=1`) to trace back to the FX op
- Check if fusion decisions created invalid index combinations
Key Source Files
| File | Purpose |
|---|---|
| `torch/_dynamo/exc.py` | Exception hierarchy and error formatting |
| `torch/_dynamo/debug_utils.py` | Minifier support, accuracy checking, input serialization |
| `torch/_dynamo/repro/after_dynamo.py` | Repro/minifier for Dynamo-stage failures |
| `torch/_dynamo/repro/after_aot.py` | Repro/minifier for post-AOTAutograd failures |
| `torch/_dynamo/repro/aoti.py` | Repro/minifier for AOTI failures |
| `torch/_dynamo/config.py` | Dynamo config (repro levels, recompile limits) |
| `torch/_dynamo/variables/torch.py` | Torch function handling, tracing state functions |
| `torch/_dynamo/variables/higher_order_ops.py` | HOP tracing (cond, map, etc.) |
| `torch/_dynamo/symbolic_convert.py` | Bytecode interpreter, `InstructionTranslator` |
| `torch/_dynamo/convert_frame.py` | Frame compilation, `fullgraph_capture` entry point |
| `torch/_dynamo/functional_export.py` | New export tracer (`_dynamo_graph_capture_for_export`) |
| `torch/_dynamo/eval_frame.py` | `torch._dynamo.export`, `optimize_assert` |
| `torch/_export/_trace.py` | Export pipeline (`_export`, `_strict_export`, `_non_strict_export`, `_export_to_aten_ir`) |
| `torch/_export/utils.py` | `_compiling_state_context()` |
| `torch/compiler/__init__.py` | `is_compiling()`, `is_exporting()`, runtime flags |
| `torch/_higher_order_ops/cond.py` | `torch.cond` implementation and proxy tracing |
| `torch/_higher_order_ops/utils.py` | `reenter_make_fx` for HOP branch tracing |
| `torch/_inductor/config.py` | Inductor config (debug flags, trace settings) |
| `torch/_inductor/debug.py` | `DebugContext`, graph visualization, IR logging |
| `torch/_logging/_registrations.py` | All registered log aliases and artifacts |
Using the Minifier
The minifier reduces a failing graph to the smallest reproduction:
# Step 1: Generate the minifier launcher
TORCHDYNAMO_REPRO_AFTER=aot TORCHDYNAMO_REPRO_LEVEL=2 python your_script.py
# Step 2: Run the minifier
python minifier_launcher.py minify
# Step 3: Run the minimized repro
python minifier_launcher.py run
For accuracy issues, use level 4:
TORCHDYNAMO_REPRO_AFTER=aot TORCHDYNAMO_REPRO_LEVEL=4 python your_script.py
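The two environment variables correspond to `repro_after` and `repro_level` in `torch/_dynamo/config.py`, so inside a test or notebook the same settings can be scoped with the config `patch` context manager. This is a sketch under the assumption that the settings are read when compilation happens; the comment marks where the failing model would be compiled and run.

```python
import torch
import torch._dynamo.config as dynamo_config

# Scope the minifier settings to one block instead of the whole process.
with dynamo_config.patch(repro_after="aot", repro_level=4):
    inside = (dynamo_config.repro_after, dynamo_config.repro_level)
    # Compile and run the failing model here, e.g. torch.compile(model)(*inputs).

# On exit the previous values are restored.
restored = (dynamo_config.repro_after, dynamo_config.repro_level)
print(inside, restored)
```

Scoping the change this way avoids leaving accuracy checking enabled for unrelated compilations later in the same process.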
