Conversation


@bkryu bkryu commented Dec 11, 2025

📌 Description

Flight Recorder: Level 10 API Logging with Tensor Dumps
This PR provides a "Flight Recorder" mechanism within FlashInfer that captures detailed API execution traces (inputs, outputs, metadata) to enable exact reproduction of bugs, performance regression testing, and offline benchmarking.

The feature is implemented as Level 10 of the existing API logging module. The expected workflow is:

  • Capture (Online):
    • The user sets FLASHINFER_LOGLEVEL=10, then makes a FlashInfer API call.
    • Before API execution: inputs are serialized to inputs.pt (or inputs.safetensors) and metadata to metadata.json.
    • API execution: the actual kernel is invoked.
    • After API execution: the result is serialized to outputs.pt (or outputs.safetensors) and completion status is appended to metadata.jsonl.
  • Storage:
    • FLASHINFER_DUMP_DIR sets the dump location (default: flashinfer_dumps).
    • Each call gets a unique directory: {timestamp}_pid{process_id}_{func_name}_call{sequence}.
    • Tensor data is stored using torch.save (default) or optionally safetensors.
    • Metadata uses JSONL format (JSON Lines) for crash-safe, append-only logging.
    • A central session.jsonl provides a quick-scan overview of all recorded API calls.
    • Non-contiguous tensor layouts are preserved, since they are needed to reproduce certain API calls such as mm_fp4().
  • Replay (Offline):
    • User runs flashinfer replay --dir <dump_path>.
    • System loads inputs, dynamically resolves the function, executes it, and compares results.
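The directory-naming and append-only metadata conventions described above can be sketched as follows. This is an illustrative stand-in, not the PR's actual code: `dump_dir_name` and `append_jsonl` are hypothetical helper names, and the exact timestamp format and zero-padding width of the call sequence are assumptions.

```python
import json
import os
import time


def dump_dir_name(func_name: str, sequence: int) -> str:
    # Mirrors the {timestamp}_pid{process_id}_{func_name}_call{sequence}
    # scheme; timestamp format and zero-padding are assumptions.
    ts = time.strftime("%Y%m%d_%H%M%S")
    return f"{ts}_pid{os.getpid()}_{func_name}_call{sequence:04d}"


def append_jsonl(path: str, record: dict) -> None:
    # JSON Lines: one record per line, append-only, so a crash mid-run
    # leaves all previously written records intact and parseable.
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

For example, `dump_dir_name("bmm_fp8", 1)` produces a name ending in `_pid<pid>_bmm_fp8_call0001`, and each `append_jsonl` call adds one self-contained line, which is what makes the metadata crash-safe.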

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| `FLASHINFER_DUMP_DIR` | `flashinfer_dumps` | Directory to save dump files |
| `FLASHINFER_DUMP_MAX_SIZE_GB` | `20` | Maximum total dump size in GB |
| `FLASHINFER_DUMP_MAX_COUNT` | `1000` | Maximum number of API calls to dump |
| `FLASHINFER_DUMP_INCLUDE` | (empty) | Comma-separated fnmatch patterns to include |
| `FLASHINFER_DUMP_EXCLUDE` | (empty) | Comma-separated fnmatch patterns to exclude |
| `FLASHINFER_DUMP_SAFETENSORS` | `0` | Set to `1` to use safetensors format (no pickle, but loses stride info) |

Dump Filtering
For large-scale inference with many API calls, users can filter which APIs are dumped:

# Only dump decode-related APIs
export FLASHINFER_DUMP_INCLUDE="*decode*"

# Dump everything except __init__ and plan methods
export FLASHINFER_DUMP_EXCLUDE="*.__init__,*.plan"

# Only dump run() methods from wrapper classes
export FLASHINFER_DUMP_INCLUDE="*Wrapper.run"

Filter logic:

  1. If FLASHINFER_DUMP_INCLUDE is set, only APIs matching at least one pattern are dumped.
  2. If FLASHINFER_DUMP_EXCLUDE is set, APIs matching any pattern are skipped.
  3. Both can be combined: the include filter is applied first, then the exclude filter.
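The filter logic above amounts to a small predicate. The following sketch is illustrative (the function name `should_dump` is hypothetical), using `fnmatch` as the pattern matcher named in the variable descriptions:

```python
from fnmatch import fnmatch


def should_dump(api_name: str, include: list, exclude: list) -> bool:
    # Include filter first: if set, the name must match at least one pattern.
    if include and not any(fnmatch(api_name, p) for p in include):
        return False
    # Exclude filter second: matching any pattern skips the dump.
    if exclude and any(fnmatch(api_name, p) for p in exclude):
        return False
    return True
```

With `include=["*Wrapper.run"]`, for instance, `BatchDecodeWithPagedKVCacheWrapper.run` is dumped while the wrapper's `plan` calls are not.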

Implementation Details

  • Tensor Serialization: Uses torch.save by default instead of raw bytes or simplified custom formats. FlashInfer kernels often rely on specific memory layouts (strides, non-contiguous views). torch.save preserves these properties faithfully, whereas reconstructing them from shape/stride metadata is error-prone and complex. An optional safetensors format is available for pickle-free serialization, though it does not preserve tensor strides.
  • Multi-Process Support:
    • Dump directory names include the process ID (pid{PID}) for identification in multi-GPU environments.
    • The replay system uses a composite registry key (process_id, self_id) to prevent object collisions when replaying dumps from multiple processes.
  • Stateful Replay: To support stateful APIs (e.g., BatchDecodeWithPagedKVCacheWrapper), the replay system tracks object identity across init, plan(), and run() calls. The object_registry maps (process_id, self_id) tuples to live object instances during replay.
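The composite-key registry for stateful replay can be sketched as below. `ObjectRegistry` is a hypothetical stand-in for the PR's `object_registry`, shown only to illustrate why the key must include the process ID: keying on `self_id` alone would collide when dumps from two processes happen to reuse the same `id()` value for different objects.

```python
class ObjectRegistry:
    """Maps (process_id, self_id) -> live object instance during replay."""

    def __init__(self):
        self._objects = {}

    def get_or_create(self, process_id, self_id, factory):
        # Composite key prevents cross-process id() collisions.
        key = (process_id, self_id)
        if key not in self._objects:
            # e.g. re-run the recorded __init__ to rebuild the wrapper
            self._objects[key] = factory()
        return self._objects[key]
```

During replay of a plan()/run() pair, both calls resolve to the same reconstructed wrapper instance because they share the (process_id, self_id) key.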

🔍 Related Issues

#2170

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • New Features

    • Level 10 "Flight Recorder": full input/output dumps with replayable session format and a CLI replay command reporting per-item outcomes.
  • Configuration

    • Environment-configurable dump settings (dir, size/count limits), include/exclude filtering patterns, process-id substitution, and optional SafeTensors output.
  • Documentation

    • Expanded docs covering dump layout, JSONL/session format, filtering examples, manual replay workflows, and replay examples.
  • Tests

    • Comprehensive tests validating Level 10 logging, safetensors, filtering, sequencing, and CLI replay.



coderabbitai bot commented Dec 11, 2025

📝 Walkthrough

Walkthrough

Adds a Level 10 "Flight Recorder": crash-safe input/output tensor dumps (torch/safetensors), include/exclude filtering, session JSONL metadata, replay_from_dump/replay_sequence, a CLI replay command, extensive docs, and tests. No public API signatures changed.

Changes

| Cohort | File(s) | Summary |
| --- | --- | --- |
| Flight Recorder & Replay | `flashinfer/api_logging.py`, `flashinfer/__main__.py` | Implements Level 10 dumps (env-configurable), input/output serialization (torch / safetensors), include/exclude filtering, session & per-dump JSONL metadata, crash-safe dump writing, reconstruction & replay helpers, replay_from_dump / replay_sequence, and CLI replay command. |
| Documentation | `docs/logging.rst` | Adds Level 10 docs: Dump Configuration, SafeTensors guidance, Dump Filtering patterns/logic, process-id substitution, dump directory/JSONL structure, crash-safe semantics, and replay/workflow examples. |
| Tests | `tests/utils/test_logging_replay.py` | Adds comprehensive tests for Level 10 dumping and replay, include/exclude filtering, safetensors handling, session JSONL accumulation, replay verification, and CLI replay behaviors. |
| Benchmarks | `benchmarks/routines/gemm.py` | Removes backend gating for FP4 quantization: mat2_fp4_trtllm (and inv scale) now computed unconditionally. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Decorator as flashinfer_api
    participant DumpSystem as Dump
    participant Storage as JSONL/SafeTensors
    participant ReplayEngine as Replay

    Client->>Decorator: call API (Level 10)
    Decorator->>DumpSystem: _dump_function_inputs()
    DumpSystem->>Storage: write session.jsonl + inputs.(pt|safetensors)
    Decorator->>Decorator: execute wrapped function
    Decorator->>DumpSystem: _dump_function_outputs()
    DumpSystem->>Storage: append outputs.(pt|safetensors) + metadata
    Client->>ReplayEngine: replay_from_dump(dump_dir) / replay_sequence
    ReplayEngine->>Storage: read session.jsonl and tensors
    ReplayEngine->>ReplayEngine: reconstruct inputs, execute, compare outputs
    ReplayEngine->>Client: report pass / fail / mismatch

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested reviewers

  • aleozlx
  • cyx-6
  • jiahanc
  • djmmoss
  • wenscarl
  • yzh119

Poem

🐰 I logged the hops of every byte,

Inputs, outputs saved at night,
Session trails in JSON rows,
Replay spins where data goes,
A rabbit's debug hop — delight.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: implementation of Input/Output Dump and Replay Mode for API Logging Level 10 (Flight Recorder). |
| Description check | ✅ Passed | The description is comprehensive and well-structured, covering purpose, workflow, environment variables, filtering logic, and implementation details. However, the test checklist boxes show tests have not been marked as added/passing. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 95.56% which is sufficient. The required threshold is 80.00%. |



@gemini-code-assist

Summary of Changes

Hello @bkryu, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a powerful new "Flight Recorder" mode (logging level 10) to the FlashInfer library, designed to enhance debugging and reproducibility. By enabling this mode, users can automatically dump all input and output tensors, along with call metadata, for any FlashInfer API call to disk. This allows for precise recreation of API execution environments. Complementing this, a new flashinfer replay command-line tool and Python API are provided, enabling users to easily load these dumps and re-execute the recorded API calls, with options for output comparison and handling of stateful objects. This feature is crucial for isolating bugs, verifying numerical stability, and ensuring consistent behavior across different environments.

Highlights

  • New Logging Level 10 (Flight Recorder): Introduces a new logging level that dumps all input and output tensors to .pt files for full reproducibility and debugging purposes.
  • Configurable Dump Behavior: Adds new environment variables (FLASHINFER_DUMP_DIR, FLASHINFER_DUMP_MAX_SIZE_GB, FLASHINFER_DUMP_MAX_COUNT) to control the directory, maximum size, and maximum count of generated tensor dumps.
  • API Call Replay Mechanism: Implements core logic to serialize function arguments (tensors and non-tensors) and deserialize them for replaying API calls, including handling stateful objects and comparing replayed outputs against original outputs.
  • flashinfer replay CLI Command: Adds a new command-line interface tool to easily replay single dumped API calls or sequences of calls from a specified directory, providing clear pass/fail/mismatch feedback.
  • Comprehensive Documentation & Tests: Updates the logging.rst documentation with detailed explanations and examples for using the new Flight Recorder and replay features, and adds a new test suite to verify its functionality across various FlashInfer APIs.
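The output-comparison step mentioned above follows a recursive pattern (visible in the `_compare_results` snippet quoted later in this review). The sketch below is a pure-Python stand-in with a hypothetical name; it omits the tensor branch, which in the real code would use `torch.allclose`:

```python
import math


def compare_results(actual, expected, rtol=1e-3, atol=1e-3):
    # Scalars: tolerance-based comparison, mirroring allclose semantics.
    if isinstance(actual, (int, float)) and isinstance(expected, (int, float)):
        return math.isclose(actual, expected, rel_tol=rtol, abs_tol=atol)
    # Containers: lengths must agree, then compare element-wise.
    if isinstance(actual, (list, tuple)):
        if not isinstance(expected, (list, tuple)) or len(actual) != len(expected):
            return False
        return all(
            compare_results(a, e, rtol, atol) for a, e in zip(actual, expected)
        )
    # Everything else: exact equality.
    return actual == expected
```

The explicit length check before the element-wise pass is what makes a `strict=` flag on `zip` redundant here, a point raised in the review below.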
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a powerful 'Flight Recorder' mode (logging level 10) for dumping and replaying API calls, which is a fantastic feature for debugging and reproducibility. The implementation is very thorough, covering stateful class replays, CLI integration, and comprehensive testing. The documentation is also excellent and provides clear examples. I have one suggestion regarding Python version compatibility.

return False
return all(
_compare_results(a, e, rtol, atol)
for a, e in zip(actual, expected, strict=True)

medium

The strict=True argument for zip was introduced in Python 3.10. If this project supports older Python versions, this will raise a TypeError. Since you already perform a length check on lines 520-524, the strict=True is redundant for correctness and can be safely removed to ensure broader compatibility.

Suggested change
for a, e in zip(actual, expected, strict=True)
for a, e in zip(actual, expected)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (7)
docs/logging.rst (1)

450-450: Consider using a more realistic timestamp in the example path.

The timestamp 20251234_456789_012 contains invalid values (month 12, day 34). While this is just an example, using a valid-looking timestamp like 20251204_143217_012 would be more consistent with the other examples in the documentation.

-    dump_dir = Path("./bmm_fp8_dumps/20251234_456789_012_bmm_fp8_call0001")
+    dump_dir = Path("./bmm_fp8_dumps/20251204_143217_012_bmm_fp8_call0001")
flashinfer/api_logging.py (5)

270-272: Potential race condition when cleaning up empty directory.

If another thread/process writes to dump_dir between the size check failure and rmdir(), the operation will fail with OSError. Consider using shutil.rmtree with ignore_errors=True or wrapping in a try-except.

             # Clean up empty directory
-            dump_dir.rmdir()
+            try:
+                dump_dir.rmdir()
+            except OSError:
+                pass  # Directory may have been modified by another process
             return None

341-346: Consider using logging.exception for automatic traceback inclusion.

Per static analysis hints, logging.exception automatically includes the traceback, making the code cleaner.

     except Exception as e:
-        _logger.error(f"Failed to dump function call {func_name}: {e}")
-        import traceback
-
-        _logger.error(traceback.format_exc())
+        _logger.exception(f"Failed to dump function call {func_name}: {e}")
         return None

437-441: Same suggestion: use logging.exception for cleaner traceback logging.

     except Exception as e:
-        _logger.error(f"Failed to dump outputs to {dump_dir}: {e}")
-        import traceback
-
-        _logger.error(traceback.format_exc())
+        _logger.exception(f"Failed to dump outputs to {dump_dir}: {e}")

778-783: Consider using logging.exception for execution failures.

This would automatically include the traceback.

             except Exception as e:
-                _logger.error(f"Execution failed: {e}")
-                import traceback
-
-                _logger.error(traceback.format_exc())
+                _logger.exception(f"Execution failed: {e}")
                 result_dict["execution_error"] = str(e)

1356-1363: Consider using _logger.debug instead of _logger.info for output dump success.

Line 1360 uses _logger.info while the corresponding input dump success on line 1325 uses _logger.debug. For consistency, consider using the same log level.

             if _API_LOG_LEVEL >= 10 and dump_dir:
                 try:
                     _dump_function_outputs(dump_dir, result)
-                    _logger.info(f"Outputs dumped to: {dump_dir}")
+                    _logger.debug(f"Outputs dumped to: {dump_dir}")
                 except Exception as e:
                     _logger.error(f"[DUMP ERROR (outputs) in {func_name}]: {e}")
flashinfer/__main__.py (1)

376-397: Consider adding a count for mismatches in the summary.

Currently, mismatches and errors are both counted as "failed". For better debugging, consider tracking them separately.

             passed = 0
             failed = 0
+            mismatch = 0

             for i, res in enumerate(results):
                 # ... existing code ...
                 if "error" in res:
                     # ... existing code ...
                     failed += 1
                 elif res.get("comparison_match"):
                     # ... existing code ...
                     passed += 1
                 else:
                     click.secho(f"[{i + 1}] {dump_name}: ⚠️  Mismatch", fg="yellow")
-                    failed += 1
+                    mismatch += 1

             click.secho(
-                f"\nSummary: {passed} passed, {failed} failed/mismatch", fg="white"
+                f"\nSummary: {passed} passed, {failed} failed, {mismatch} mismatch", fg="white"
             )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3a29167 and 66fe74e.

📒 Files selected for processing (4)
  • docs/logging.rst (3 hunks)
  • flashinfer/__main__.py (1 hunks)
  • flashinfer/api_logging.py (9 hunks)
  • tests/utils/test_logging_replay.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
flashinfer/__main__.py (1)
flashinfer/api_logging.py (2)
  • replay_sequence (797-858)
  • replay_from_dump (547-794)
tests/utils/test_logging_replay.py (4)
flashinfer/utils.py (1)
  • get_compute_capability (253-256)
flashinfer/api_logging.py (1)
  • replay_sequence (797-858)
flashinfer/autotuner.py (1)
  • autotune (251-262)
flashinfer/__main__.py (1)
  • cli (68-73)
flashinfer/api_logging.py (1)
include/flashinfer/logging.h (1)
  • logging (31-41)
🪛 Ruff (0.14.8)
flashinfer/__main__.py

403-403: Do not catch blind exception: Exception

(BLE001)

flashinfer/api_logging.py

154-154: Do not catch blind exception: Exception

(BLE001)


322-322: Do not catch blind exception: Exception

(BLE001)


341-341: Do not catch blind exception: Exception

(BLE001)


342-342: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


345-345: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


437-437: Do not catch blind exception: Exception

(BLE001)


438-438: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


441-441: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


484-484: Consider moving this statement to an else block

(TRY300)


485-485: Do not catch blind exception: Exception

(BLE001)


597-597: Avoid specifying long messages outside the exception class

(TRY003)


602-602: Avoid specifying long messages outside the exception class

(TRY003)


614-614: Avoid specifying long messages outside the exception class

(TRY003)


724-724: Consider moving this statement to an else block

(TRY300)


725-725: Do not catch blind exception: Exception

(BLE001)


726-726: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


778-778: Do not catch blind exception: Exception

(BLE001)


779-779: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


782-782: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


818-818: Avoid specifying long messages outside the exception class

(TRY003)


853-853: Do not catch blind exception: Exception

(BLE001)


855-855: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


1314-1315: try-except-pass detected, consider logging the exception

(S110)


1314-1314: Do not catch blind exception: Exception

(BLE001)


1326-1326: Do not catch blind exception: Exception

(BLE001)


1327-1327: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


1353-1353: Do not catch blind exception: Exception

(BLE001)


1354-1354: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


1361-1361: Do not catch blind exception: Exception

(BLE001)


1362-1362: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Deploy Docs
🔇 Additional comments (16)
docs/logging.rst (2)

48-51: LGTM - Clear documentation for Level 10 Flight Recorder.

The logging levels table is well-structured and clearly explains the new Level 10 capabilities.


76-100: LGTM - Dump configuration is well-documented.

The environment variables table provides clear defaults and descriptions for the Level 10 dump behavior.

flashinfer/api_logging.py (7)

54-57: Thread-safety consideration for global dump state.

The global counters _dump_count, _dump_total_size_bytes, and _dump_call_counter are accessed and modified without synchronization. In multi-threaded environments (e.g., concurrent API calls), this could lead to race conditions causing inaccurate counts or size tracking. Consider using threading.Lock if thread-safety is required, or document that Level 10 logging is not thread-safe.


102-115: Good security warning for Level 10 dumping.

The explicit warning about sensitive data and the direct use of print() ensures visibility. This is appropriate for a security-related warning.


183-184: Tensor CPU transfer may cause GPU synchronization.

The arg.cpu() call triggers a CUDA synchronization. This is expected behavior for Level 10 logging but worth noting that it will impact performance. The documentation already mentions Level 10 is for "Full Reproducibility / Debugging" which implies performance is not a concern.


473-490: LGTM - Dynamic function resolution with appropriate error handling.

The function properly handles nested class methods and logs warnings on failure without crashing.


547-594: Well-documented replay function with comprehensive parameter descriptions.

The docstring clearly explains all parameters and return values, including the stateful object replay mechanism via object_registry.


797-858: LGTM - Sequence replay with proper state management.

The function correctly maintains an object_registry across replay calls to handle stateful objects, and gracefully continues on individual failures.


1317-1328: Good crash-safe design: inputs dumped before execution.

Dumping inputs before calling the function ensures that debugging information is available even if the function crashes with a CUDA error.

flashinfer/__main__.py (2)

334-349: LGTM - Clean CLI command structure with proper validation.

The command validates the dump directory existence before proceeding and correctly distinguishes between single dumps and session directories.


403-404: Broad exception catch is appropriate here for CLI robustness.

For a CLI tool, catching all exceptions and displaying a user-friendly error message is the right approach to avoid exposing stack traces to end users.

tests/utils/test_logging_replay.py (5)

39-43: Module cleanup for reimport is necessary but has side effects.

Deleting modules from sys.modules forces reimport but may cause issues if other tests depend on cached module state. This is acceptable for isolated test fixtures.


46-77: Well-structured fixture with proper cleanup.

The fixture correctly saves and restores the original environment, and forces module reimport to pick up the new configuration.


80-114: Helper function is clear and reusable.

The verify_and_replay_dump helper encapsulates the common verification logic, making tests more readable. The filtering by function name is a useful feature for multi-call dumps.


327-348: Good integration test for CLI replay.

Testing the CLI command with CliRunner ensures the full integration path works correctly.


147-151: The review comment is incorrect. The mm_fp8 function is explicitly documented as "Only supported on Blackwell GPUs" in flashinfer/trtllm_low_latency_gemm.py (line 127), which uses the trtllm_low_latency_gemm backend that implements the actual computation. The compute capability check compute_capability[0] not in [10] correctly restricts execution to Blackwell (SM100/SM103, compute capability 10.x) and does not need adjustment. There is no Hopper (9.x) support, and the test skip message accurately reflects this hardware limitation.

Likely an incorrect or invalid review comment.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 66fe74e and 23c7264.

📒 Files selected for processing (1)
  • docs/logging.rst (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Deploy Docs
🔇 Additional comments (7)
docs/logging.rst (7)

48-51: LGTM!

The table entry clearly presents the new Level 10 feature and its position in the logging hierarchy. The description effectively communicates the key benefit (full input/output dumps for reproducibility).


70-70: LGTM!

The environment variable documentation is updated consistently with the new logging level table.


150-194: Verify CLI replay output format and clarify loglevel behavior.

The example shows a comprehensive workflow, but a few details need verification:

  1. The example output format (lines 171–194) should be verified against actual flashinfer replay CLI implementation.
  2. Line 166 mentions setting FLASHINFER_LOGLEVEL=0 # 1 for more detailed replay results but doesn't explain the difference. Consider clarifying what level 0 vs. 1 produces during replay.
  3. The example command (line 162) is extensive and may be difficult for users to reproduce without the exact environment. Consider adding a note about prerequisites or simplifying to a minimal reproducible example if possible.

76-101: Documentation is accurate—all dump configuration variables match implementation.

The environment variables in the documentation are correctly specified:

  • FLASHINFER_DUMP_DIR (str, default "flashinfer_dumps") ✓
  • FLASHINFER_DUMP_MAX_SIZE_GB (float, default 20) ✓
  • FLASHINFER_DUMP_MAX_COUNT (int, default 1000) ✓

All variable names, default values, and type specifications match the implementation in flashinfer/api_logging.py and are validated by test code.


206-251: The bmm_fp8() call at line 246 is correct. The function signature in flashinfer/gemm/gemm_base.py accepts 7 parameters (5 required positional + 2 optional), and the example provides all required arguments in the correct order: bmm_fp8(input_fp8, mat2_fp8, input_inv_s, mat2_inv_s, res_dtype, res, backend="cublas"). No changes needed.


268-306: Verify replay_from_dump API and return structure.

The reproducer script relies on the replay_from_dump() function returning a dictionary with keys "args", "kwargs", and "expected_tensors". Please verify:

  1. The function signature matches: replay_from_dump(dump_dir, compare_outputs=True, device="cuda", run=False) (note: parameter is dump_dir, not dir_path)
  2. The return value structure includes the expected keys ✓
  3. The return value from res on line 292 is compatible with the bmm_fp8() call signature ✓

378-433: The reproducer script correctly uses the replay_sequence API. The function signature matches the expected parameters (root_dir and device="cuda"), and the result dictionaries contain all expected keys: "metadata" (with "function_name"), "dump_dir", "error" (optional), "comparison_match", and "execution_result". The data access patterns in the script are correct and align with the actual implementation.

Comment on lines +351 to +531
# Create workspace and wrapper - __init__ will be logged
workspace_buffer = torch.empty(32 * 1024 * 1024, dtype=torch.int8, device="cuda")
wrapper = flashinfer.decode.BatchDecodeWithPagedKVCacheWrapper(workspace_buffer, kv_layout)
# Plan - will be logged
wrapper.plan(
kv_indptr,
kv_indices,
kv_last_page_len,
num_qo_heads,
num_kv_heads,
head_dim,
page_size,
data_type=torch.float16,
q_data_type=torch.float16,
)
# Run - will be logged
output, lse = wrapper.run(q, kv_data, return_lse=True)
# Print a small portion of the output
print("Output shape:", output.shape)
print("Output[0, :3, :3]:")
print(output[0, :3, :3])
print("\nLSE shape:", lse.shape)
print("LSE[0, :5]:", lse[0, :5])

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -type f \( -name "*.py" -o -name "*.pyx" -o -name "*.pxd" \) | head -20

Repository: flashinfer-ai/flashinfer

Length of output: 968


🏁 Script executed:

rg "class BatchDecodeWithPagedKVCacheWrapper" -A 30 --type=py

Repository: flashinfer-ai/flashinfer

Length of output: 1880


🏁 Script executed:

rg "def plan\(" -B 5 -A 20 --type=py | grep -A 20 "BatchDecodeWithPagedKVCache"

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

rg "def __init__" -A 15 flashinfer/decode.py | head -50

Repository: flashinfer-ai/flashinfer

Length of output: 1993


🏁 Script executed:

rg "class BatchDecodeWithPagedKVCacheWrapper" -A 100 flashinfer/decode.py | grep -E "def (plan|run|__init__)" -A 10

Repository: flashinfer-ai/flashinfer

Length of output: 529


🏁 Script executed:

sed -n '/class BatchDecodeWithPagedKVCacheWrapper/,/^class /p' flashinfer/decode.py | head -300

Repository: flashinfer-ai/flashinfer

Length of output: 13143


🏁 Script executed:

rg "def run\(" -A 30 flashinfer/decode.py | head -80

Repository: flashinfer-ai/flashinfer

Length of output: 3186


Fix BatchDecodeWithPagedKVCacheWrapper API usage in the producer script.

The example has several API inconsistencies with the actual implementation:

  1. Constructor parameter name (line 354): The first parameter is float_workspace_buffer, not workspace_buffer. Additionally, the tensor should be a float workspace buffer, but line 353 creates it with dtype=torch.int8 — it should be float or uint8 for a general workspace buffer.

  2. Deprecated parameter in plan() (line 361): The data_type parameter is deprecated. Use q_data_type and kv_data_type separately instead, or rely on q_data_type alone (which defaults to torch.float16).

  3. run() method parameter name (line 369): The second parameter is paged_kv_cache, not kv_data. The signature supports return_lse=True correctly.

🤖 Prompt for AI Agents
In docs/logging.rst around lines 351-376, update the
BatchDecodeWithPagedKVCacheWrapper usage: rename the workspace tensor variable
and constructor argument to float_workspace_buffer and create the tensor with an
appropriate float/uint8 dtype (not torch.int8); in the wrapper.plan(...) call
remove the deprecated data_type argument and supply q_data_type (and
kv_data_type if needed) instead; and in wrapper.run(...) pass the paged_kv_cache
as the second argument (rename kv_data to paged_kv_cache) while keeping
return_lse=True. Ensure names and types match the current API.

@bkryu bkryu added this to the 2025 Dec milestone Dec 11, 2025
@bkryu bkryu marked this pull request as draft December 11, 2025 19:11
@bkryu bkryu force-pushed the api_logging_with_dump branch 2 times, most recently from 8519e13 to 2aa395d Compare December 15, 2025 19:24
@bkryu bkryu force-pushed the api_logging_with_dump branch from 2aa395d to 3edb449 Compare December 18, 2025 20:56
@bkryu bkryu force-pushed the api_logging_with_dump branch from 3edb449 to f6e1922 Compare January 7, 2026 23:51
@bkryu bkryu marked this pull request as ready for review January 8, 2026 00:56
@bkryu bkryu requested a review from jimmyzho as a code owner January 8, 2026 00:56
@bkryu bkryu changed the title WIP feat: Input/output Dump + Replay Mode for API Logging Level 10 feat: Input/output Dump + Replay Mode for API Logging Level 10 Jan 8, 2026
@bkryu
Copy link
Collaborator Author

bkryu commented Jan 8, 2026

/bot run

@flashinfer-bot
Copy link
Collaborator

GitLab MR !229 has been created, and the CI pipeline #41318472 is currently running. I'll report back once the pipeline job completes.

@bkryu bkryu self-assigned this Jan 8, 2026
This function:
1. Extracts tensors and metadata from inputs
2. Creates a timestamped directory
3. Saves inputs.pt and partial metadata.json
Collaborator

@yzh119 yzh119 Jan 8, 2026

can we consider safetensor and jsonl (and align with flashinfer-bench's trace collection format)?

cc @yyihuang (seems your #2305 have lots of overlap with this PR), can you coordinate and reach a consensus?

Collaborator Author

This is great feedback. In the latest commit, I switched:

  • from metadata.json to metadata.jsonl.
  • from dumping only inputs.pt/outputs.pt to offering an option to dump as safetensors via FLASHINFER_DUMP_SAFETENSORS. As we discussed, dumping via torch.save (.pt) retains all non-contiguity and custom strides while safetensors loses them, so it's likely best to keep both.
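The crash-safety of the JSONL choice comes from JSON Lines being append-only: each record is one complete line, so a crash mid-run corrupts at most the final line. A minimal sketch in plain Python (file and field names here are illustrative, not the exact FlashInfer schema):

```python
import json
import tempfile
from pathlib import Path

def append_record(path: Path, record: dict) -> None:
    """Append one JSON object per line; each write is independent,
    so a crash mid-run leaves all earlier records intact."""
    with path.open("a") as f:
        f.write(json.dumps(record) + "\n")

def read_records(path: Path) -> list:
    """Read back complete records, skipping a truncated final line."""
    records = []
    for line in path.read_text().splitlines():
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            break  # partial write left by a crash
    return records

meta = Path(tempfile.mkdtemp()) / "metadata.jsonl"
append_record(meta, {"phase": "inputs_saved", "func": "mm_fp4"})
append_record(meta, {"phase": "completed", "status": "ok"})
print([r["phase"] for r in read_records(meta)])  # → ['inputs_saved', 'completed']
```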

Collaborator

We can generate a separate jsonl file with another workflow in flashinfer-bench.

@flashinfer-bot
Copy link
Collaborator

[FAILED] Pipeline #41318472: 7/20 passed


**Producer Script** (``batch_decode_producer.py``):

This script demonstrates logging with a stateful wrapper class that requires ``__init__``, ``plan``, and ``run`` calls.
Collaborator

@wenscarl wenscarl Jan 8, 2026

What if in the dump folder there are multiple init, plan, and runs. Is there a way to pick any combination of them in the replay mode? e.g. --init init_0 --plan plan_3 --run run_2.

Collaborator Author

When multiple wrapper instances are created with interleaved plan()/run() calls, the system correctly tracks them:

  1. Each dump's metadata.jsonl contains self_id (Python's id(self)) and process_id
  2. During replay_sequence(), the object_registry uses composite keys (process_id, self_id) to maintain separate instances
    • This composite key (process_id, self_id) was added in the past few days, after we tried out this PR last month to debug the MLA IMA issue and struggled with confusion 😓
  3. Chronological replay automatically routes each plan()/run() to the correct instance.

For example, with two interleaved instances:
__init__ (self_id=12345) → registry[(pid, 12345)] = new instance A
__init__ (self_id=67890) → registry[(pid, 67890)] = new instance B
plan() (pid, self_id=12345)   → calls A.plan()
plan() (pid, self_id=67890)   → calls B.plan()
run() (pid, self_id=67890)    → calls B.run()
run() (pid, self_id=12345)    → calls A.run()

Regarding mixing the function calls from different wrapper instances (--init init_0 --plan plan_3 --run run_2), the current implementation does not support it because it would not be a valid replay. One could potentially change the (pid, self_id) in the metadata files and force a replay, but combining calls from different instances would likely fail or produce incorrect results.
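The routing above can be sketched in plain Python; FakeWrapper, object_registry, and replay_call are illustrative stand-ins for the real wrapper classes and replay machinery, not FlashInfer code:

```python
# Route replayed calls to the right instance via the (process_id, self_id) key.
class FakeWrapper:
    """Stand-in for a FlashInfer wrapper class (illustrative only)."""
    def __init__(self, name):
        self.name = name
        self.calls = []
    def plan(self):
        self.calls.append("plan")
    def run(self):
        self.calls.append("run")
        return self.name

object_registry = {}

def replay_call(process_id, self_id, method, name=None):
    key = (process_id, self_id)
    if method == "__init__":
        object_registry[key] = FakeWrapper(name)  # one instance per composite key
        return None
    return getattr(object_registry[key], method)()

# Two interleaved instances within one process, as in the example above.
replay_call(100, 12345, "__init__", name="A")
replay_call(100, 67890, "__init__", name="B")
replay_call(100, 12345, "plan")
replay_call(100, 67890, "plan")
print(replay_call(100, 67890, "run"))  # → B
print(replay_call(100, 12345, "run"))  # → A
```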

Collaborator

The composite key is good enough! LGTM.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @flashinfer/api_logging.py:
- Around line 262-266: The broad except block that returns a fallback for
un-serializable values swallows errors and offers no diagnostics; change the
`except Exception:` to `except Exception as e:` and log the exception
(preferably via the module logger or `logging.exception`) including the value
and traceback before returning the fallback dict (which can keep `"type":
type(value).__name__` and `"repr": "<not serializable>"`). Ensure you reference
the same local symbols (`value`) and preserve the existing return shape while
adding the logging call so serialization failures are recorded.
- Around line 403-407: Replace calls to _logger.error(...) inside exception
handlers with _logger.exception(...) so the traceback is captured when
re-raising; for example change the block that logs "safetensors package not
installed. Install with: pip install safetensors" to use _logger.exception with
the same message before the raise, and apply the same change for the other
exception handlers referenced (the _logger.error occurrences around the
safetensors message and the ones at the other listed places). Keep the existing
messages and re-raise behavior unchanged—only swap _logger.error to
_logger.exception so traceback is included.
🧹 Nitpick comments (2)
benchmarks/routines/gemm.py (1)

1041-1046: Consider making trtllm quantization conditional to avoid wasted computation.

The mat2_fp4_trtllm quantization is now computed unconditionally, but it's only used when backend == "trtllm" (lines 1072, 1110). If trtllm is not in the backends list (e.g., filtered out by compute capability or user choice), this quantization step is wasted.

⚡ Suggested optimization

Move the quantization inside a conditional block or compute it lazily:

-    mat2_fp4_trtllm, mat2_inv_s_trtllm = flashinfer.nvfp4_quantize(
-        mat2,
-        global_sf_mat2,
-        sfLayout=flashinfer.SfLayout.layout_128x4,
-        do_shuffle=True,
-    )
+    # Compute trtllm-specific quantization only if needed
+    mat2_fp4_trtllm, mat2_inv_s_trtllm = None, None
+    if "trtllm" in backends:
+        mat2_fp4_trtllm, mat2_inv_s_trtllm = flashinfer.nvfp4_quantize(
+            mat2,
+            global_sf_mat2,
+            sfLayout=flashinfer.SfLayout.layout_128x4,
+            do_shuffle=True,
+        )

Then update line 1072 and 1110 to handle None gracefully, or ensure trtllm is in backends before those lines execute.

flashinfer/__main__.py (1)

455-457: Improve error reporting for replay failures.

The broad except Exception catch at line 455 hides the traceback, making it difficult to diagnose replay failures. Consider logging the full exception details.

🔧 Suggested improvement
     except Exception as e:
-        click.secho(f"❌ Replay failed: {e}", fg="red")
+        import traceback
+        click.secho(f"❌ Replay failed: {e}", fg="red")
+        if os.environ.get("FLASHINFER_LOGLEVEL", "0") != "0":
+            click.secho(traceback.format_exc(), fg="red")

This preserves the simple error message for users but provides detailed diagnostics when logging is enabled.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 23c7264 and 7dac1c5.

📒 Files selected for processing (5)
  • benchmarks/routines/gemm.py
  • docs/logging.rst
  • flashinfer/__main__.py
  • flashinfer/api_logging.py
  • tests/utils/test_logging_replay.py
🧰 Additional context used
📓 Path-based instructions (2)
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Test implementations should use flashinfer.utils functions (get_compute_capability, is_sm90a_supported, is_sm100a_supported, etc.) to skip tests on unsupported GPU architectures
For testing with mpirun on multi-GPU systems, use the pattern: mpirun -np <num_gpus> pytest tests/path/to/test.py::test_function
Avoid OOM (out-of-memory) errors in tests by using appropriate problem sizes - tests/conftest.py provides auto-skipping for OOM tests as a safety net but should not be relied upon

Files:

  • tests/utils/test_logging_replay.py
flashinfer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

flashinfer/**/*.py: Use @functools.cache decorator on Python API functions to implement module-level caching and avoid recompilation
Use @flashinfer_api decorator for debugging API calls, enable via FLASHINFER_LOGLEVEL environment variable (0=off, 1=basic, 3=detailed, 5=with stats)

Files:

  • flashinfer/api_logging.py
  • flashinfer/__main__.py
🧬 Code graph analysis (3)
tests/utils/test_logging_replay.py (2)
flashinfer/utils.py (1)
  • get_compute_capability (258-261)
flashinfer/api_logging.py (2)
  • replay_sequence (987-1049)
  • replay_from_dump (700-984)
flashinfer/api_logging.py (1)
tests/utils/test_logging.py (1)
  • run (297-298)
benchmarks/routines/gemm.py (1)
flashinfer/fp4_quantization.py (2)
  • nvfp4_quantize (816-873)
  • SfLayout (805-812)
🪛 Ruff (0.14.10)
flashinfer/api_logging.py

262-262: Do not catch blind exception: Exception

(BLE001)


403-406: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


457-457: Do not catch blind exception: Exception

(BLE001)


482-482: Do not catch blind exception: Exception

(BLE001)


483-483: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


486-486: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


590-590: Do not catch blind exception: Exception

(BLE001)


591-591: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


594-594: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


637-637: Consider moving this statement to an else block

(TRY300)


638-638: Do not catch blind exception: Exception

(BLE001)


752-752: Avoid specifying long messages outside the exception class

(TRY003)


757-757: Avoid specifying long messages outside the exception class

(TRY003)


761-761: Avoid specifying long messages outside the exception class

(TRY003)


777-780: Avoid specifying long messages outside the exception class

(TRY003)


782-784: Avoid specifying long messages outside the exception class

(TRY003)


855-858: Avoid specifying long messages outside the exception class

(TRY003)


914-914: Consider moving this statement to an else block

(TRY300)


915-915: Do not catch blind exception: Exception

(BLE001)


916-916: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


968-968: Do not catch blind exception: Exception

(BLE001)


969-969: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


972-972: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


1008-1008: Avoid specifying long messages outside the exception class

(TRY003)


1044-1044: Do not catch blind exception: Exception

(BLE001)


1046-1046: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


1505-1506: try-except-pass detected, consider logging the exception

(S110)


1505-1505: Do not catch blind exception: Exception

(BLE001)


1517-1517: Do not catch blind exception: Exception

(BLE001)


1518-1518: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


1544-1544: Do not catch blind exception: Exception

(BLE001)


1545-1545: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


1552-1552: Do not catch blind exception: Exception

(BLE001)


1553-1553: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

flashinfer/__main__.py

455-455: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Deploy Docs
🔇 Additional comments (4)
flashinfer/api_logging.py (2)

876-933: Excellent multi-process safety design.

The use of composite keys (process_id, self_id) for the object registry (line 882) is a well-thought-out design that prevents collisions in multi-GPU/multi-process environments where different processes may have objects at the same memory address.


114-136: Good security practice with the dump warning.

The _warn_dump() function provides clear warnings about the security implications of Level 10 logging, which is essential for preventing accidental leakage of sensitive data.

docs/logging.rst (1)

1-678: Excellent documentation for Level 10 logging.

The documentation is comprehensive, well-structured, and provides practical examples for both producer and reproducer scripts. The coverage of SafeTensors trade-offs, dump filtering patterns, and JSONL format details is particularly helpful.

tests/utils/test_logging_replay.py (1)

1-1067: Comprehensive test coverage for Level 10 logging.

This test suite provides excellent coverage of the Level 10 dump/replay functionality:

  • Multiple FlashInfer APIs (mm_fp8, bmm_fp8, mm_fp4, prefill/decode)
  • Dump filtering (include, exclude, combined patterns)
  • JSONL format verification
  • SafeTensors support
  • CLI replay path

The use of the level10_environment fixture ensures proper isolation between tests, and the verify_and_replay_dump helper provides consistent verification logic.

Comment on lines +262 to +266
except Exception:
return {
"type": type(value).__name__,
"repr": "<not serializable>",
}
Contributor

🛠️ Refactor suggestion | 🟠 Major

Improve exception handling in serialization.

The broad except Exception: catch silently returns a fallback value without logging the error. This makes debugging serialization issues difficult.

🔧 Suggested improvement
     except Exception:
+        _logger.warning(f"Failed to serialize {type(value).__name__}: {value!r}", exc_info=True)
         return {
             "type": type(value).__name__,
             "repr": "<not serializable>",
         }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
except Exception:
return {
"type": type(value).__name__,
"repr": "<not serializable>",
}
except Exception:
_logger.warning(f"Failed to serialize {type(value).__name__}: {value!r}", exc_info=True)
return {
"type": type(value).__name__,
"repr": "<not serializable>",
}
🧰 Tools
🪛 Ruff (0.14.10)

262-262: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In @flashinfer/api_logging.py around lines 262 - 266, The broad except block
that returns a fallback for un-serializable values swallows errors and offers no
diagnostics; change the `except Exception:` to `except Exception as e:` and log
the exception (preferably via the module logger or `logging.exception`)
including the value and traceback before returning the fallback dict (which can
keep `"type": type(value).__name__` and `"repr": "<not serializable>"`). Ensure
you reference the same local symbols (`value`) and preserve the existing return
shape while adding the logging call so serialization failures are recorded.

Comment on lines +403 to +407
_logger.error(
"safetensors package not installed. "
"Install with: pip install safetensors"
)
raise
Contributor

🛠️ Refactor suggestion | 🟠 Major

Use _logger.exception for better error context.

When logging errors in exception handlers, use _logger.exception() instead of _logger.error() to automatically include the traceback.

🔧 Suggested fix
                 except ImportError:
-                    _logger.error(
+                    _logger.exception(
                         "safetensors package not installed. "
                         "Install with: pip install safetensors"
                     )

Apply the same pattern to other exception handlers at lines 483-486, 591-594, 916, 969-972, 1046, 1518, 1553.
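The difference is easy to see in isolation: inside an except block, logging.exception attaches the active traceback automatically, while logging.error emits only the message. A generic sketch, not FlashInfer code:

```python
import io
import logging

logger = logging.getLogger("demo")
logger.propagate = False
stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))

try:
    raise ImportError("safetensors package not installed")
except ImportError:
    logger.error("plain error")         # message only, no traceback
    logger.exception("with traceback")  # message plus the full traceback

out = stream.getvalue()
print("Traceback (most recent call last):" in out)  # → True
```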

🧰 Tools
🪛 Ruff (0.14.10)

403-406: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

🤖 Prompt for AI Agents
In @flashinfer/api_logging.py around lines 403 - 407, Replace calls to
_logger.error(...) inside exception handlers with _logger.exception(...) so the
traceback is captured when re-raising; for example change the block that logs
"safetensors package not installed. Install with: pip install safetensors" to
use _logger.exception with the same message before the raise, and apply the same
change for the other exception handlers referenced (the _logger.error
occurrences around the safetensors message and the ones at the other listed
places). Keep the existing messages and re-raise behavior unchanged—only swap
_logger.error to _logger.exception so traceback is included.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🤖 Fix all issues with AI agents
In @flashinfer/api_logging.py:
- Around line 66-69: The global counters _dump_count, _dump_total_size_bytes,
and _dump_call_counter are modified without synchronization and can race in
multi-threaded use; introduce a module-level threading.Lock (e.g.,
_dump_state_lock) and guard all reads/writes/updates to those globals
(increments, size adds, and per-function counter updates) with with
_dump_state_lock: to ensure atomicity, and remove the unused
_session_jsonl_initialized variable if it is not referenced elsewhere in the
module.
- Around line 530-541: The output dumping branch that uses safetensors (guarded
by _DUMP_SAFETENSORS in the output_tensors block) lacks ImportError handling;
wrap the "from safetensors.torch import save_file" import in a try/except
ImportError and log a clear error via the same logger used elsewhere (or raise a
RuntimeError with a descriptive message) and fall back to the torch.save path if
desired; ensure the catch references the same symbols (save_file,
tensors_contiguous, dump_path) and mirrors the import-error handling used in
_dump_function_inputs so missing safetensors produces a clear message rather
than an unhandled exception.
🧹 Nitpick comments (8)
flashinfer/api_logging.py (8)

392-408: Improve safetensors import error handling.

When safetensors is not installed, the code logs an error and re-raises, but uses generic _logger.error instead of _logger.exception which would include the full traceback automatically.

As per static analysis:

             if _DUMP_SAFETENSORS:
                 # SafeTensors format: faster, no pickle, but loses stride/contiguity
                 try:
                     from safetensors.torch import save_file

                     # safetensors requires contiguous tensors
                     tensors_contiguous = {
                         k: v.contiguous() for k, v in input_tensors.items()
                     }
                     save_file(tensors_contiguous, str(dump_dir / "inputs.safetensors"))
                 except ImportError:
-                    _logger.error(
+                    _logger.exception(
                         "safetensors package not installed. "
                         "Install with: pip install safetensors"
                     )
                     raise

482-487: Use logging.exception for better error diagnostics.

The code manually calls traceback.format_exc() and logs it. Python's logging.exception() does this automatically and is the preferred pattern.

     except Exception as e:
-        _logger.error(f"Failed to dump function call {func_name}: {e}")
-        import traceback
-
-        _logger.error(traceback.format_exc())
+        _logger.exception(f"Failed to dump function call {func_name}: {e}")
         return None

Same pattern applies to other exception handlers throughout the file (lines 591-594, 969-972, etc.).


590-594: Use logging.exception instead of manual traceback formatting.

     except Exception as e:
-        _logger.error(f"Failed to dump outputs to {dump_dir}: {e}")
-        import traceback
-
-        _logger.error(traceback.format_exc())
+        _logger.exception(f"Failed to dump outputs to {dump_dir}: {e}")

626-643: Consider caching resolved functions.

The _resolve_function is called for each replay, which involves module import and attribute traversal. For replay sequences with many calls to the same function, consider caching the resolved function objects.

from functools import lru_cache

@lru_cache(maxsize=128)
def _resolve_function(module_name: str, function_name: str) -> Optional[Callable]:
    """Resolve a function from module name and function name (cached)."""
    try:
        module = importlib.import_module(module_name)
        # Handle nested function names (e.g. Class.method)
        parts = function_name.split(".")
        obj: Any = module
        for part in parts:
            obj = getattr(obj, part)
        if not callable(obj):
            return None
        return obj
    except Exception as e:
        _logger.warning(
            f"Could not resolve function {module_name}.{function_name}: {e}"
        )
        return None

Note: This caching is optional since resolution is typically fast, but could help in large replay sequences.

Based on coding guidelines, @functools.cache is preferred.


968-973: Use logging.exception for cleaner error logging.

             except Exception as e:
-                _logger.error(f"Execution failed: {e}")
-                import traceback
-
-                _logger.error(traceback.format_exc())
+                _logger.exception(f"Execution failed: {e}")
                 result_dict["execution_error"] = str(e)

750-784: Consider extracting tensor loading logic.

The tensor loading code (lines 765-788) is duplicated later in the function (lines 838-862) with slight variations. Consider extracting this into a helper function to reduce duplication.

def _load_tensors(dump_path: Path, tensor_type: str, device: str) -> Dict[str, torch.Tensor]:
    """Load tensors from dump directory (supports both .pt and .safetensors)."""
    pt_path = dump_path / f"{tensor_type}.pt"
    safetensors_path = dump_path / f"{tensor_type}.safetensors"
    
    if pt_path.exists():
        tensors = torch.load(str(pt_path), map_location="cpu")
    elif safetensors_path.exists():
        try:
            from safetensors.torch import load_file
            tensors = load_file(str(safetensors_path), device="cpu")
        except ImportError:
            raise ImportError(
                "Dump was saved with safetensors but package not installed. "
                "Install with: pip install safetensors"
            ) from None
    else:
        raise FileNotFoundError(
            f"Neither {tensor_type}.pt nor {tensor_type}.safetensors found in {dump_path}"
        )
    
    # Move to target device
    return {key: tensor.to(device) for key, tensor in tensors.items()}

Then use: input_tensors = _load_tensors(dump_path, "inputs", device)


1505-1515: Consider logging when class detection fails.

The try-except block silently passes when class name detection fails (line 1514-1515). While this is likely rare, logging at debug level could help diagnose issues.

             if args and hasattr(args[0], "__class__"):
                 try:
                     class_name = args[0].__class__.__name__
                     if "Wrapper" in class_name or class_name in [
                         "BatchMLAPagedAttentionWrapper"
                     ]:
                         func_name = f"{class_name}.{func_name}"
                         self_id = id(args[0])
                 except Exception:
-                    pass
+                    _logger.debug(f"Could not determine class name for {f.__name__}")

1556-1563: Use logging.exception for dump errors.

             if _API_LOG_LEVEL >= 10 and dump_dir:
                 try:
                     _dump_function_outputs(dump_dir, result)
                     _logger.info(f"Outputs dumped to: {dump_dir}")
                 except Exception as e:
-                    _logger.error(f"[DUMP ERROR (outputs) in {func_name}]: {e}")
+                    _logger.exception(f"[DUMP ERROR (outputs) in {func_name}]: {e}")

Same for line 1527.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7dac1c5 and 38db19f.

📒 Files selected for processing (1)
  • flashinfer/api_logging.py
🧰 Additional context used
📓 Path-based instructions (1)
flashinfer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

flashinfer/**/*.py: Use @functools.cache decorator on Python API functions to implement module-level caching and avoid recompilation
Use @flashinfer_api decorator for debugging API calls, enable via FLASHINFER_LOGLEVEL environment variable (0=off, 1=basic, 3=detailed, 5=with stats)

Files:

  • flashinfer/api_logging.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/**/*.py : Use `flashinfer_api` decorator for debugging API calls, enable via `FLASHINFER_LOGLEVEL` environment variable (0=off, 1=basic, 3=detailed, 5=with stats)
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/**/*.py : Use `flashinfer_api` decorator for debugging API calls, enable via `FLASHINFER_LOGLEVEL` environment variable (0=off, 1=basic, 3=detailed, 5=with stats)

Applied to files:

  • flashinfer/api_logging.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/**/*.py : Use `functools.cache` decorator on Python API functions to implement module-level caching and avoid recompilation

Applied to files:

  • flashinfer/api_logging.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Use `FLASHINFER_CUDA_ARCH_LIST` environment variable to specify target GPU architectures (e.g., '8.0 9.0a') and `FLASHINFER_NVCC_THREADS` to control parallel compilation threads

Applied to files:

  • flashinfer/api_logging.py
🪛 Ruff (0.14.10)
flashinfer/api_logging.py

262-262: Do not catch blind exception: Exception

(BLE001)


403-406: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


457-457: Do not catch blind exception: Exception

(BLE001)


482-482: Do not catch blind exception: Exception

(BLE001)


483-483: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


486-486: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


590-590: Do not catch blind exception: Exception

(BLE001)


591-591: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


594-594: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


637-637: Consider moving this statement to an else block

(TRY300)


638-638: Do not catch blind exception: Exception

(BLE001)


752-752: Avoid specifying long messages outside the exception class

(TRY003)


757-757: Avoid specifying long messages outside the exception class

(TRY003)


761-761: Avoid specifying long messages outside the exception class

(TRY003)


777-780: Avoid specifying long messages outside the exception class

(TRY003)


782-784: Avoid specifying long messages outside the exception class

(TRY003)


855-858: Avoid specifying long messages outside the exception class

(TRY003)


914-914: Consider moving this statement to an else block

(TRY300)


915-915: Do not catch blind exception: Exception

(BLE001)


916-916: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


968-968: Do not catch blind exception: Exception

(BLE001)


969-969: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


972-972: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


1008-1008: Avoid specifying long messages outside the exception class

(TRY003)


1044-1044: Do not catch blind exception: Exception

(BLE001)


1046-1046: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


1514-1515: try-except-pass detected, consider logging the exception

(S110)


1514-1514: Do not catch blind exception: Exception

(BLE001)


1526-1526: Do not catch blind exception: Exception

(BLE001)


1527-1527: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


1553-1553: Do not catch blind exception: Exception

(BLE001)


1554-1554: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


1561-1561: Do not catch blind exception: Exception

(BLE001)


1562-1562: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Deploy Docs
🔇 Additional comments (10)
flashinfer/api_logging.py (10)

114-136: LGTM: Comprehensive security warning for Level 10 dumping.

The warning appropriately alerts users about security implications of dumping tensors to disk and provides clear guidance on configuration options.


138-176: LGTM: Clear filtering logic with good documentation.

The include/exclude pattern filtering is well-documented and follows a sensible precedence: include patterns are checked first (if set), then exclude patterns.


230-267: LGTM: Robust serialization with appropriate truncation.

The serialization handles special types (torch.dtype, enum.Enum) and gracefully truncates long structures to prevent excessive metadata size. Good fallback to <not serializable> for problematic values.


597-624: LGTM: Robust value reconstruction.

Handles torch.dtype reconstruction correctly by parsing strings like "torch.bfloat16". Falls back gracefully when dtype cannot be resolved.


645-698: LGTM: Comprehensive result comparison logic.

The recursive comparison handles tensors (with numerical tolerances), collections, and fallback types. Good use of torch.allclose with configurable rtol and atol.


876-918: LGTM: Robust stateful replay with composite keys.

The use of (process_id, self_id) as composite keys prevents collisions in multi-GPU environments. The special handling for __init__ to instantiate objects is well-implemented.


987-1050: LGTM: Well-designed sequence replay with error resilience.

The function sorts dumps by timestamp and maintains a shared object registry across all replays. Error handling allows the sequence to continue even if individual replays fail, which is good for debugging.


1517-1528: Good crash-safe design with pre-execution dumps.

Level 10 dumps inputs before function execution, ensuring that debugging information is preserved even if the function crashes with CUDA errors or other failures.


1536-1540: LGTM: Effective level capping for logging.

Capping the effective logging level at 5 prevents Level 10's dump operations from interfering with the regular logging output format.


1125-1125: LGTM: Immediate security warning on module import.

Calling _warn_dump() at module load time ensures users are immediately aware when Level 10 dumping is enabled, before any sensitive data is processed.

Comment on lines 17 to 30
import enum
import fnmatch
import functools
import inspect
import json
import logging
import os
import sys
from typing import Any, Callable
from datetime import datetime
from pathlib import Path
from typing import Any, Callable, Dict, Tuple, Optional
import contextlib
import importlib
import torch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for threading usage in FlashInfer codebase
echo "=== Searching for threading patterns ==="
rg -n "threading|Thread\(|Pool\(|ThreadPool|concurrent\.futures" --type=py -g '!test*' -g '!*_test.py' | head -30

echo ""
echo "=== Searching for existing global state with locks ==="
rg -n "threading\.Lock|threading\.RLock|Lock()" --type=py | head -30

echo ""
echo "=== Checking for multiprocessing usage ==="
rg -n "multiprocessing|Process\(" --type=py -g '!test*' | head -20

Repository: flashinfer-ai/flashinfer

Length of output: 1453


🏁 Script executed:

#!/bin/bash
# Examine the api_logging.py file to understand global state
echo "=== Examining flashinfer/api_logging.py structure ==="
wc -l flashinfer/api_logging.py

echo ""
echo "=== Checking global variable declarations ==="
rg -n "^[A-Za-z_][A-Za-z0-9_]*\s*=" flashinfer/api_logging.py | head -50

echo ""
echo "=== Checking for any Level 10 references in codebase ==="
rg -n "Level.*10|level.*10" --type=py | head -20

Repository: flashinfer-ai/flashinfer

Length of output: 3200


🏁 Script executed:

#!/bin/bash
# Search for usage of api_logging module to understand threading context
echo "=== Searching for imports of api_logging ==="
rg -n "from flashinfer.*api_logging|import.*api_logging" --type=py | head -20

echo ""
echo "=== Checking if there are any test files with threading ==="
find . -name "*test*.py" -o -name "test_*.py" | xargs rg -l "threading|Thread|concurrent" 2>/dev/null | head -10

Repository: flashinfer-ai/flashinfer

Length of output: 809


🏁 Script executed:

#!/bin/bash
# Examine how global state is modified in api_logging.py
echo "=== Searching for modifications to _dump_count and related globals ==="
rg -n "_dump_count|_dump_total_size_bytes|_dump_call_counter|_session_jsonl_initialized" flashinfer/api_logging.py

echo ""
echo "=== Examining the _dump_api wrapper/decorator structure ==="
grep -n "def _dump_api\|def flashinfer_api" flashinfer/api_logging.py | head -5

echo ""
echo "=== Checking fused_moe/utils.py to see threading.local() pattern ==="
head -100 flashinfer/fused_moe/utils.py | tail -70

Repository: flashinfer-ai/flashinfer

Length of output: 3020


🏁 Script executed:

#!/bin/bash
# Check if any file writing happens without locks
echo "=== Searching for file writes in api_logging that might be concurrent ==="
rg -n "open\(|\.write\(|\.dump\(|jsonl" flashinfer/api_logging.py | head -40

Repository: flashinfer-ai/flashinfer

Length of output: 1844


🏁 Script executed:

#!/bin/bash
# Check test file to understand if threading is tested with Level 10
echo "=== Examining test_logging_replay.py for threading scenarios ==="
grep -n "thread\|Thread\|concurrent" tests/utils/test_logging_replay.py

echo ""
echo "=== Check the level10_environment fixture ==="
sed -n '47,70p' tests/utils/test_logging_replay.py

Repository: flashinfer-ai/flashinfer

Length of output: 1361


🏁 Script executed:

#!/bin/bash
# Search for all usages of _session_jsonl_initialized
rg -n "_session_jsonl_initialized" flashinfer/api_logging.py -B 2 -A 2

Repository: flashinfer-ai/flashinfer

Length of output: 294


🏁 Script executed:

#!/bin/bash
# Get more context around the _dump_api function where globals are modified
sed -n '341,475p' flashinfer/api_logging.py | head -80

Repository: flashinfer-ai/flashinfer

Length of output: 3395


Add thread-safety synchronization for Level 10 dumping.

The implementation uses unprotected module-level global state that is modified during runtime. Confirmed issues:

  1. Race conditions on global counters: _dump_count (line 350, 471) and _dump_total_size_bytes (line 381, 472, 580) are read and written without synchronization. Concurrent calls from multiple threads can corrupt counters and bypass size/count limits.

  2. Dictionary race condition: _dump_call_counter (lines 359-362) undergoes read-modify-write operations without synchronization, risking KeyError or incorrect call sequence numbers.

  3. Concurrent file writes: _append_to_jsonl() (line 189) opens files in append mode without locks. Concurrent calls can corrupt the JSONL formatting of session.jsonl and metadata.jsonl.

FlashInfer does use threading (ThreadPoolExecutor in artifacts.py). Add threading.Lock() to protect critical sections, or use threading.local() (already used in fused_moe/utils.py) if per-thread state is appropriate. Additionally, _session_jsonl_initialized (line 69) appears unused—verify and remove if not needed.

Comment on lines +66 to +69
_dump_count = 0
_dump_total_size_bytes = 0
_dump_call_counter = {} # Track call count per function
_session_jsonl_initialized = False # Track if session.jsonl header was written
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Review multi-process safety and thread safety of global state.

The global counters _dump_count, _dump_total_size_bytes, and _dump_call_counter are not protected by locks and may have race conditions in multi-threaded environments. While each process gets its own copy (by design), within a single process multiple threads could corrupt these counters.

Additionally, _session_jsonl_initialized is declared but never used in the code.

Consider these improvements
  1. If multi-threading is expected, protect global state with locks:
import threading

_dump_state_lock = threading.Lock()
_dump_count = 0
_dump_total_size_bytes = 0
_dump_call_counter = {}

Then wrap modifications:

with _dump_state_lock:
    _dump_count += 1
    _dump_total_size_bytes += input_size

  2. Remove unused _session_jsonl_initialized variable if not needed.
🤖 Prompt for AI Agents
In @flashinfer/api_logging.py around lines 66 - 69, The global counters
_dump_count, _dump_total_size_bytes, and _dump_call_counter are modified without
synchronization and can race in multi-threaded use; introduce a module-level
threading.Lock (e.g., _dump_state_lock) and guard all reads/writes/updates to
those globals (increments, size adds, and per-function counter updates) with
with _dump_state_lock: to ensure atomicity, and remove the unused
_session_jsonl_initialized variable if it is not referenced elsewhere in the
module.

Comment on lines +178 to +191
def _append_to_jsonl(filepath: Path, record: Dict[str, Any]) -> None:
    """
    Append a JSON record as a single line to a JSONL file.

    Parameters
    ----------
    filepath : Path
        Path to the JSONL file
    record : Dict[str, Any]
        Record to append (will be serialized as single-line JSON)
    """
    with open(filepath, "a") as f:
        f.write(json.dumps(record) + "\n")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's understand the file structure and find the relevant code
cd /repo && find . -type f -name "api_logging.py" | head -5

Repository: flashinfer-ai/flashinfer

Length of output: 123


🏁 Script executed:

# Check for threading/concurrent constructs in the codebase
rg -n "(threading|Thread|Pool|concurrent|lock|Lock)" --type=py | head -30

Repository: flashinfer-ai/flashinfer

Length of output: 2681


🏁 Script executed:

# Look at the full context of api_logging.py around the function
wc -l flashinfer/api_logging.py

Repository: flashinfer-ai/flashinfer

Length of output: 98


🏁 Script executed:

# Read the api_logging.py file to understand context and usage
cat -n flashinfer/api_logging.py | head -100

Repository: flashinfer-ai/flashinfer

Length of output: 4350


🏁 Script executed:

# Find all calls to _append_to_jsonl
rg -n "_append_to_jsonl" --type=py -B2 -A2

Repository: flashinfer-ai/flashinfer

Length of output: 1628


🏁 Script executed:

# Check for any @flashinfer_api or logging setup that might indicate concurrent usage
rg -n "@flashinfer_api|FLASHINFER_LOGLEVEL" --type=py -B2 -A2 | head -50

Repository: flashinfer-ai/flashinfer

Length of output: 3870


🏁 Script executed:

# See more context around session.jsonl usage and _DUMP_DIR initialization
cat -n flashinfer/api_logging.py | sed -n '430,490p'

Repository: flashinfer-ai/flashinfer

Length of output: 2674


🏁 Script executed:

# Check if session.jsonl has process ID substitution or if it's shared
cat -n flashinfer/api_logging.py | sed -n '46,52p'

Repository: flashinfer-ai/flashinfer

Length of output: 516


🏁 Script executed:

# Search for any locking mechanisms (fcntl, threading.Lock, etc.)
rg -n "(fcntl|flock|Lock|RLock|Semaphore|threading\.)" flashinfer/api_logging.py

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Check the actual dump directory initialization and whether it uses %i
cat -n flashinfer/api_logging.py | sed -n '200,250p'

Repository: flashinfer-ai/flashinfer

Length of output: 1898


🏁 Script executed:

# Look at the broader context of where _append_to_jsonl is called from
cat -n flashinfer/api_logging.py | sed -n '440,480p'

Repository: flashinfer-ai/flashinfer

Length of output: 1903


🏁 Script executed:

# See if there's a test or documentation about thread safety
rg -l "thread|Thread|concurrent" tests/ --type=py | xargs grep -l "logging\|api_logging" 2>/dev/null | head -5

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Look for documentation or decorators that describe thread safety
cat -n flashinfer/api_logging.py | sed -n '550,650p'

Repository: flashinfer-ai/flashinfer

Length of output: 4488


🏁 Script executed:

# Search for the @flashinfer_api decorator definition
rg -n "@flashinfer_api|def flashinfer_api" flashinfer/api_logging.py -A15

Repository: flashinfer-ai/flashinfer

Length of output: 1660


🏁 Script executed:

# Look at the full context of _dump_function_inputs to understand the context better
cat -n flashinfer/api_logging.py | sed -n '1,50p'

Repository: flashinfer-ai/flashinfer

Length of output: 1887


🏁 Script executed:

# Check if there are any comments about thread safety or limitations
rg -n "thread|Thread|safe|concurrent|race|lock" flashinfer/api_logging.py

Repository: flashinfer-ai/flashinfer

Length of output: 3025


🏁 Script executed:

# Look at how dump_dir is created - whether it uses process ID
cat -n flashinfer/api_logging.py | sed -n '330,380p'

Repository: flashinfer-ai/flashinfer

Length of output: 2236


Add file locking to _append_to_jsonl for thread/process safety.

The function appends to session.jsonl (lines 468 and 577) without synchronization. When multiple processes or threads write to this shared file, records may be interleaved or corrupted. Per-process metadata.jsonl files are safe due to unique paths, but the central session.jsonl requires protection.

Implement file locking using fcntl.flock (POSIX) or msvcrt.locking (Windows), or document that Level 10 logging is not suitable for multi-threaded/multi-process scenarios.

Comment on lines +341 to +388
    global _dump_count, _dump_total_size_bytes

    # Check include/exclude filters first (before any work is done)
    if not _should_dump_function(func_name):
        _logger.debug(
            f"Skipping dump for {func_name} (filtered by include/exclude patterns)"
        )
        return None

    if _dump_count >= _DUMP_MAX_COUNT:
        _logger.warning(
            f"Dump limit reached ({_DUMP_MAX_COUNT} dumps). Skipping dump for {func_name}. "
            f"Increase FLASHINFER_DUMP_MAX_COUNT if needed."
        )
        return None

    try:
        # Get call counter for this function
        if func_name not in _dump_call_counter:
            _dump_call_counter[func_name] = 0
        _dump_call_counter[func_name] += 1
        call_seq = _dump_call_counter[func_name]

        # Create dump directory structure
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")[
            :-3
        ]  # Include milliseconds
        pid = os.getpid()
        dump_name = f"{timestamp}_pid{pid}_{func_name}_call{call_seq:04d}"
        dump_dir = Path(_DUMP_DIR) / dump_name
        dump_dir.mkdir(parents=True, exist_ok=True)

        # Extract tensors and metadata from inputs
        input_tensors, input_metadata = _extract_tensors_and_metadata(args, kwargs)

        # Calculate input size
        input_size = sum(_get_tensor_size_bytes(t) for t in input_tensors.values())

        # Check size limit (conservative check - only inputs for now)
        max_size_bytes = _DUMP_MAX_SIZE_GB * 1024 * 1024 * 1024
        if _dump_total_size_bytes + input_size > max_size_bytes:
            _logger.warning(
                f"Dump size limit reached ({_DUMP_MAX_SIZE_GB} GB). Skipping dump for {func_name}. "
                f"Increase FLASHINFER_DUMP_MAX_SIZE_GB if needed."
            )
            # Clean up empty directory
            dump_dir.rmdir()
            return None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Race condition in dump limit checks.

The size and count limit checks (lines 350-356, 381-388) read and compare global state without synchronization. In multi-threaded scenarios, multiple threads could pass these checks simultaneously and exceed the limits.

Example scenario:

  • Thread A checks _dump_count < _DUMP_MAX_COUNT (999 < 1000) → passes
  • Thread B checks _dump_count < _DUMP_MAX_COUNT (999 < 1000) → passes
  • Both threads proceed to dump, resulting in 1001 total dumps

Consider:

  1. Using atomic operations or locks around limit checks and updates
  2. Accepting slight over-limit behavior and documenting it
  3. Documenting that Level 10 is not thread-safe

Related to earlier comment about global state thread safety.

Comment on lines +530 to +541
        if output_tensors:
            if _DUMP_SAFETENSORS:
                # SafeTensors format: faster, no pickle, but loses stride/contiguity
                from safetensors.torch import save_file

                tensors_contiguous = {
                    k: v.contiguous() for k, v in output_tensors.items()
                }
                save_file(tensors_contiguous, str(dump_path / "outputs.safetensors"))
            else:
                # torch.save format: preserves stride/contiguity
                torch.save(output_tensors, dump_path / "outputs.pt")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Missing ImportError handling for safetensors in output dump.

Unlike _dump_function_inputs (lines 402-407), this code doesn't catch ImportError when importing safetensors. If the package is missing, an unhandled exception will be raised instead of a clear error message.

Add consistent error handling
         if output_tensors:
             if _DUMP_SAFETENSORS:
                 # SafeTensors format: faster, no pickle, but loses stride/contiguity
-                from safetensors.torch import save_file
-
-                tensors_contiguous = {
-                    k: v.contiguous() for k, v in output_tensors.items()
-                }
-                save_file(tensors_contiguous, str(dump_path / "outputs.safetensors"))
+                try:
+                    from safetensors.torch import save_file
+
+                    tensors_contiguous = {
+                        k: v.contiguous() for k, v in output_tensors.items()
+                    }
+                    save_file(tensors_contiguous, str(dump_path / "outputs.safetensors"))
+                except ImportError:
+                    _logger.exception(
+                        "safetensors package not installed. "
+                        "Install with: pip install safetensors"
+                    )
+                    raise
             else:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
        if output_tensors:
            if _DUMP_SAFETENSORS:
                # SafeTensors format: faster, no pickle, but loses stride/contiguity
                from safetensors.torch import save_file

                tensors_contiguous = {
                    k: v.contiguous() for k, v in output_tensors.items()
                }
                save_file(tensors_contiguous, str(dump_path / "outputs.safetensors"))
            else:
                # torch.save format: preserves stride/contiguity
                torch.save(output_tensors, dump_path / "outputs.pt")
        if output_tensors:
            if _DUMP_SAFETENSORS:
                # SafeTensors format: faster, no pickle, but loses stride/contiguity
                try:
                    from safetensors.torch import save_file

                    tensors_contiguous = {
                        k: v.contiguous() for k, v in output_tensors.items()
                    }
                    save_file(
                        tensors_contiguous, str(dump_path / "outputs.safetensors")
                    )
                except ImportError:
                    _logger.exception(
                        "safetensors package not installed. "
                        "Install with: pip install safetensors"
                    )
                    raise
            else:
                # torch.save format: preserves stride/contiguity
                torch.save(output_tensors, dump_path / "outputs.pt")
🤖 Prompt for AI Agents
In @flashinfer/api_logging.py around lines 530 - 541, The output dumping branch
that uses safetensors (guarded by _DUMP_SAFETENSORS in the output_tensors block)
lacks ImportError handling; wrap the "from safetensors.torch import save_file"
import in a try/except ImportError and log a clear error via the same logger
used elsewhere (or raise a RuntimeError with a descriptive message) and fall
back to the torch.save path if desired; ensure the catch references the same
symbols (save_file, tensors_contiguous, dump_path) and mirrors the import-error
handling used in _dump_function_inputs so missing safetensors produces a clear
message rather than an unhandled exception.

@bkryu
Copy link
Collaborator Author

bkryu commented Jan 8, 2026

/bot run

@flashinfer-bot
Copy link
Collaborator

GitLab MR !229 has been updated with latest changes, and the CI pipeline #41372000 is currently running. I'll report back once the pipeline job completes.

@sricketts sricketts removed this from the 2025 Dec milestone Jan 8, 2026
@bkryu bkryu force-pushed the api_logging_with_dump branch from 38db19f to 1505266 Compare January 9, 2026 00:06
@bkryu
Copy link
Collaborator Author

bkryu commented Jan 9, 2026

/bot stop

@flashinfer-bot
Copy link
Collaborator

The GitLab CI pipeline #41372000 has been cancelled.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
flashinfer/api_logging.py (2)

307-488: Input dumping implementation is solid.

The function correctly implements crash-safe dumping with proper checks:

  • Early filtering and limit checks before I/O
  • Size validation before saving
  • Comprehensive metadata collection
  • Proper error handling

One optional improvement: Consider using _logger.exception() instead of _logger.error() at lines 403 and 483 to automatically include stack traces.

📝 Optional: Use logging.exception for better diagnostics
                 except ImportError:
-                    _logger.error(
+                    _logger.exception(
                         "safetensors package not installed. "
                         "Install with: pip install safetensors"
                     )

And similarly at line 483:

     except Exception as e:
-        _logger.error(f"Failed to dump function call {func_name}: {e}")
+        _logger.exception(f"Failed to dump function call {func_name}: {e}")
         import traceback
-
-        _logger.error(traceback.format_exc())

Note: When using logging.exception(), the explicit traceback logging becomes redundant.
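The redundancy is easy to demonstrate with nothing but the stdlib `logging` API; `logger.exception()` inside an `except` block already formats and appends the active traceback:

```python
import io
import logging

# Capture log output in memory so the behavior is easy to inspect.
stream = io.StringIO()
logger = logging.getLogger("flashinfer.demo")
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.ERROR)

try:
    raise ValueError("boom")
except ValueError:
    # exception() logs at ERROR level AND appends the current traceback,
    # so a separate traceback.format_exc() call adds nothing.
    logger.exception("Failed to dump function call")

output = stream.getvalue()
print("Traceback" in output)  # True
```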


490-595: Output dumping correctly complements input dumping.

The function properly appends outputs to existing dumps and updates both per-dump and session metadata. The crash-safe design (inputs first, outputs after) is correct.

Same optional suggestion as above: consider using _logger.exception() for better diagnostics (lines 591, 594).

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 38db19f and 1505266.

📒 Files selected for processing (5)
  • benchmarks/routines/gemm.py
  • docs/logging.rst
  • flashinfer/__main__.py
  • flashinfer/api_logging.py
  • tests/utils/test_logging_replay.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • benchmarks/routines/gemm.py
🧰 Additional context used
📓 Path-based instructions (2)
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Test implementations should use flashinfer.utils functions (get_compute_capability, is_sm90a_supported, is_sm100a_supported, etc.) to skip tests on unsupported GPU architectures
For testing with mpirun on multi-GPU systems, use the pattern: mpirun -np <num_gpus> pytest tests/path/to/test.py::test_function
Avoid OOM (out-of-memory) errors in tests by using appropriate problem sizes - tests/conftest.py provides auto-skipping for OOM tests as a safety net but should not be relied upon

Files:

  • tests/utils/test_logging_replay.py
flashinfer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

flashinfer/**/*.py: Use @functools.cache decorator on Python API functions to implement module-level caching and avoid recompilation
Use @flashinfer_api decorator for debugging API calls, enable via FLASHINFER_LOGLEVEL environment variable (0=off, 1=basic, 3=detailed, 5=with stats)

Files:

  • flashinfer/__main__.py
  • flashinfer/api_logging.py
🧠 Learnings (6)
📓 Common learnings
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/**/*.py : Use `flashinfer_api` decorator for debugging API calls, enable via `FLASHINFER_LOGLEVEL` environment variable (0=off, 1=basic, 3=detailed, 5=with stats)
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to tests/**/*.py : Test implementations should use `flashinfer.utils` functions (`get_compute_capability`, `is_sm90a_supported`, `is_sm100a_supported`, etc.) to skip tests on unsupported GPU architectures

Applied to files:

  • tests/utils/test_logging_replay.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/__init__.py : Export new operations in `flashinfer/__init__.py` to make them available as public API

Applied to files:

  • flashinfer/__main__.py
  • docs/logging.rst
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/**/*.py : Use `flashinfer_api` decorator for debugging API calls, enable via `FLASHINFER_LOGLEVEL` environment variable (0=off, 1=basic, 3=detailed, 5=with stats)

Applied to files:

  • flashinfer/__main__.py
  • docs/logging.rst
  • flashinfer/api_logging.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Use `FLASHINFER_CUDA_ARCH_LIST` environment variable to specify target GPU architectures (e.g., '8.0 9.0a') and `FLASHINFER_NVCC_THREADS` to control parallel compilation threads

Applied to files:

  • docs/logging.rst
  • flashinfer/api_logging.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/**/*.py : Use `functools.cache` decorator on Python API functions to implement module-level caching and avoid recompilation

Applied to files:

  • flashinfer/api_logging.py
🧬 Code graph analysis (2)
tests/utils/test_logging_replay.py (4)
flashinfer/utils.py (1)
  • get_compute_capability (258-261)
benchmarks/routines/gemm.py (1)
  • to_float8 (164-170)
flashinfer/api_logging.py (2)
  • replay_sequence (987-1049)
  • replay_from_dump (700-984)
flashinfer/__main__.py (1)
  • cli (68-73)
flashinfer/__main__.py (1)
flashinfer/api_logging.py (2)
  • replay_sequence (987-1049)
  • replay_from_dump (700-984)
🪛 Ruff (0.14.10)
flashinfer/__main__.py

455-455: Do not catch blind exception: Exception

(BLE001)

flashinfer/api_logging.py

262-262: Do not catch blind exception: Exception

(BLE001)


403-406: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


457-457: Do not catch blind exception: Exception

(BLE001)


482-482: Do not catch blind exception: Exception

(BLE001)


483-483: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


486-486: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


590-590: Do not catch blind exception: Exception

(BLE001)


591-591: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


594-594: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


637-637: Consider moving this statement to an else block

(TRY300)


638-638: Do not catch blind exception: Exception

(BLE001)


752-752: Avoid specifying long messages outside the exception class

(TRY003)


757-757: Avoid specifying long messages outside the exception class

(TRY003)


761-761: Avoid specifying long messages outside the exception class

(TRY003)


777-780: Avoid specifying long messages outside the exception class

(TRY003)


782-784: Avoid specifying long messages outside the exception class

(TRY003)


855-858: Avoid specifying long messages outside the exception class

(TRY003)


914-914: Consider moving this statement to an else block

(TRY300)


915-915: Do not catch blind exception: Exception

(BLE001)


916-916: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


968-968: Do not catch blind exception: Exception

(BLE001)


969-969: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


972-972: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


1008-1008: Avoid specifying long messages outside the exception class

(TRY003)


1044-1044: Do not catch blind exception: Exception

(BLE001)


1046-1046: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


1514-1515: try-except-pass detected, consider logging the exception

(S110)


1514-1514: Do not catch blind exception: Exception

(BLE001)


1526-1526: Do not catch blind exception: Exception

(BLE001)


1527-1527: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


1553-1553: Do not catch blind exception: Exception

(BLE001)


1554-1554: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


1561-1561: Do not catch blind exception: Exception

(BLE001)


1562-1562: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Deploy Docs
🔇 Additional comments (14)
docs/logging.rst (1)

15-15: Documentation change looks good.

The range update from 0-5 to 0-10 correctly reflects the new Level 10 Flight Recorder feature.

flashinfer/api_logging.py (7)

50-69: Configuration setup looks good.

The environment-driven configuration with sensible defaults (20GB, 1000 dumps) provides good balance between usability and safety. Global tracking variables are appropriately initialized at module level.


114-136: Security warning is a good practice.

The warning clearly communicates the security implications of Level 10 logging and displays the current configuration, which helps users understand what will be dumped.


138-176: Filter logic is clean and well-documented.

The include-then-exclude filtering logic is straightforward and correctly implements the documented behavior. Early filtering (before any I/O) is efficient.


178-223: JSONL helpers are simple and correct.

The crash-safe JSONL format (one record per line) is a good choice for logging that survives crashes. Implementation is straightforward.


230-267: Value serialization is robust.

The function handles various Python types appropriately. The blind exception handler at line 262 is acceptable here since this is a serialization helper that should be maximally robust and return something rather than crash.


700-984: Replay implementation is well-designed.

Key strengths:

  • Composite registry key (process_id, self_id) correctly prevents object collisions in multi-process environments
  • Auto-detection of tensor format (torch.save vs safetensors) makes replay format-agnostic
  • Proper handling of stateful wrapper classes (__init__, plan, run methods)
  • Comprehensive argument reconstruction

The stateful replay logic (lines 876-934) correctly handles constructor calls and method calls with proper object registry management.
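The composite-key idea can be sketched in isolation; `FakeWrapper`, `replay_call`, and `_object_registry` below are hypothetical stand-ins, not the actual api_logging.py code:

```python
from typing import Any, Dict, Tuple

# Hypothetical registry keyed by (process_id, self_id); only the composite
# keying mirrors the replay design discussed above.
_object_registry: Dict[Tuple[int, int], Any] = {}


class FakeWrapper:
    """Stand-in for a stateful wrapper with plan/run-style methods."""

    def __init__(self, workspace: int) -> None:
        self.workspace = workspace
        self.planned = False

    def plan(self) -> None:
        self.planned = True


def replay_call(process_id: int, self_id: int, method: str, *args: Any) -> Any:
    key = (process_id, self_id)  # composite key avoids cross-process id collisions
    if method == "__init__":
        # Constructor record: instantiate and register the object.
        obj = FakeWrapper(*args)
        _object_registry[key] = obj
        return obj
    # Method record: dispatch to the previously constructed object.
    return getattr(_object_registry[key], method)(*args)
```

Because `id()` values are only unique within one process, two dumps from different ranks can legitimately share a `self_id`; including the recorded `process_id` in the key keeps their replayed objects separate.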


1420-1570: Decorator integration is well-implemented.

The decorator correctly integrates Level 10 dumping while maintaining backward compatibility:

  • Inputs dumped BEFORE execution for crash safety
  • Outputs dumped AFTER successful execution
  • Logging capped at Level 5 for Level 10 to avoid redundant output
  • Proper self_id capture for stateful objects
  • Error handling ensures dump failures don't crash the API call

The warning at lines 1424-1426 appropriately flags this as experimental.

flashinfer/__main__.py (1)

386-457: CLI replay command is well-implemented.

The implementation provides good user experience:

  • Clear error messages with color coding
  • Auto-detection of single dump vs session
  • Human-readable output with function names
  • Summary statistics for sessions

The broad exception handler at line 455 is appropriate for a CLI command to provide friendly error messages rather than stack traces.

tests/utils/test_logging_replay.py (5)

39-88: Test fixture is well-designed.

The level10_environment fixture properly:

  • Saves and restores environment variables
  • Forces module reimport to pick up configuration changes
  • Provides isolated test environment with tmp_path
  • Cleans up after each test

This ensures tests don't interfere with each other.
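The save/restore pattern the fixture relies on is the standard one; as a context-manager sketch (illustrative, not the fixture itself):

```python
import contextlib
import os

@contextlib.contextmanager
def scoped_env(**overrides):
    """Set environment variables for the duration of a block and restore
    the previous values (or absence) afterwards."""
    saved = {k: os.environ.get(k) for k in overrides}
    os.environ.update({k: str(v) for k, v in overrides.items()})
    try:
        yield
    finally:
        for k, old in saved.items():
            if old is None:
                os.environ.pop(k, None)
            else:
                os.environ[k] = old
```

Combined with a forced module reimport, this guarantees each test sees only its own `FLASHINFER_*` configuration.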


157-194: GPU capability check is correct.

Lines 160-161 properly check compute capability before running mm_fp8 tests, following the coding guideline to use get_compute_capability for architecture-specific tests.
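The gating itself reduces to a tuple comparison; a minimal sketch, where the `(8, 9)` threshold is illustrative (FP8 matmul generally requires Ada/Hopper-class GPUs, but the exact requirement is the test's, not this sketch's):

```python
def meets_capability(cc, required=(8, 9)):
    """Return True if a (major, minor) compute capability meets the
    required minimum; tuple comparison handles major/minor ordering."""
    return tuple(cc) >= tuple(required)
```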


366-719: Filtering tests provide thorough coverage.

The test suite comprehensively validates dump filtering:

  • Include-only, exclude-only, and combined scenarios
  • Edge cases like no matches and multiple patterns
  • Proper environment isolation for each test

The consistent test structure makes them easy to maintain.
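The include/exclude semantics implied by `FLASHINFER_DUMP_INCLUDE` / `FLASHINFER_DUMP_EXCLUDE` can be sketched with stdlib `fnmatch` (the precise precedence is an assumption here: exclude wins over include, and an empty include list means "everything"):

```python
import fnmatch

def should_dump(func_name, include="", exclude=""):
    """Decide whether a call is dumped, given comma-separated fnmatch
    pattern lists from the two environment variables."""
    inc = [p.strip() for p in include.split(",") if p.strip()]
    exc = [p.strip() for p in exclude.split(",") if p.strip()]
    if any(fnmatch.fnmatch(func_name, p) for p in exc):
        return False                      # exclude takes precedence
    if not inc:
        return True                       # no include filter: dump everything
    return any(fnmatch.fnmatch(func_name, p) for p in inc)
```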


721-910: JSONL tests validate crash-safe logging.

The tests properly verify:

  • JSONL format (not JSON) is used
  • Two-line structure per call (inputs_saved, then completed)
  • Session accumulation across multiple API calls
  • Proper metadata fields in each record

This ensures the crash-safe logging design works correctly.
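The crash-safety property comes from JSONL being append-only with one self-contained JSON object per line; a minimal sketch:

```python
import json

def append_record(path, record):
    """Append one JSON object as a single line. Opened in append mode, so a
    crash mid-run loses at most the record currently being written."""
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def read_records(path):
    """Read back all complete records; each line parses independently."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```

This is why each call yields two lines (`inputs_saved`, then `completed`): if only the first is present, the run died during the API call, and the inputs are still on disk.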


912-1064: SafeTensors tests are well-structured.

Both tests appropriately:

  • Use pytest.importorskip to handle missing dependency
  • Verify correct file format based on configuration
  • Test replay auto-detection
  • Check metadata consistency

The mock dump approach in test_safetensors_replay_auto_detection efficiently tests format detection without executing a full API call.
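The soft-dependency pattern being tested can be sketched as a format chooser that falls back when safetensors is absent (names are illustrative):

```python
import importlib.util

def choose_tensor_format(prefer_safetensors):
    """Pick the dump format: safetensors only when requested AND installed,
    otherwise the torch.save default."""
    if prefer_safetensors and importlib.util.find_spec("safetensors") is not None:
        return "safetensors"
    return "torch_save"
```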

@bkryu
Collaborator Author

bkryu commented Jan 9, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !229 has been updated with latest changes, and the CI pipeline #41383969 is currently running. I'll report back once the pipeline job completes.



if __name__ == "__main__":
    pytest.main([__file__, "-v", "-s"])
Collaborator

Will add another test with positional arguments only, for flashinfer-bench workflow.

@flashinfer-bot
Collaborator

[SUCCESS] Pipeline #41383969: 18/20 passed

@bkryu
Collaborator Author

bkryu commented Jan 9, 2026

@yzh119, the PR seems ready. The original intent was to enable the dumps for debugging, but @yyihuang has expressed interest in the feature for flashinfer-bench and will follow up with a PR making further changes for it.

Can I ask you to review once more?

Collaborator

@yzh119 yzh119 left a comment

I'm good with the PR overall, this feature should be really helpful in terms of reproducing bugs and collecting traces!

@yzh119 yzh119 merged commit 0e9a89d into flashinfer-ai:main Jan 14, 2026
12 checks passed
@bkryu bkryu deleted the api_logging_with_dump branch January 15, 2026 18:06

6 participants