Reactor includes a profiler so you can instrument your code once and use it everywhere:
locally during optimization, in deployed infrastructure, and across code changes to track
regressions. It is a no-op when disabled, so instrumentation can stay in your code permanently at
zero cost.
Enabling Profiling
Pass --enable-profiling when starting your model:
reactor run --runtime http --enable-profiling
By default, profiling output is written to ./profiling. Change the directory with
--profiling-output-dir:
reactor run --runtime http --enable-profiling --profiling-output-dir ./my-profiling-data
| Flag | Description | Default |
|---|---|---|
| `--enable-profiling` | Enable file-based profiling output | Off |
| `--profiling-output-dir` | Directory for profiling output files | `./profiling` |
When profiling is not enabled, get_ctx().profiler() returns a NoOpProfiler whose methods are
all no-ops. There is zero runtime cost.
Instrumenting Your Code
Access the profiler through the session context and wrap code blocks with section():
from reactor_runtime import get_ctx

def start_session(self):
    while not get_ctx().should_stop():
        with get_ctx().profiler().section("inference"):
            with get_ctx().profiler().section("preprocess"):
                latent = self.preprocess(self.current_prompt)
            with get_ctx().profiler().section("model_forward"):
                output = self.model(latent)
            with get_ctx().profiler().section("vae_decode"):
                frames = self.vae.decode(output)
        get_ctx().emit_block(frames)
Nested Sections
Sections can be nested to any depth. Each measurement records the full path through the nesting
hierarchy:
with get_ctx().profiler().section("inference"):
with get_ctx().profiler().section("diffusion"):
with get_ctx().profiler().section("denoise_step"):
self.denoise(latent)
For the innermost section, this produces the following paths:

| Field | Value |
|---|---|
| `section_key` | `denoise_step` |
| `parent_path` | `inference/diffusion` |
| `full_path` | `inference/diffusion/denoise_step` |
The hierarchy makes it easy to see both the flat cost of individual operations and how they nest
within your pipeline.
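For example, once a run has been written to disk, the path fields make it easy to break a parent section down into its children. A minimal sketch, assuming the JSON layout described under Output Files below (the file name is illustrative):

```python
import json

# Load a recorded profiling run (illustrative path).
with open("profiling/profiling_1700000000000.json") as f:
    data = json.load(f)

sections = data["sections"]
parent = "inference"

# Children of "inference" are the entries whose parent_path equals it.
for full_path, stats in sections.items():
    if stats["parent_path"] == parent:
        share = stats["total_seconds"] / sections[parent]["total_seconds"]
        print(f"{full_path}: {share:.1%} of {parent}")
```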
The profile_fn Decorator
For functions you want to profile without adding a with block inside, use the profile_fn
decorator:
from reactor_runtime.profiling import profile_fn, CudaTimingMode

@profile_fn("vae_decode", cuda_timing=CudaTimingMode.EVENT)
def decode_latents(self, latents):
    return self.vae.decode(latents)
This is equivalent to wrapping the function body in
with get_ctx().profiler().section("vae_decode", ...). If no session is active (e.g., during
offline testing), the decorator falls back to calling the function directly with no overhead.
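For instance, a unit test can call the decorated method without setting up a session; the model class and helper below are illustrative:

```python
def test_decode_latents_offline():
    # No session is active here, so profile_fn just calls the wrapped function.
    model = MyVideoModel()            # illustrative model class
    latents = make_dummy_latents()    # illustrative test helper
    frames = model.decode_latents(latents)
    assert frames is not None
```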
CUDA Timing Modes
By default, sections are timed with time.perf_counter() (CPU wall-clock time). For GPU-bound
operations this can be inaccurate because GPU work is asynchronous. The profiler provides three
timing modes:
from reactor_runtime.profiling.profiler import CudaTimingMode
CudaTimingMode.NONE (default)
CPU timing only. Use for operations that do not touch the GPU.
with get_ctx().profiler().section("json_parsing"):
data = json.loads(response)
CudaTimingMode.EVENT
Records CUDA events at section entry and exit, then computes the elapsed GPU time when the
session ends. This avoids synchronization during the section itself, making it the most efficient
mode for single-stream GPU work.
with get_ctx().profiler().section("model_forward", cuda_timing=CudaTimingMode.EVENT):
output = self.model(input)
CudaTimingMode.SYNC
Performs a full torch.cuda.synchronize() at both entry and exit, then measures wall-clock time.
Gives accurate results when multiple CUDA streams are in use, at the cost of higher overhead.
with get_ctx().profiler().section("pipeline", cuda_timing=CudaTimingMode.SYNC):
with torch.cuda.stream(stream1):
result1 = op1(data)
with torch.cuda.stream(stream2):
result2 = op2(data)
Choosing a Mode
| Mode | Overhead | Accuracy | Best For |
|---|---|---|---|
| `NONE` | None | CPU only | Data loading, pre/post processing, CPU-bound work |
| `EVENT` | ~1% | Single-stream GPU | Most inference forward passes |
| `SYNC` | ~5-20% | Multi-stream GPU | Pipelined or overlapping GPU work |
If CUDA is not available (no GPU or PyTorch not installed), EVENT and SYNC automatically fall
back to CPU timing.
EVENT and SYNC modes capture torch.cuda.current_device() at section entry and only
instrument that device. If a section distributes work across multiple GPUs (e.g., DataParallel,
pipeline parallelism), only the current device’s portion is measured. For DDP, where each
process owns a single GPU, the profiler works correctly per-rank.
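If a single step does span several GPUs, one workaround is to give each device its own section so each gets its own event timing. A minimal sketch of a two-stage pipeline; the stage methods and section names are illustrative:

```python
import torch
from reactor_runtime import get_ctx
from reactor_runtime.profiling import CudaTimingMode

def pipeline_forward(self, latent):
    # Stage 0 runs on GPU 0; its CUDA events are recorded on that device.
    with torch.cuda.device(0):
        with get_ctx().profiler().section("stage0_forward", cuda_timing=CudaTimingMode.EVENT):
            hidden = self.stage0(latent.to("cuda:0"))

    # Stage 1 runs on GPU 1 and is timed as a separate section.
    with torch.cuda.device(1):
        with get_ctx().profiler().section("stage1_forward", cuda_timing=CudaTimingMode.EVENT):
            return self.stage1(hidden.to("cuda:1"))
```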
Output Files
When a session ends, the file backend flushes all accumulated timing data. For each session, the
following files are written:
profiling/
├── profiling_<timestamp>.json          # Raw timing data (all samples)
├── profiling_<timestamp>_summary.md    # Human-readable summary with statistics
├── profiling_<timestamp>_sections.csv  # Per-section stats as CSV
├── profiling_<timestamp>_meta.json     # Run metadata (git SHA, GPU info, etc.)
└── run_<timestamp>/                    # Bundle directory with copies of all artifacts
    ├── profiling_<timestamp>.json
    ├── profiling_<timestamp>_summary.md
    ├── profiling_<timestamp>_sections.csv
    └── profiling_<timestamp>_meta.json
The <timestamp> is a Unix timestamp in milliseconds.
The main JSON file contains all raw sample durations so you can compute any statistics you need:
{
  "timestamp": 1700000000.0,
  "timestamp_iso": "2023-11-14T22:13:20+0000",
  "total_sections": 4,
  "total_samples": 1200,
  "sections": {
    "inference": {
      "key": "inference",
      "parent_path": "",
      "samples": [0.0312, 0.0298, 0.0305, ...],
      "count": 300,
      "total_seconds": 9.15,
      "mean_seconds": 0.0305,
      "min_seconds": 0.0280,
      "max_seconds": 0.0412
    },
    "inference/model_forward": {
      "key": "model_forward",
      "parent_path": "inference",
      "samples": [0.0251, 0.0243, ...],
      ...
    }
  }
}
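Because the raw samples are preserved, you can compute statistics the built-in summary does not include. A minimal sketch using numpy; the file name is illustrative:

```python
import json
import numpy as np

with open("profiling/profiling_1700000000000.json") as f:
    data = json.load(f)

for full_path, stats in data["sections"].items():
    samples_ms = np.array(stats["samples"]) * 1000.0  # samples are recorded in seconds
    p50, p90, p99 = np.percentile(samples_ms, [50, 90, 99])
    print(f"{full_path}: p50={p50:.1f} ms  p90={p90:.1f} ms  p99={p99:.1f} ms")
```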
Summary Markdown
The _summary.md file gives you a quick overview with percentile statistics:
| Section | Count | Mean (ms) | P50 (ms) | P90 (ms) | P99 (ms) | Max (ms) |
|---|---|---|---|---|---|---|
| inference | 300 | 30.5 | 29.8 | 33.1 | 41.2 | 45.0 |
| inference/model_forward | 300 | 25.1 | 24.3 | 27.5 | 35.0 | 38.2 |
It also includes Top Hot Sections ranked by mean and P99 latency, and FPS Estimates for
sections whose names contain frame_pipeline or emit_block.
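To get an FPS estimate for your own generation loop, one option is to name the per-frame section accordingly; only the section name matters here, and the surrounding code is illustrative:

```python
# One iteration per emitted frame; the "frame_pipeline" name opts this
# section into the summary's FPS estimates.
with get_ctx().profiler().section("frame_pipeline"):
    frames = self.generate_next_frames()
    get_ctx().emit_block(frames)
```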
The _meta.json file captures the environment so you can reproduce and compare runs:
- Hostname, Python version
- Git SHA and branch (if in a git repo)
- PyTorch version, CUDA version, GPU device name
- Model name and runtime type (from environment variables)
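A quick way to check which environment produced a run is to read the meta file directly. The exact key names below are assumptions, so inspect your own `_meta.json` for the real ones:

```python
import json

with open("profiling/profiling_1700000000000_meta.json") as f:
    meta = json.load(f)

# Key names are illustrative -- check your own _meta.json output.
print(meta.get("git_sha"), meta.get("gpu_name"), meta.get("torch_version"))
```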
Visualizing Results
The runtime ships with a plotting script that reads profiling JSON files and produces charts.
It requires matplotlib and numpy:
pip install matplotlib numpy
Basic Usage
Plot a single session:
python -m reactor_runtime.profiling.plotting.plot_profiling ./profiling/profiling_1700000000000.json
Plot all sessions in a directory:
python -m reactor_runtime.profiling.plotting.plot_profiling ./profiling/
Save plots to a directory instead of displaying interactively:
python -m reactor_runtime.profiling.plotting.plot_profiling ./profiling/ --output ./plots/
This generates two charts per session:
- Box plot — shows the distribution (median, quartiles, outliers) of each section
- Histogram — shows the frequency distribution for each section with mean, min, max, and
sample count
Text Summary Only
If you do not need plots, print a text summary to the terminal:
python -m reactor_runtime.profiling.plotting.plot_profiling ./profiling/ --summary-only
Filtering Sections
Focus on specific parts of your pipeline:
# Only sections under inference/
python -m reactor_runtime.profiling.plotting.plot_profiling ./profiling/ --include "inference/"
# Exclude IO sections
python -m reactor_runtime.profiling.plotting.plot_profiling ./profiling/ --exclude "io/"
Output Formats
Plots default to PNG. You can also export as PDF or SVG:
python -m reactor_runtime.profiling.plotting.plot_profiling ./profiling/ --output ./plots/ --format svg
Aggregating Multiple Sessions
When you have profiling data from multiple sessions, use --aggregate to merge them into a single
combined report:
python -m reactor_runtime.profiling.plotting.plot_profiling ./profiling/ --aggregate --output ./plots/
This produces:
- `profiling_aggregate_summary.md` — combined statistics across all sessions
- `profiling_aggregate_sections.csv` — combined per-section stats as CSV
- `profiling_aggregate_boxplot.png` — box plot of combined data
- `profiling_aggregate_histograms.png` — histograms of combined data
Skipping Warmup Samples
The first few iterations of a model are often slower due to CUDA kernel compilation, memory
allocation, and cache warmup. Use --steady-state-skip to exclude the first N samples per
section:
python -m reactor_runtime.profiling.plotting.plot_profiling ./profiling/ --aggregate --steady-state-skip 10
Saving a Baseline
Save aggregate results as a baseline for future comparisons:
python -m reactor_runtime.profiling.plotting.plot_profiling ./profiling/ \
--aggregate --save-baseline ./baselines/
This creates a baselines/baseline_<timestamp>/ directory with all aggregate artifacts and
metadata.
Comparing Runs
Compare current profiling data against a saved baseline to detect performance regressions:
python -m reactor_runtime.profiling.plotting.plot_profiling ./profiling/ \
--aggregate --compare-to ./baselines/baseline_1700000000000/
This generates a profiling_comparison.md report that shows:
- Regressions — sections whose mean or P99 latency increased beyond the threshold
- Full comparison table — baseline vs. current for every section, with delta and percentage
Regression Threshold
By default, sections with >10% increase are flagged as regressions. Adjust with
--regression-threshold:
python -m reactor_runtime.profiling.plotting.plot_profiling ./profiling/ \
--aggregate --compare-to ./baselines/ --regression-threshold 5.0
Top Regressions
Show only the N worst regressions:
python -m reactor_runtime.profiling.plotting.plot_profiling ./profiling/ \
--aggregate --compare-to ./baselines/ --top-regressions 5
Visualization CLI Reference
python -m reactor_runtime.profiling.plotting.plot_profiling <path> [options]
| Flag | Description | Default |
|---|---|---|
| `path` | Path to a profiling JSON file or directory | Required |
| `--output`, `-o` | Output directory for plots | Interactive display |
| `--format`, `-f` | Output format: `png`, `pdf`, `svg` | `png` |
| `--no-box` | Skip box plot generation | Off |
| `--no-hist` | Skip histogram generation | Off |
| `--summary-only` | Print text summary only, no plots | Off |
| `--aggregate` | Aggregate multiple files into one report | Off |
| `--steady-state-skip` | Skip first N warmup samples per section | 0 |
| `--compare-to` | Baseline path for regression comparison | None |
| `--regression-threshold` | Percentage threshold for regression warnings | 10.0 |
| `--top-regressions` | Show only top N regressions | All |
| `--save-baseline` | Save aggregate results as baseline in this directory | None |
| `--include` | Only include sections matching this prefix | All |
| `--exclude` | Exclude sections matching this prefix | None |
| `--title` | Custom title for plots | Auto-generated |
| `--tag` | Tag to prefix to plot titles (e.g., PR-123) | None |
Thread Safety
The profiler uses a thread-local stack for tracking nested sections, so each thread maintains its
own section hierarchy independently. You can safely call profiler().section() from different
threads without interference.
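For example, a background worker can keep its own sections alongside the main session loop. A minimal sketch, assuming the session context is reachable from worker threads (the queue and encoder are illustrative):

```python
import threading
from reactor_runtime import get_ctx

def encode_worker(encoder, work_queue):
    # This thread maintains its own section stack, independent of the main loop.
    while not get_ctx().should_stop():
        with get_ctx().profiler().section("encode"):
            block = work_queue.get()   # illustrative work queue
            encoder.encode(block)      # illustrative encoder

# Typically started from the session, e.g. inside start_session():
# threading.Thread(target=encode_worker, args=(self.encoder, self.queue), daemon=True).start()
```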