Reactor includes a profiler so you can instrument your code once and use it everywhere:
locally during optimization, in deployed infrastructure, and across code changes to track
regressions. It is a no-op when disabled, so instrumentation can stay in your code permanently at
zero cost.
Enabling Profiling
Pass --enable-profiling when starting your model:
reactor run --runtime http --enable-profiling
By default, profiling output is written to ./profiling. Change the directory with
--profiling-output-dir:
reactor run --runtime http --enable-profiling --profiling-output-dir ./my-profiling-data
| Flag | Description | Default |
|---|---|---|
| `--enable-profiling` | Enable file-based profiling output | Off |
| `--profiling-output-dir` | Directory for profiling output files | `./profiling` |
When profiling is not enabled, get_ctx().profiler() returns a NoOpProfiler whose methods are
all no-ops. There is zero runtime cost.
Instrumenting Your Code
Access the profiler through the session context and wrap code blocks with section():
from reactor_runtime import get_ctx

def start_session(self):
    while not get_ctx().should_stop():
        with get_ctx().profiler().section("inference"):
            with get_ctx().profiler().section("preprocess"):
                latent = self.preprocess(self.current_prompt)
            with get_ctx().profiler().section("model_forward"):
                output = self.model(latent)
            with get_ctx().profiler().section("vae_decode"):
                frames = self.vae.decode(output)
        get_ctx().emit_block(frames)
Nested Sections
Sections can be nested to any depth. Each measurement records the full path through the nesting
hierarchy:
with get_ctx().profiler().section("inference"):
with get_ctx().profiler().section("diffusion"):
with get_ctx().profiler().section("denoise_step"):
self.denoise(latent)
For the innermost section, this produces the following paths:

| Field | Value |
|---|---|
| `section_key` | `denoise_step` |
| `parent_path` | `inference/diffusion` |
| `full_path` | `inference/diffusion/denoise_step` |
The hierarchy makes it easy to see both the flat cost of individual operations and how they nest
within your pipeline.
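For example, once a run has been written to disk, the path fields make it easy to break a parent section down into its children. A minimal sketch, assuming the JSON layout described under Output Files below (the file name is illustrative):

```python
import json

# Load a recorded profiling run (illustrative path).
with open("profiling/profiling_1700000000000.json") as f:
    data = json.load(f)

sections = data["sections"]
parent = "inference"

# Children of "inference" are the entries whose parent_path equals it.
for full_path, stats in sections.items():
    if stats["parent_path"] == parent:
        share = stats["total_seconds"] / sections[parent]["total_seconds"]
        print(f"{full_path}: {share:.1%} of {parent}")
```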
The profile_fn Decorator
For functions you want to profile without adding a with block inside, use the profile_fn
decorator:
from reactor_runtime.profiling import profile_fn, CudaTimingMode

@profile_fn("vae_decode", cuda_timing=CudaTimingMode.EVENT)
def decode_latents(self, latents):
    return self.vae.decode(latents)
This is equivalent to wrapping the function body in
with get_ctx().profiler().section("vae_decode", ...). If no session is active (e.g., during
offline testing), the decorator falls back to calling the function directly with no overhead.
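For instance, a unit test can call the decorated method without setting up a session; the model class and helper below are illustrative:

```python
def test_decode_latents_offline():
    # No session is active here, so profile_fn just calls the wrapped function.
    model = MyVideoModel()            # illustrative model class
    latents = make_dummy_latents()    # illustrative test helper
    frames = model.decode_latents(latents)
    assert frames is not None
```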
CUDA Timing Modes
By default, sections are timed with time.perf_counter() (CPU wall-clock time). For GPU-bound
operations this can be inaccurate because GPU work is asynchronous. The profiler provides three
timing modes:
from reactor_runtime.profiling.profiler import CudaTimingMode
CudaTimingMode.NONE (default)
CPU timing only. Use for operations that do not touch the GPU.
with get_ctx().profiler().section("json_parsing"):
data = json.loads(response)
CudaTimingMode.EVENT
Records CUDA events at section entry and exit, then computes the elapsed GPU time when the
session ends. This avoids synchronization during the section itself, making it the most efficient
mode for single-stream GPU work.
with get_ctx().profiler().section("model_forward", cuda_timing=CudaTimingMode.EVENT):
output = self.model(input)
CudaTimingMode.SYNC
Performs a full torch.cuda.synchronize() at both entry and exit, then measures wall-clock time.
Gives accurate results when multiple CUDA streams are in use, at the cost of higher overhead.
with get_ctx().profiler().section("pipeline", cuda_timing=CudaTimingMode.SYNC):
with torch.cuda.stream(stream1):
result1 = op1(data)
with torch.cuda.stream(stream2):
result2 = op2(data)
Choosing a Mode
| Mode | Overhead | Accuracy | Best For |
|---|---|---|---|
| `NONE` | None | CPU only | Data loading, pre/post processing, CPU-bound work |
| `EVENT` | ~1% | Single-stream GPU | Most inference forward passes |
| `SYNC` | ~5-20% | Multi-stream GPU | Pipelined or overlapping GPU work |
If CUDA is not available (no GPU or PyTorch not installed), EVENT and SYNC automatically fall
back to CPU timing.
EVENT and SYNC modes capture torch.cuda.current_device() at section entry and only
instrument that device. If a section distributes work across multiple GPUs (e.g., DataParallel,
pipeline parallelism), only the current device’s portion is measured. For DDP, where each
process owns a single GPU, the profiler works correctly per-rank.
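If a single step does span several GPUs, one workaround is to give each device its own section so each gets its own event timing. A minimal sketch of a two-stage pipeline; the stage methods and section names are illustrative:

```python
import torch
from reactor_runtime import get_ctx
from reactor_runtime.profiling import CudaTimingMode

def pipeline_forward(self, latent):
    # Stage 0 runs on GPU 0; its CUDA events are recorded on that device.
    with torch.cuda.device(0):
        with get_ctx().profiler().section("stage0_forward", cuda_timing=CudaTimingMode.EVENT):
            hidden = self.stage0(latent.to("cuda:0"))

    # Stage 1 runs on GPU 1 and is timed as a separate section.
    with torch.cuda.device(1):
        with get_ctx().profiler().section("stage1_forward", cuda_timing=CudaTimingMode.EVENT):
            return self.stage1(hidden.to("cuda:1"))
```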
Output Files
When a session ends, the file backend flushes all accumulated timing data. For each session, the
following files are written:
profiling/
├── profiling_<timestamp>.json          # Raw timing data (all samples)
├── profiling_<timestamp>_summary.md    # Human-readable summary with statistics
├── profiling_<timestamp>_sections.csv  # Per-section stats as CSV
├── profiling_<timestamp>_meta.json     # Run metadata (git SHA, GPU info, etc.)
└── run_<timestamp>/                    # Bundle directory with copies of all artifacts
    ├── profiling_<timestamp>.json
    ├── profiling_<timestamp>_summary.md
    ├── profiling_<timestamp>_sections.csv
    └── profiling_<timestamp>_meta.json
The <timestamp> is a Unix timestamp in milliseconds.
The main JSON file contains all raw sample durations so you can compute any statistics you need:
{
  "timestamp": 1700000000.0,
  "timestamp_iso": "2023-11-14T22:13:20+0000",
  "total_sections": 4,
  "total_samples": 1200,
  "sections": {
    "inference": {
      "key": "inference",
      "parent_path": "",
      "samples": [0.0312, 0.0298, 0.0305, ...],
      "count": 300,
      "total_seconds": 9.15,
      "mean_seconds": 0.0305,
      "min_seconds": 0.0280,
      "max_seconds": 0.0412
    },
    "inference/model_forward": {
      "key": "model_forward",
      "parent_path": "inference",
      "samples": [0.0251, 0.0243, ...],
      ...
    }
  }
}
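Because the raw samples are preserved, you can compute statistics the built-in summary does not include. A minimal sketch using numpy; the file name is illustrative:

```python
import json
import numpy as np

with open("profiling/profiling_1700000000000.json") as f:
    data = json.load(f)

for full_path, stats in data["sections"].items():
    samples_ms = np.array(stats["samples"]) * 1000.0  # samples are recorded in seconds
    p50, p90, p99 = np.percentile(samples_ms, [50, 90, 99])
    print(f"{full_path}: p50={p50:.1f} ms  p90={p90:.1f} ms  p99={p99:.1f} ms")
```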
Summary Markdown
The _summary.md file gives you a quick overview with percentile statistics:
| Section | Count | Mean (ms) | P50 (ms) | P90 (ms) | P99 (ms) | Max (ms) |
|---|---|---|---|---|---|---|
| inference | 300 | 30.5 | 29.8 | 33.1 | 41.2 | 45.0 |
| inference/model_forward | 300 | 25.1 | 24.3 | 27.5 | 35.0 | 38.2 |
It also includes Top Hot Sections ranked by mean and P99 latency, and FPS Estimates for
sections whose names contain frame_pipeline or emit_block.
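To get an FPS estimate for your own generation loop, one option is to name the per-frame section accordingly; only the section name matters here, and the surrounding code is illustrative:

```python
# One iteration per emitted frame; the "frame_pipeline" name opts this
# section into the summary's FPS estimates.
with get_ctx().profiler().section("frame_pipeline"):
    frames = self.generate_next_frames()
    get_ctx().emit_block(frames)
```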
The _meta.json file captures the environment so you can reproduce and compare runs:
- Hostname, Python version
- Git SHA and branch (if in a git repo)
- PyTorch version, CUDA version, GPU device name
- Model name and runtime type (from environment variables)
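A quick way to check which environment produced a run is to read the meta file directly. The exact key names below are assumptions, so inspect your own `_meta.json` for the real ones:

```python
import json

with open("profiling/profiling_1700000000000_meta.json") as f:
    meta = json.load(f)

# Key names are illustrative -- check your own _meta.json output.
print(meta.get("git_sha"), meta.get("gpu_name"), meta.get("torch_version"))
```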
Visualizing Results
The runtime ships with a plotting script that reads profiling JSON files and produces charts.
It requires matplotlib and numpy:
pip install matplotlib numpy
Basic Usage
Plot a single session:
python -m reactor_runtime.profiling.plotting.plot_profiling ./profiling/profiling_1700000000000.json
Plot all sessions in a directory:
python -m reactor_runtime.profiling.plotting.plot_profiling ./profiling/
Save plots to a directory instead of displaying interactively:
python -m reactor_runtime.profiling.plotting.plot_profiling ./profiling/ --output ./plots/
This generates two charts per session:
- Box plot — shows the distribution (median, quartiles, outliers) of each section
- Histogram — shows the frequency distribution for each section with mean, min, max, and
sample count
Text Summary Only
If you do not need plots, print a text summary to the terminal:
python -m reactor_runtime.profiling.plotting.plot_profiling ./profiling/ --summary-only
Filtering Sections
Focus on specific parts of your pipeline:
# Only sections under inference/
python -m reactor_runtime.profiling.plotting.plot_profiling ./profiling/ --include "inference/"
# Exclude IO sections
python -m reactor_runtime.profiling.plotting.plot_profiling ./profiling/ --exclude "io/"
Output Formats
Plots default to PNG. You can also export as PDF or SVG:
python -m reactor_runtime.profiling.plotting.plot_profiling ./profiling/ --output ./plots/ --format svg
Aggregating Multiple Sessions
When you have profiling data from multiple sessions, use --aggregate to merge them into a single
combined report:
python -m reactor_runtime.profiling.plotting.plot_profiling ./profiling/ --aggregate --output ./plots/
This produces:
- `profiling_aggregate_summary.md` — combined statistics across all sessions
- `profiling_aggregate_sections.csv` — combined per-section stats as CSV
- `profiling_aggregate_boxplot.png` — box plot of combined data
- `profiling_aggregate_histograms.png` — histograms of combined data
Skipping Warmup Samples
The first few iterations of a model are often slower due to CUDA kernel compilation, memory
allocation, and cache warmup. Use --steady-state-skip to exclude the first N samples per
section:
python -m reactor_runtime.profiling.plotting.plot_profiling ./profiling/ --aggregate --steady-state-skip 10
Saving a Baseline
Save aggregate results as a baseline for future comparisons:
python -m reactor_runtime.profiling.plotting.plot_profiling ./profiling/ \
--aggregate --save-baseline ./baselines/
This creates a baselines/baseline_<timestamp>/ directory with all aggregate artifacts and
metadata.
Comparing Runs
Compare current profiling data against a saved baseline to detect performance regressions:
python -m reactor_runtime.profiling.plotting.plot_profiling ./profiling/ \
--aggregate --compare-to ./baselines/baseline_1700000000000/
This generates a profiling_comparison.md report that shows:
- Regressions — sections whose mean or P99 latency increased beyond the threshold
- Full comparison table — baseline vs. current for every section, with delta and percentage
Regression Threshold
By default, sections with >10% increase are flagged as regressions. Adjust with
--regression-threshold:
python -m reactor_runtime.profiling.plotting.plot_profiling ./profiling/ \
--aggregate --compare-to ./baselines/ --regression-threshold 5.0
Top Regressions
Show only the N worst regressions:
python -m reactor_runtime.profiling.plotting.plot_profiling ./profiling/ \
--aggregate --compare-to ./baselines/ --top-regressions 5
Visualization CLI Reference
python -m reactor_runtime.profiling.plotting.plot_profiling <path> [options]
| Flag | Description | Default |
|---|---|---|
| `path` | Path to a profiling JSON file or directory | Required |
| `--output`, `-o` | Output directory for plots | Interactive display |
| `--format`, `-f` | Output format: `png`, `pdf`, `svg` | `png` |
| `--no-box` | Skip box plot generation | Off |
| `--no-hist` | Skip histogram generation | Off |
| `--summary-only` | Print text summary only, no plots | Off |
| `--aggregate` | Aggregate multiple files into one report | Off |
| `--steady-state-skip` | Skip first N warmup samples per section | 0 |
| `--compare-to` | Baseline path for regression comparison | None |
| `--regression-threshold` | Percentage threshold for regression warnings | 10.0 |
| `--top-regressions` | Show only top N regressions | All |
| `--save-baseline` | Save aggregate results as baseline in this directory | None |
| `--include` | Only include sections matching this prefix | All |
| `--exclude` | Exclude sections matching this prefix | None |
| `--title` | Custom title for plots | Auto-generated |
| `--tag` | Tag to prefix to plot titles (e.g., PR-123) | None |
Thread Safety
The profiler uses a thread-local stack for tracking nested sections, so each thread maintains its
own section hierarchy independently. You can safely call profiler().section() from different
threads without interference.
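For example, a background worker can keep its own sections alongside the main session loop. A minimal sketch, assuming the session context is reachable from worker threads (the queue and encoder are illustrative):

```python
import threading
from reactor_runtime import get_ctx

def encode_worker(encoder, work_queue):
    # This thread maintains its own section stack, independent of the main loop.
    while not get_ctx().should_stop():
        with get_ctx().profiler().section("encode"):
            block = work_queue.get()   # illustrative work queue
            encoder.encode(block)      # illustrative encoder

# Typically started from the session, e.g. inside start_session():
# threading.Thread(target=encode_worker, args=(self.encoder, self.queue), daemon=True).start()
```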