docs/superpowers/specs/2026-04-10-performance-benchmarking-design.md

# Performance Benchmarking Design

## Goal

Establish a reproducible performance baseline for waystty before tackling known bottlenecks. The primary metric is responsiveness under real workloads — not synthetic throughput scores.

## Non-goals

- vtebench integration (rewards batching, doesn't measure latency)
- tracy GPU profiling (GPU draw cost is negligible for a terminal; CPU-side bottlenecks dominate)
- Input-to-display latency measurement (out of scope for this phase)

## Known bottlenecks (to be measured, then fixed)

1. Atlas full re-upload on any new glyph — entire atlas through staging buffer + `queueWaitIdle` stall
2. Instance buffer map/unmap on every frame — host-visible memory can stay persistently mapped
3. Atlas staging buffer allocated/freed on every upload — should be persistent
4. Atlas image layout transitions start from `UNDEFINED`, which allows the driver to discard existing contents; incremental updates should go `SHADER_READ_ONLY_OPTIMAL → TRANSFER_DST_OPTIMAL → SHADER_READ_ONLY_OPTIMAL`

## Module 1: Frame timing ring buffer

### Instrumented sections

Five sections timed with `std.time.Timer` on every rendered frame:

| Section | What it covers |
|---|---|
| `snapshot` | `term.snapshot()` |
| `row_rebuild` | refresh planning + dirty-row rebuild + cursor rebuild |
| `atlas_upload` | `ctx.uploadAtlas(...)` — zero when atlas is not dirty |
| `instance_upload` | `uploadInstances` / `uploadInstanceRange` |
| `gpu_submit` | fence wait + image acquire + command record + submit + present. Note: the fence wait blocks on the *previous* frame's GPU work, so this section includes GPU execution time of frame N-1. This is intentional: the value recorded is the actual wall-clock cost of this phase on the render thread. |

Idle frames (no render) are not recorded.
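
As a rough illustration of the per-section timing, the sketch below reads one `std.time.Timer` with `lap()` after each section. It assumes a Zig version in which `std.time.Timer` is available; `nsToUs` and `busyWork` are hypothetical names used only for this example and are not taken from waystty.

```zig
const std = @import("std");

/// Convert nanoseconds to whole microseconds, saturating at the u32 maximum.
/// (Hypothetical helper, not taken from waystty.)
fn nsToUs(ns: u64) u32 {
    return std.math.cast(u32, ns / std.time.ns_per_us) orelse std.math.maxInt(u32);
}

/// Stand-in for a real section such as `term.snapshot()`.
fn busyWork() void {
    var sum: u64 = 0;
    var i: u64 = 0;
    while (i < 100_000) : (i += 1) sum +%= i;
    std.mem.doNotOptimizeAway(sum);
}

pub fn main() !void {
    var timer = try std.time.Timer.start();

    busyWork();
    // lap() returns the nanoseconds elapsed since start (or the previous lap)
    // and resets the timer, so consecutive sections do not overlap.
    const snapshot_us = nsToUs(timer.lap());

    busyWork();
    const row_rebuild_us = nsToUs(timer.lap());

    std.debug.print("snapshot={d}us row_rebuild={d}us\n", .{ snapshot_us, row_rebuild_us });
}
```

Using a single timer with `lap()` keeps the sections contiguous, so their sum approximates the frame's total CPU-side time.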

### Data structure

256-entry ring buffer of `FrameTiming` structs in `src/main.zig`. All fields are `u32` microseconds, so each entry is 20 bytes and the buffer is 5 KiB total. Always compiled in; timer reads are negligible overhead.

```zig
const FrameTiming = struct {
    snapshot_us: u32 = 0,
    row_rebuild_us: u32 = 0,
    atlas_upload_us: u32 = 0,
    instance_upload_us: u32 = 0,
    gpu_submit_us: u32 = 0,
};
```
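
One possible shape for the ring itself is sketched here. The names `FrameTimingRing`, `ring_capacity`, `head`, and `count` are assumptions made for illustration; the spec only fixes the entry type and the 256-entry capacity.

```zig
const std = @import("std");

const FrameTiming = struct {
    snapshot_us: u32 = 0,
    row_rebuild_us: u32 = 0,
    atlas_upload_us: u32 = 0,
    instance_upload_us: u32 = 0,
    gpu_submit_us: u32 = 0,
};

const ring_capacity = 256;

/// Fixed-size ring: once full, each push overwrites the oldest entry.
const FrameTimingRing = struct {
    entries: [ring_capacity]FrameTiming = [_]FrameTiming{.{}} ** ring_capacity,
    head: usize = 0, // next slot to write
    count: usize = 0, // number of valid entries, capped at ring_capacity

    fn push(self: *FrameTimingRing, t: FrameTiming) void {
        self.entries[self.head] = t;
        self.head = (self.head + 1) % ring_capacity;
        if (self.count < ring_capacity) self.count += 1;
    }
};

pub fn main() void {
    var ring = FrameTimingRing{};
    var i: u32 = 0;
    // Push more frames than the ring holds to exercise the wrap-around.
    while (i < 300) : (i += 1) ring.push(.{ .snapshot_us = i });
    std.debug.print("valid entries: {d}, next slot: {d}\n", .{ ring.count, ring.head });
}
```

Tracking `count` separately from `head` lets the stats code report the real number of recorded frames before the ring has filled.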

### Stats output

Triggered on SIGUSR1 and on clean exit. Prints to stderr:

```
=== waystty frame timing (243 frames) ===
section          min    avg    p99    max  (µs)
snapshot           2      4     15     89
row_rebuild        1     12    124    890
atlas_upload       0    180   5200   8100
instance_upload    1      6     24     71
gpu_submit         3      8     35    210
─────────────────────────────────────────
total              9    210   5400   8800
```

p99 is computed by sorting a copy of the recorded values (up to 256) for each section.
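
The per-section statistics could be derived along these lines. This is a sketch, not the actual implementation: `SectionStats` and `sectionStats` are hypothetical names, and the nearest-rank definition of p99 is an assumption, since the spec only says the values are sorted.

```zig
const std = @import("std");

const SectionStats = struct { min: u32, avg: u32, p99: u32, max: u32 };

/// Compute min/avg/p99/max for one section.
/// `samples` must be non-empty and hold at most 256 values.
fn sectionStats(samples: []const u32) SectionStats {
    std.debug.assert(samples.len > 0 and samples.len <= 256);

    // Sort a copy so the ring buffer itself stays in frame order.
    var buf: [256]u32 = undefined;
    const sorted = buf[0..samples.len];
    @memcpy(sorted, samples);
    std.mem.sort(u32, sorted, {}, std.sort.asc(u32));

    var sum: u64 = 0;
    for (sorted) |v| sum += v;
    const avg: u32 = @intCast(sum / sorted.len);

    // Nearest-rank p99 (assumed definition): 1-based rank = ceil(0.99 * n).
    const rank = (sorted.len * 99 + 99) / 100;

    return .{
        .min = sorted[0],
        .avg = avg,
        .p99 = sorted[rank - 1],
        .max = sorted[sorted.len - 1],
    };
}

pub fn main() void {
    const samples = [_]u32{ 4, 2, 89, 3, 15, 5 };
    const s = sectionStats(&samples);
    std.debug.print("min={d} avg={d} p99={d} max={d}\n", .{ s.min, s.avg, s.p99, s.max });
}
```

With at most 256 samples, p99 sits within the top three values, so a single slow frame can dominate it; reading it alongside `max` helps tell a one-off stall from a recurring one.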

## Module 2: Bench workload

### Mechanism

When `WAYSTTY_BENCH=1` env var is set at startup, spawn `sh -c '<bench script>'` instead of `$SHELL`. Stats are dumped to stderr on exit (clean shell exit triggers the normal exit path).
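
The selection between the bench script and the user's shell might look like the following sketch. It takes the environment value as a parameter rather than reading it, to stay independent of how waystty actually looks up the environment. `benchEnabled`, `childArgv`, and the shortened `bench_script` are hypothetical; the real script is the one listed under Workloads.

```zig
const std = @import("std");

// Shortened placeholder for illustration only.
const bench_script = "echo warmup; sleep 0.2; exit 0";

/// True only when the variable is set to exactly "1" (assumed reading of the spec).
fn benchEnabled(value: ?[]const u8) bool {
    const v = value orelse return false;
    return std.mem.eql(u8, v, "1");
}

/// Fill `buf` with the argv of the child to spawn and return the used prefix.
fn childArgv(bench_env: ?[]const u8, shell: []const u8, buf: *[3][]const u8) []const []const u8 {
    if (benchEnabled(bench_env)) {
        buf[0] = "sh";
        buf[1] = "-c";
        buf[2] = bench_script;
        return buf[0..3];
    }
    buf[0] = shell;
    return buf[0..1];
}

pub fn main() void {
    var buf: [3][]const u8 = undefined;
    const argv = childArgv("1", "/bin/bash", &buf);
    for (argv) |arg| std.debug.print("[{s}] ", .{arg});
    std.debug.print("\n", .{});
}
```

Because the bench script ends with `exit 0`, the child terminates on its own and the normal exit path produces the stats dump without extra bench-specific code.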

### Workloads

```sh
echo warmup; sleep 0.2;
seq 1 50000;
find /usr/lib -name '*.so' 2>/dev/null | head -500;
yes 'hello world' | head -2000;
exit 0
```

- `echo warmup; sleep 0.2` — lets the atlas rasterize common ASCII before timing real workloads
- `seq` — burst of short sequential lines, tests frame batching and row rebuild
- `find` — irregular line lengths, mixed output cadence
- `yes` — high-frequency identical lines, tests the low-change-rate path

### Makefile target

```makefile
.PHONY: bench
bench: zig-out/bin/waystty
	WAYSTTY_BENCH=1 ./zig-out/bin/waystty 2>bench.log
	@echo "--- frame timing ---"
	@grep -A 12 "waystty frame timing" bench.log
```

## Module 3: perf + flamegraph

### Build mode

`ReleaseSafe` — keeps debug symbols and frame pointers. `ReleaseFast` may omit frame pointers, producing useless perf stacks.

### Makefile target

```makefile
FLAMEGRAPH ?= flamegraph.pl
STACKCOLLAPSE ?= stackcollapse-perf.pl

.PHONY: profile
profile:
	zig build -Doptimize=ReleaseSafe
	perf record -g -F 999 --no-inherit -o perf.data -- \
		sh -c 'WAYSTTY_BENCH=1 exec ./zig-out/bin/waystty 2>bench.log'
	perf script -i perf.data \
		| $(STACKCOLLAPSE) \
		| $(FLAMEGRAPH) > flamegraph.svg
	@echo "--- frame timing ---"
	@grep -A 12 "waystty frame timing" bench.log
	xdg-open flamegraph.svg
```

`FLAMEGRAPH` and `STACKCOLLAPSE` default to scripts in `PATH` (available via `flamegraph` package on Arch), overridable: `make profile FLAMEGRAPH=~/FlameGraph/flamegraph.pl`.

### Prerequisites

- `flamegraph` package (provides `flamegraph.pl` and `stackcollapse-perf.pl`)
- `perf` with `CAP_PERFMON` or `/proc/sys/kernel/perf_event_paranoid <= 1`

## Files changed

- `src/main.zig` — ring buffer, section timers, SIGUSR1 handler, `WAYSTTY_BENCH` env check
- `Makefile` — `bench` and `profile` targets

## Testing

- Run `make bench`, verify stats appear in bench.log
- Send SIGUSR1 to a running waystty, verify stats print to stderr
- Run `make profile`, verify flamegraph.svg opens and shows waystty frames