# Performance Benchmarking Design
## Goal
Establish a reproducible performance baseline for waystty before tackling known bottlenecks. The primary metric is responsiveness under real workloads — not synthetic throughput scores.
## Non-goals
- vtebench integration (rewards batching, doesn't measure latency)
- tracy GPU profiling (GPU draw cost is negligible for a terminal; CPU-side bottlenecks dominate)
- Input-to-display latency measurement (out of scope for this phase)
## Known bottlenecks (to be measured, then fixed)
1. Atlas full re-upload on any new glyph — entire atlas through staging buffer + `queueWaitIdle` stall
2. Instance buffer map/unmap on every frame — host-visible memory can stay persistently mapped
3. Atlas staging buffer allocated/freed on every upload — should be persistent
4. Atlas image layout transitions start from `UNDEFINED`, which allows the driver to discard the existing contents; incremental updates should go `SHADER_READ_ONLY → TRANSFER_DST → SHADER_READ_ONLY` instead
## Module 1: Frame timing ring buffer
### Instrumented sections
Five sections timed with `std.time.Timer` on every rendered frame:
| Section | What it covers |
|---|---|
| `snapshot` | `term.snapshot()` |
| `row_rebuild` | refresh planning + dirty-row rebuild + cursor rebuild |
| `atlas_upload` | `ctx.uploadAtlas(...)` — zero when atlas is not dirty |
| `instance_upload` | `uploadInstances` / `uploadInstanceRange` |
| `gpu_submit` | fence wait + image acquire + command record + submit + present. Note: the fence wait blocks on the *previous* frame's GPU work, so this section includes the GPU execution time of frame N-1. That is intentional: the section reports the actual wall-clock time the CPU spends in this phase. |
Idle frames (no render) are not recorded.
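A minimal sketch of how consecutive sections could be timed with a single `std.time.Timer`, using `lap()` to read the elapsed nanoseconds and restart the timer in one call. The helper `nsToUs` and the `fakeSnapshot` / `fakeRowRebuild` functions are hypothetical stand-ins for illustration, not names from the waystty source.

```zig
const std = @import("std");

/// Convert a nanosecond reading to microseconds, saturating at the u32 maximum.
/// (Hypothetical helper, not from the waystty source.)
fn nsToUs(ns: u64) u32 {
    return std.math.cast(u32, ns / std.time.ns_per_us) orelse std.math.maxInt(u32);
}

// Hypothetical stand-ins for the real render phases.
fn fakeSnapshot() void {}
fn fakeRowRebuild() void {}

pub fn main() !void {
    var timer = try std.time.Timer.start();

    fakeSnapshot();
    // lap() returns the nanoseconds since the last start/lap and resets the timer.
    const snapshot_us = nsToUs(timer.lap());

    fakeRowRebuild();
    const row_rebuild_us = nsToUs(timer.lap());

    std.debug.print("snapshot={d}us row_rebuild={d}us\n", .{ snapshot_us, row_rebuild_us });
}
```

Using one timer with `lap()` keeps the sections contiguous, so their sum approximates the whole frame without a separate total timer.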
### Data structure
256-entry ring buffer of `FrameTiming` structs in `src/main.zig`. All fields are `u32` microseconds, so the buffer takes ~5 KiB (256 entries × 5 fields × 4 bytes). Always compiled in, since the timer reads add negligible overhead.
```zig
const FrameTiming = struct {
    snapshot_us: u32 = 0,
    row_rebuild_us: u32 = 0,
    atlas_upload_us: u32 = 0,
    instance_upload_us: u32 = 0,
    gpu_submit_us: u32 = 0,
};
```
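One possible shape for the ring itself, as a sketch under assumptions: the `TimingRing` type, its `head` / `len` fields, and the `push` / `slice` methods are hypothetical names chosen for illustration. Tracking `len` separately from `head` lets the stats code know how many entries are valid before the buffer has wrapped.

```zig
const std = @import("std");

const FrameTiming = struct {
    snapshot_us: u32 = 0,
    row_rebuild_us: u32 = 0,
    atlas_upload_us: u32 = 0,
    instance_upload_us: u32 = 0,
    gpu_submit_us: u32 = 0,
};

/// Fixed-capacity ring: once full, each push overwrites the oldest entry.
/// (Hypothetical type, not from the waystty source.)
const TimingRing = struct {
    const capacity = 256;

    entries: [capacity]FrameTiming = [_]FrameTiming{.{}} ** capacity,
    head: usize = 0, // next slot to write
    len: usize = 0, // number of valid entries, at most `capacity`

    fn push(self: *TimingRing, t: FrameTiming) void {
        self.entries[self.head] = t;
        self.head = (self.head + 1) % capacity;
        if (self.len < capacity) self.len += 1;
    }

    /// Valid entries in storage order (not chronological once wrapped),
    /// which is sufficient for min/avg/p99/max.
    fn slice(self: *const TimingRing) []const FrameTiming {
        return self.entries[0..self.len];
    }
};
```

Because the statistics are order-independent, the stats pass can iterate `slice()` directly and never needs to unroll the ring chronologically.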
### Stats output
Triggered on SIGUSR1 and on clean exit. Prints to stderr:
```
=== waystty frame timing (243 frames) ===
section            min    avg    p99    max  (µs)
snapshot             2      4     15     89
row_rebuild          1     12    124    890
atlas_upload         0    180   5200   8100
instance_upload      1      6     24     71
gpu_submit           3      8     35    210
─────────────────────────────────────────
total                9    210   5400   8800
```
p99 is computed per section by sorting a copy of the recorded values (at most 256; fewer if the ring has not filled, as in the 243-frame example above).
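The percentile step could look roughly like the sketch below. It assumes the nearest-rank definition (rank = ceil(pct/100 × n)), which the spec does not pin down, and the function name `percentile` is hypothetical. The copy goes into a fixed 256-entry stack buffer so no allocator is needed.

```zig
const std = @import("std");

/// Nearest-rank percentile over a scratch copy of `values`.
/// `values.len` must be between 1 and 256. (Hypothetical helper.)
fn percentile(values: []const u32, pct: usize) u32 {
    std.debug.assert(values.len > 0 and values.len <= 256);

    // Sort a copy so the caller's data stays untouched.
    var scratch: [256]u32 = undefined;
    const sorted = scratch[0..values.len];
    @memcpy(sorted, values);
    std.mem.sort(u32, sorted, {}, std.sort.asc(u32));

    // rank = ceil(pct * n / 100), then convert the 1-based rank to an index.
    const rank = (pct * values.len + 99) / 100;
    return sorted[@max(rank, 1) - 1];
}
```

With 243 recorded frames, the 99th percentile under this definition is the 241st smallest value, so the two or three slowest frames are excluded from p99 but still appear in `max`.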
## Module 2: Bench workload
### Mechanism
When `WAYSTTY_BENCH=1` env var is set at startup, spawn `sh -c '<bench script>'` instead of `$SHELL`. Stats are dumped to stderr on exit (clean shell exit triggers the normal exit path).
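A sketch of the startup decision, with several assumptions labeled: `Child`, `pickChild`, and the shortened `bench_script` placeholder are hypothetical, the `/bin/sh` fallback for an unset `$SHELL` is not specified by this document, and `std.posix.getenv` assumes a Zig release where the POSIX layer lives under `std.posix`.

```zig
const std = @import("std");

// Placeholder only; the real script is the one listed under "Workloads".
const bench_script = "echo warmup; sleep 0.2; exit 0";

/// What to spawn in the pty. (Hypothetical type, not from the waystty source.)
const Child = union(enum) {
    shell: []const u8, // path of the interactive shell
    bench: []const u8, // script passed to `sh -c`
};

fn pickChild(bench_env: ?[]const u8, shell_env: ?[]const u8) Child {
    if (bench_env) |v| {
        // Only the exact value "1" enables bench mode.
        if (std.mem.eql(u8, v, "1")) return .{ .bench = bench_script };
    }
    // Assumption: fall back to /bin/sh when $SHELL is unset.
    return .{ .shell = shell_env orelse "/bin/sh" };
}

pub fn main() void {
    const bench: ?[]const u8 = if (std.posix.getenv("WAYSTTY_BENCH")) |v| v else null;
    const shell: ?[]const u8 = if (std.posix.getenv("SHELL")) |v| v else null;
    switch (pickChild(bench, shell)) {
        .bench => |script| std.debug.print("spawn: sh -c '{s}'\n", .{script}),
        .shell => |path| std.debug.print("spawn: {s}\n", .{path}),
    }
}
```

Keeping the decision in a pure function that takes the environment values as arguments makes it testable without touching the process environment.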
### Workloads
```sh
echo warmup; sleep 0.2;
seq 1 50000;
find /usr/lib -name '*.so' 2>/dev/null | head -500;
yes 'hello world' | head -2000;
exit 0
```
- `echo warmup; sleep 0.2` — lets the atlas rasterize common ASCII before timing real workloads
- `seq` — burst of short sequential lines, tests frame batching and row rebuild
- `find` — irregular line lengths, mixed output cadence
- `yes` — high-frequency identical lines, tests the low-change-rate path
### Makefile target
```makefile
.PHONY: bench
bench: zig-out/bin/waystty
	WAYSTTY_BENCH=1 ./zig-out/bin/waystty 2>bench.log
	@echo "--- frame timing ---"
	@grep -A 12 "waystty frame timing" bench.log
```
## Module 3: perf + flamegraph
### Build mode
`ReleaseSafe` — keeps debug symbols and frame pointers. `ReleaseFast` may omit frame pointers, producing useless perf stacks.
### Makefile target
```makefile
FLAMEGRAPH ?= flamegraph.pl
STACKCOLLAPSE ?= stackcollapse-perf.pl
.PHONY: profile
profile:
	zig build -Doptimize=ReleaseSafe
# exec replaces the wrapper sh with waystty, so --no-inherit samples waystty itself
# rather than depending on whether sh forks before running it.
	WAYSTTY_BENCH=1 perf record -g -F 999 --no-inherit -o perf.data -- \
		sh -c 'exec ./zig-out/bin/waystty 2>bench.log'
	perf script -i perf.data \
		| $(STACKCOLLAPSE) \
		| $(FLAMEGRAPH) > flamegraph.svg
	@echo "--- frame timing ---"
	@grep -A 12 "waystty frame timing" bench.log
	xdg-open flamegraph.svg
```
`FLAMEGRAPH` and `STACKCOLLAPSE` default to scripts in `PATH` (available via `flamegraph` package on Arch), overridable: `make profile FLAMEGRAPH=~/FlameGraph/flamegraph.pl`.
### Prerequisites
- `flamegraph` package (provides `flamegraph.pl` and `stackcollapse-perf.pl`)
- `perf` with `CAP_PERFMON` or `/proc/sys/kernel/perf_event_paranoid <= 1`
## Files changed
- `src/main.zig` — ring buffer, section timers, SIGUSR1 handler, `WAYSTTY_BENCH` env check
- `Makefile` — `bench` and `profile` targets
## Testing
- Run `make bench`, verify stats appear in bench.log
- Send SIGUSR1 to a running waystty, verify stats print to stderr
- Run `make profile`, verify flamegraph.svg opens and shows waystty frames