docs/superpowers/specs/2026-04-18-input-latency-bench-design.md

# Input-Latency Bench Design

## Goal

Measure waystty's keystroke-to-display latency in a reproducible, closed-loop, automated way. This complements the existing output-only bench (`docs/superpowers/specs/2026-04-10-performance-benchmarking-design.md`), which deliberately excluded input responsiveness.

The output is two headline metrics plus a per-stage breakdown:

- **Cold latency** — terminal is idle; press a key, measure until the echo appears on screen. Tests the best-case path.
- **Hot latency** — the child process is producing continuous output at a controlled rate; press a key while output contends with the sentinel on the PTY. Measures contended-PTY responsiveness, which is where terminals diverge in feel.

## Non-goals

- Real photon latency (requires external hardware; not automatable).
- Virtual-keyboard-protocol injection through the compositor (adds compositor and xkb as variables; worth having as a separate integration check later).
- Scriptable screen-state assertions beyond sentinel detection (YAGNI; the primitives this design uses are extensible to that later).
- Changing waystty's normal present mode. MAILBOX stays the default.

## Architecture

New env var `WAYSTTY_INPUT_BENCH={cold|hot|both}` activates input-bench mode.
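A minimal sketch of how the variable's value might be mapped to a mode. The `InputBenchMode` enum and `parseInputBenchMode` function are hypothetical names, not existing waystty code; the caller is assumed to pass in whatever its environment lookup returned.

```zig
const std = @import("std");

/// Hypothetical enum mirroring the three accepted values of WAYSTTY_INPUT_BENCH.
const InputBenchMode = enum { cold, hot, both };

/// Returns null when the variable is unset or holds an unrecognized value,
/// so the caller can fall back to normal (non-bench) startup.
fn parseInputBenchMode(value: ?[]const u8) ?InputBenchMode {
    const v = value orelse return null;
    // stringToEnum matches the tag names exactly, so "Hot" or "HOT" is rejected.
    return std.meta.stringToEnum(InputBenchMode, v);
}

test "WAYSTTY_INPUT_BENCH parsing" {
    try std.testing.expect(parseInputBenchMode("cold").? == .cold);
    try std.testing.expect(parseInputBenchMode("both").? == .both);
    try std.testing.expect(parseInputBenchMode("warm") == null);
    try std.testing.expect(parseInputBenchMode(null) == null);
}
```

Treating an unrecognized value as "not set" is one option; rejecting it with an error on stderr would be stricter and may suit a bench better.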

Grid-lock (below) is shared infrastructure — applied whenever *any* bench mode is active (`WAYSTTY_BENCH` or `WAYSTTY_INPUT_BENCH`). This retrofits the existing output-only bench so both benches have the same reproducibility guarantees.

When input-bench mode is active, waystty:

1. **Locks the grid to a known size** (see Grid-lock module below). Default 80×24, overridable via `WAYSTTY_BENCH_COLS` / `WAYSTTY_BENCH_ROWS`. This removes font/DPI/monitor variance so numbers are reproducible across machines and so hot-mode sentinel lifetime is a computed guarantee, not a hope.
2. Spawns a minimal PTY child instead of `$SHELL`:
   - `cold` → `cat > /dev/null`. Reads and discards stdin so the PTY slave's input buffer stays drained; the kernel PTY line discipline echoes each byte (ECHO flag on the slave). Canonical-mode line buffering is irrelevant: we measure the kernel echo, not cat's reads.
   - `hot` → `bash -c 'yes "$(printf "x%.0s" {1..500})" | pv -qL 24K'`. Produces 500-char lines rate-limited to 24 KiB/s, ≈ 49 lines/sec. (`bash` rather than `sh`, because `{1..500}` brace expansion is not POSIX; under `dash` the line would be a single `x`.) On an 80×24 grid, counting one grid row per line, this gives a sentinel lifetime of ≈30 frames at 60Hz. That is enough time for any reasonable presentation to land, while still exercising the render pipeline (24 KiB/s of parsing, atlas churn on long lines, per-frame row rebuild).
   - `both` → runs cold, prints stats, terminates the cold child, runs hot.
3. **Child teardown**: at scenario end or shutdown, send SIGTERM; wait up to 100ms; send SIGKILL if still alive; `waitpid` to reap. Prevents stuck `pv` or `yes` from surviving the bench.
4. **Drops only `.key` events** from the real Wayland keyboard in bench mode. Keymap, enter/leave, modifier updates, and repeat-timer state still flow through `src/wayland.zig` so other subsystems (clipboard, paste, focus) keep working; we just don't let ambient typing land on `keyboard.event_queue`.
5. Runs the `BenchDriver` (new, in `src/bench_input.zig`) alongside the normal main loop. Driver owns the sentinel allocator, the in-flight sample, the pending-feedback map, and the stats collector.
6. On exit (N samples reached per scenario, or abort), prints stats to stderr in the existing output-bench format and exits cleanly.
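The sentinel-lifetime figure for the hot workload in step 2 can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: `pv`'s `K` suffix is read as 1024 bytes (so `24K` is 24576 bytes/s), each line is 500 characters plus a newline, and the sentinel needs a full grid height of scroll before it leaves the screen. The function name and parameters are illustrative only. The `wraps` flag covers the case where the terminal wraps a 500-character line across 80-column rows, which the text above does not address.

```zig
const std = @import("std");

/// Integer ceiling division.
fn divCeil(a: u64, b: u64) u64 {
    return (a + b - 1) / b;
}

/// Whole frames a sentinel stays on screen before the output scrolls it off.
/// Illustrative helper; none of these parameters are waystty APIs.
fn sentinelLifetimeFrames(
    bytes_per_sec: u64,
    line_chars: u64,
    grid_cols: u64,
    grid_rows: u64,
    refresh_hz: u64,
    wraps: bool,
) u64 {
    const bytes_per_line = line_chars + 1; // payload plus '\n'
    const rows_per_line: u64 = if (wraps) divCeil(line_chars, grid_cols) else 1;
    // rows_per_sec = bytes_per_sec / bytes_per_line * rows_per_line
    // lifetime     = grid_rows / rows_per_sec * refresh_hz
    // Rearranged so the single integer division happens last.
    return (grid_rows * refresh_hz * bytes_per_line) / (bytes_per_sec * rows_per_line);
}

test "hot workload sentinel lifetime" {
    // One grid row per line: about 29 frames at 60 Hz.
    try std.testing.expect(sentinelLifetimeFrames(24 * 1024, 500, 80, 24, 60, false) == 29);
    // Lines wrapped at 80 columns (7 rows each): about 4 frames.
    try std.testing.expect(sentinelLifetimeFrames(24 * 1024, 500, 80, 24, 60, true) == 4);
}
```

The unwrapped case reproduces the ≈30-frame figure. If waystty wraps long lines, each line consumes 7 rows and the lifetime drops to roughly 4 frames, so which case applies is worth confirming on a real run before relying on the number.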

MAILBOX present mode is preserved throughout. The driver handles `wp_presentation_feedback.discarded` by keeping feedback listeners live on subsequent frames, since a discarded frame's sentinel is still in the grid on the next frame.

## Module 0: Grid-lock (shared across all bench modes)

Applied whenever `WAYSTTY_BENCH` or `WAYSTTY_INPUT_BENCH` is set.

1. Size the initial window to `cols × cell_w` / `rows × cell_h` as today.
2. Advertise `xdg_toplevel.set_min_size(w, h)` and `set_max_size(w, h)` to signal that the window should not be resized. Compositors that honor these hints (most floating compositors) will leave the window alone.
3. In the main-loop resize observer (`src/main.zig:409`), if bench mode is active and the compositor forces a different size, **abort** the bench with a diagnostic on stderr:

   > `waystty bench: compositor sized window to WxH, expected CxR grid. Run in a floating window or a non-tiling compositor for reproducible benchmarks.`

4. Print the achieved grid size as the first line of any bench stats output, so it's always visible alongside the numbers.

This retrofits the existing `WAYSTTY_BENCH` mode. Today it starts at 80×24 but silently accepts a compositor resize, so its numbers are already compositor-dependent. After this change, an output-bench run either holds the locked grid or fails loudly.
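The size check in step 3 could look roughly like the following. `GridLock` and its methods are hypothetical names for illustration; the cell dimensions are assumed to come from the font metrics the renderer already computes.

```zig
const std = @import("std");

/// Hypothetical holder for the locked grid and the cell size in pixels.
const GridLock = struct {
    cols: u32 = 80,
    rows: u32 = 24,
    cell_w: u32,
    cell_h: u32,

    /// True when the compositor-configured pixel size matches the locked grid.
    fn matches(self: GridLock, width: u32, height: u32) bool {
        return width == self.cols * self.cell_w and height == self.rows * self.cell_h;
    }

    /// Formats the abort diagnostic into `buf` and returns the written slice.
    fn diagnostic(self: GridLock, buf: []u8, width: u32, height: u32) ![]u8 {
        return std.fmt.bufPrint(
            buf,
            "waystty bench: compositor sized window to {d}x{d}, expected {d}x{d} grid. " ++
                "Run in a floating window or a non-tiling compositor for reproducible benchmarks.",
            .{ width, height, self.cols, self.rows },
        );
    }
};

test "grid lock detects a forced resize" {
    const lock = GridLock{ .cell_w = 9, .cell_h = 18 };
    try std.testing.expect(lock.matches(720, 432)); // 80*9 by 24*18
    try std.testing.expect(!lock.matches(960, 540));
}
```

Comparing pixel sizes exactly is the strictest reading of "forces a different size"; comparing the derived column and row counts instead would tolerate a few spare pixels of padding.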

## Module 1: Sentinel allocation

Each sample uses a unique codepoint from the Unicode Private Use Area U+E000…U+EFFF (4096 distinct sentinels). PUA is chosen because it never appears in normal output from `cat`, `yes`, or `pv`, so a grid scan for a specific codepoint cannot collide with unrelated output. Only one sample is in flight at a time, so wraparound at 4096 is safe.

PUA codepoints render as `.notdef` (tofu) for most fonts; this is fine — the grid cell carries the codepoint regardless of glyph appearance.
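A minimal sketch of the allocator described above, assuming a simple wrapping counter is enough because only one sample is in flight at a time. `SentinelAllocator` is an illustrative name.

```zig
const std = @import("std");

/// Hands out codepoints U+E000..U+EFFF in order, wrapping after 4096.
const SentinelAllocator = struct {
    const first: u21 = 0xE000;
    const count: u21 = 0x1000; // 4096 codepoints

    index: u21 = 0,

    fn next(self: *SentinelAllocator) u21 {
        const cp = first + self.index;
        self.index = (self.index + 1) % count;
        return cp;
    }
};

test "4096 distinct sentinels, then wraparound" {
    var alloc = SentinelAllocator{};
    var i: u32 = 0;
    while (i < 4096) : (i += 1) {
        try std.testing.expect(alloc.next() == 0xE000 + i);
    }
    try std.testing.expect(alloc.next() == 0xE000);
}
```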

## Module 2: Injection

A fabricated `wayland_client.Keyboard.KeyEvent` is pushed onto `keyboard.event_queue`. The event has `utf8` filled with the sentinel's UTF-8 encoding and `utf8_len` set appropriately. No other fields (keysym, modifiers) are needed for the UTF-8 path at `src/main.zig:393`:

```zig
if (ev.utf8_len > 0) {
    _ = try p.write(ev.utf8[0..ev.utf8_len]);
}
```

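`t_inject` is captured immediately before the push, using the same clock as the `wp_presentation_feedback` timestamps. The compositor announces that clock through the `wp_presentation.clock_id` event (commonly `CLOCK_MONOTONIC`), so read it with `clock_gettime` on the advertised id. Do not use `std.time.Instant.now()` here: in recent Zig standard libraries it reads `CLOCK_BOOTTIME` on Linux, which diverges from `CLOCK_MONOTONIC` once the machine has been suspended. (Leave a source comment stating which clock is assumed.)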

The real `src/wayland.zig` keyboard binding remains connected; only `.key` events are suppressed in bench mode (see Architecture §4).
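As an illustration of the fabricated event, the sketch below fills the two fields the UTF-8 path reads. The `KeyEvent` struct here is a stand-in: the real `wayland_client.Keyboard.KeyEvent` has more fields, and the 8-byte buffer size is an assumption. `std.unicode.utf8Encode` is the standard-library encoder.

```zig
const std = @import("std");

/// Stand-in for the real KeyEvent; only `utf8` and `utf8_len` are modeled,
/// and the buffer size is assumed.
const KeyEvent = struct {
    utf8: [8]u8 = [_]u8{0} ** 8,
    utf8_len: u8 = 0,
};

/// Builds an event whose payload is the UTF-8 encoding of `codepoint`.
fn makeSentinelEvent(codepoint: u21) !KeyEvent {
    var ev = KeyEvent{};
    ev.utf8_len = try std.unicode.utf8Encode(codepoint, &ev.utf8);
    return ev;
}

test "sentinel U+E000 encodes to three bytes" {
    const ev = try makeSentinelEvent(0xE000);
    try std.testing.expect(ev.utf8_len == 3);
    try std.testing.expect(std.mem.eql(u8, ev.utf8[0..ev.utf8_len], "\xEE\x80\x80"));
}
```

Every codepoint in U+E000..U+EFFF encodes to exactly three bytes, so each sample writes three bytes to the PTY.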

## Module 3: PTY termios

At PTY spawn, the bench explicitly validates the slave's termios:

1. `tcgetattr(slave_fd, &tio)`.
2. If `tio.c_lflag & ECHO` is unset, set it and `tcsetattr(slave_fd, TCSANOW, &tio)` before the child's `exec`.
3. Leave canonical mode (`ICANON`) alone: the measurement is kernel-echo-based, and echo fires per byte regardless of canonical mode.

This removes any assumption about defaults and makes the bench portable across distros.

## Module 4: Presentation feedback

Requires binding the `presentation-time` protocol (global `wp_presentation`). The repo does not currently bind this protocol; this is real implementation work, not plumbing.

Flow per frame:

1. Immediately before each `vkQueuePresentKHR` that follows any pending injection, call `wp_presentation.feedback(surface)` to get a new `wp_presentation_feedback` object. Record the association: `{frame_counter → feedback_object}`.
2. The feedback object fires one of two events eventually:
   - `presented(tv_sec_hi, tv_sec_lo, tv_nsec, refresh, seq_hi, seq_lo, flags)`: the frame was displayed. Compute `tv_sec = (tv_sec_hi << 32) | tv_sec_lo`, then `t_present = tv_sec * 1e9 + tv_nsec` (nanoseconds in the clock announced by `wp_presentation.clock_id`, typically CLOCK_MONOTONIC).
   - `discarded`: the frame was superseded and never shown. Drop the feedback object but keep listeners on subsequent frames; the sentinel is still in the grid for the next frame.
3. The Vulkan WSI owns the Wayland `wl_surface.commit` call; `wp_presentation.feedback` attaches to the next commit on the surface. As long as the feedback request precedes `vkQueuePresentKHR`, it captures the right commit.
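The timestamp in the `presented` event arrives as three 32-bit values. A small sketch of folding them into a single nanosecond count, with `presentedNanos` as a hypothetical helper name:

```zig
const std = @import("std");

/// Combines the split seconds field and the nanosecond remainder from
/// wp_presentation_feedback.presented into one nanosecond timestamp.
fn presentedNanos(tv_sec_hi: u32, tv_sec_lo: u32, tv_nsec: u32) u64 {
    const tv_sec = (@as(u64, tv_sec_hi) << 32) | @as(u64, tv_sec_lo);
    return tv_sec * std.time.ns_per_s + tv_nsec;
}

test "presented timestamp assembly" {
    try std.testing.expect(presentedNanos(0, 5, 250) == 5_000_000_250);
    try std.testing.expect(presentedNanos(1, 0, 0) == 4_294_967_296_000_000_000);
}
```

The result is only comparable with `t_inject` if both were read from the same clock.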

## Module 5: Pair-on-arrival matching

An in-flight sample has two asynchronous completions:

- **Grid-side:** scan each frame's `term.snapshot()` for the sentinel codepoint. Record the first frame K where the sentinel is present.
- **Feedback-side:** when `presented` fires for frame K's feedback object, record `t_present_K`.

The sample completes when *both* are known for the same frame K. Latency = `t_present_K - t_inject`. Store in a ring buffer (`u64` nanoseconds).

If frame K's feedback is `discarded`, advance to frame K+1's feedback (still looking for the sentinel in K+1's grid). The sentinel cell remains until it scrolls off, which the hot workload is sized to delay for ≈30 frames, i.e. half of the 60-frame timeout window below.

A sample times out if no match within `max_frames_per_sample = 60` (~1s at 60Hz). Timeouts are counted as a separate stat, not folded into the latency distribution.

Grid scan cost: for an 80×24 grid that's 1920 cells per frame — negligible.
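One way to sketch the pairing logic is a small state machine that accepts the grid-side and feedback-side events in either order. `PairMatcher`, `Presented`, and the four-entry hold buffer are illustrative choices, not existing code. The sketch treats "advance past `discarded`" as "take the first `presented` frame at or after K", which is equivalent as long as every frame gets a feedback object.

```zig
const std = @import("std");

const Presented = struct { frame: u64, t_ns: u64 };

const PairMatcher = struct {
    t_inject_ns: u64,
    /// Frame K: first frame whose grid contained the sentinel.
    first_sentinel_frame: ?u64 = null,
    /// `presented` events that arrived before frame K was known.
    held: [4]?Presented = [_]?Presented{null} ** 4,
    held_next: usize = 0,
    latency_ns: ?u64 = null,

    /// Grid side: the scan found the sentinel in `frame`.
    fn onSentinelSeen(self: *PairMatcher, frame: u64) void {
        if (self.first_sentinel_frame != null) return; // keep only the first K
        self.first_sentinel_frame = frame;
        // Use the earliest held presentation at or after K, if any.
        var best: ?Presented = null;
        for (self.held) |slot| {
            const p = slot orelse continue;
            if (p.frame < frame) continue;
            if (best == null or p.frame < best.?.frame) best = p;
        }
        if (best) |p| self.complete(p);
    }

    /// Feedback side: `presented` fired for `frame`.
    fn onPresented(self: *PairMatcher, frame: u64, t_ns: u64) void {
        if (self.latency_ns != null) return;
        const p = Presented{ .frame = frame, .t_ns = t_ns };
        if (self.first_sentinel_frame) |k| {
            if (frame >= k) self.complete(p);
        } else {
            self.held[self.held_next] = p;
            self.held_next = (self.held_next + 1) % self.held.len;
        }
    }

    /// Feedback side: `discarded` fired. Nothing to record; a later
    /// `presented` frame still shows the sentinel.
    fn onDiscarded(_: *PairMatcher, _: u64) void {}

    fn complete(self: *PairMatcher, p: Presented) void {
        // Saturating subtraction guards against mismatched clocks.
        if (self.latency_ns == null) self.latency_ns = p.t_ns -| self.t_inject_ns;
    }
};

test "matches in either order and skips discarded frames" {
    var a = PairMatcher{ .t_inject_ns = 1_000 };
    a.onSentinelSeen(7);
    a.onDiscarded(7);
    a.onPresented(8, 17_000);
    try std.testing.expect(a.latency_ns.? == 16_000);

    var b = PairMatcher{ .t_inject_ns = 1_000 };
    b.onPresented(6, 5_000); // frame before the sentinel appeared
    b.onPresented(7, 9_000);
    b.onSentinelSeen(7);
    try std.testing.expect(b.latency_ns.? == 8_000);
}
```

The timeout counter from the next paragraph is left out here to keep the sketch focused on pairing.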

## Module 6: Frame-K correlation

The existing `FrameTiming` struct in `src/bench_stats.zig` is positional in its ring, with no notion of absolute frame identity. To correlate a sample with the timing entry for frame K, add a monotonic `frame_counter: u64` field to `FrameTiming`, incremented once per rendered frame. The bench driver records frame K on each sample; the output stage can then join samples against timing entries for the per-stage breakdown.

## Module 7: Sample scheduling & WSI fallback

After a sample matches, schedule the next injection on the first main-loop iteration following the *next* frame (one-frame floor — prevents two injections per frame without a magic millisecond number).

Target sample count per scenario: **500**.

**WSI fallback:** If more than 10% of the first 50 samples time out, abort the bench with a diagnostic error on stderr naming `wp_presentation.feedback` commit race with the Mesa WSI thread as the likely cause, and suggesting `VK_KHR_present_wait` as an alternative coordination strategy to investigate. Prevents shipping garbage numbers silently when the feedback attachment misbehaves.
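The fallback trigger reduces to a counter over the first 50 samples. The sketch below is one interpretation: it fires as soon as the sixth timeout is recorded inside that window, since six timeouts already exceed 10% of 50 regardless of the remaining samples. `WsiFallback` is a hypothetical name.

```zig
const std = @import("std");

const WsiFallback = struct {
    const window: u32 = 50;
    const max_timeouts: u32 = window / 10; // 10% of the window, i.e. 5

    samples: u32 = 0,
    timeouts: u32 = 0,

    /// Records one finished sample; returns true when the bench should abort.
    fn record(self: *WsiFallback, timed_out: bool) bool {
        if (self.samples >= window) return false; // only the first 50 count
        self.samples += 1;
        if (timed_out) self.timeouts += 1;
        return self.timeouts > max_timeouts;
    }
};

test "fires on the sixth timeout and not before" {
    var f = WsiFallback{};
    var i: u32 = 0;
    while (i < 5) : (i += 1) try std.testing.expect(!f.record(true));
    try std.testing.expect(f.record(true));
}
```

Waiting until all 50 samples are in before deciding would also satisfy the rule; firing early just shortens a run that is already known to be bad.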

## Module 8: Output

On exit, prints to stderr:

```
=== waystty input latency (500 cold, 500 hot, 80x24 grid) ===
scenario     min     avg     p50     p99     max (µs)  timeouts
cold        1200    4800    4100    9200   14100              0
hot         3400   12700   11500   28400   42000              3
───────────────────────────────────────────────────────────
```

All latencies in µs, matching the existing output-bench format.
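A sketch of how the headline columns could be derived from the collected samples. It assumes the nearest-rank percentile definition, which the design does not specify, and Zig 0.11 or newer for `std.mem.sort`. `Summary`, `percentile`, and `summarize` are illustrative names; the functions are unit-agnostic, so they work on nanoseconds or microseconds alike.

```zig
const std = @import("std");

const Summary = struct { min: u64, avg: u64, p50: u64, p99: u64, max: u64 };

/// Nearest-rank percentile over an ascending-sorted, non-empty slice.
/// `pct` is expected to be in 1..100.
fn percentile(sorted: []const u64, pct: usize) u64 {
    const rank = (pct * sorted.len + 99) / 100; // ceil(pct/100 * n)
    return sorted[rank - 1];
}

/// Sorts `samples` in place and returns the headline statistics.
fn summarize(samples: []u64) Summary {
    std.debug.assert(samples.len > 0);
    std.mem.sort(u64, samples, {}, std.sort.asc(u64));
    var sum: u64 = 0;
    for (samples) |s| sum += s;
    return .{
        .min = samples[0],
        .avg = sum / @as(u64, @intCast(samples.len)),
        .p50 = percentile(samples, 50),
        .p99 = percentile(samples, 99),
        .max = samples[samples.len - 1],
    };
}

test "summary over five samples" {
    var data = [_]u64{ 50, 10, 40, 20, 30 };
    const s = summarize(&data);
    try std.testing.expect(s.min == 10 and s.max == 50);
    try std.testing.expect(s.avg == 30 and s.p50 == 30 and s.p99 == 50);
}
```

Timed-out samples are assumed to be excluded before this step, matching the rule that timeouts are reported as a separate column.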

Per-stage breakdown for the slowest samples, those at or above the p99 latency (one entry per sample, joined on `frame_counter`):

```
p99 breakdown (µs, sample latency / frame K):
cold sample 487 (latency 9200, frame 12340):
  snapshot 4, row_rebuild 120, atlas_upload 0, instance_upload 6, gpu_submit 8100
```

This module is not gating — ship it after the headline metric works.

## Module 9: Makefile target

```makefile
# Expected runtime: ~15s cold + ~25s hot = ~40s total
# Requires: pv (rate limiter for the hot scenario)
bench-input:
	$(ZIG) build -Doptimize=$(OPT)
	WAYSTTY_INPUT_BENCH=both ./zig-out/bin/waystty 2>bench-input.log || true
	@echo "--- input latency ---"
	@grep -A 12 "waystty input latency" bench-input.log || echo "(no timing data found)"
```

Mirrors the existing `bench` / `profile` targets: `OPT` defaults to `ReleaseFast`; output in `bench-input.log`.

## Files changed

- `src/bench_input.zig` — new. `BenchDriver` struct, sentinel allocator, in-flight sample state, pending-feedback map, stats printer, WSI fallback trigger.
- `src/bench_stats.zig` — add `frame_counter: u64` to `FrameTiming`; increment on each rendered frame.
- `src/main.zig` — env parsing, grid-size lock (shared across bench modes, retrofits existing `WAYSTTY_BENCH`), `xdg_toplevel` min/max size hints, resize-observer abort on mismatch, driver init/tick hookup, PTY child switcheroo, termios verification, scenario sequencer with teardown, sentinel scan in the post-frame hook, bench-keyboard `.key`-suppression gate.
- `src/wayland.zig` — bind `wp_presentation` global, plumb `wp_presentation_feedback` creation and event callbacks.
- `src/renderer.zig` — expose swapchain present call site so `wp_presentation.feedback(surface)` can be called immediately before `vkQueuePresentKHR`.
- `Makefile` — `bench-input` target.

## Testing

- **Unit:** `BenchDriver` grid scan finds the active sentinel at arbitrary row/col positions in a synthetic grid.
- **Unit:** sentinel allocator produces 4096 distinct codepoints before wrapping.
- **Unit:** pair-on-arrival state machine completes when events arrive in either order; advances past `discarded`.
- **Unit:** WSI-fallback trigger fires after >10% of first 50 samples time out and not before.
- **Manual (input bench):** grid is 80×24 at bench start regardless of window size; verify via logging.
- **Manual (existing output bench):** `make bench` still passes after grid-lock retrofit; grid-size line appears in stats output; verify on both floating and tiling compositors (tiling should abort with the diagnostic).
- **Manual:** termios `ECHO` is set on the PTY slave after spawn; verify with a short run of `stty -a` in debug mode.
- **Manual:** `make bench-input` produces cold and hot numbers with cold < hot. Sanity: cold p50 on the order of one refresh interval (wait-to-vsync plus compositor latch) is expected; hot p99 noticeably larger than cold p99 confirms contention is being observed.
- **Manual:** run `WAYSTTY_INPUT_BENCH=cold` against a debug build; verify via logs that each injected sentinel round-trips and matches within a few frames.

## Open implementation risks

- **Binding `presentation-time` (`wp_presentation`) is non-trivial.** Neither the global nor the feedback callbacks exist in the repo. Budget real time for this, including the WSI-commit coordination (feedback request must precede `vkQueuePresentKHR`).
- **Mesa Wayland WSI internals.** Confirm empirically that `wp_presentation.feedback(surface)` called immediately before `vkQueuePresentKHR` attaches to the driver's commit. If not, the Module 7 fallback surfaces it; the longer-term fix is `VK_KHR_present_wait` or a different coordination strategy.
- **`pv` must be available.** It's the rate-limiter for the hot workload. If not installed, the hot scenario aborts with a clear install hint. Listed in Makefile comments.