docs/superpowers/specs/2026-04-17-gpu-render-testing-design.md
# GPU Render Testing Design
Automated visual regression and performance regression testing for the Vulkan rendering pipeline.
## Problem
There is no automated way to verify that waystty's GPU output is correct, or to detect rendering performance regressions. The frame timing instrumentation measures *how fast* frames render but not *what* they render. Smoke tests validate subsystem init but not visual output.
## Goals
- Detect visual rendering regressions automatically (wrong colors, missing glyphs, shifted cells, broken attributes)
- Detect performance regressions in the rendering pipeline
- Keep the test infrastructure local-only, zero external dependencies beyond Zig
- Produce actionable failure output (diff images, timing comparisons)
## Non-Goals
- CI/headless testing (requires real GPU + compositor)
- Perceptual/SSIM diffing (RMSE is sufficient for terminal content)
- External compositor screenshots (captures happen in-process via Vulkan readback)
- Multi-scale testing (tests fix scale to 1x; scale-variant coverage is future work)
- Cross-machine reproducibility (baselines are inherently local; font is pinned by system fontconfig)
## Design
### 1. Capture Mode (`--capture <script> <output.png>`)
A new CLI mode that renders a VT script and captures the final frame as a PNG.
**Rendering target: dedicated offscreen image, not the swapchain.**
The capture frame is rendered to a waystty-owned `VkImage` created with `TRANSFER_SRC_BIT | COLOR_ATTACHMENT_BIT`, separate from the swapchain. This avoids two problems:
- Swapchain images are created without `TRANSFER_SRC_BIT` and sit in `PRESENT_SRC_KHR` layout after present — can't be copied from without layout dance + reacquisition
- Compositor behavior (damage, alpha handling) can in principle alter presented pixels
The offscreen image uses the same format as the swapchain (`B8G8R8A8_UNORM`) and a dedicated single-use render pass + framebuffer.
**Fixed window configuration.**
Capture mode forces a known, deterministic environment:
- **Window size:** fixed 80 columns × 24 rows at scale=1 (approximately `80*cell_width` × `24*cell_height` px)
- **Scale:** forced to 1x; any compositor-reported scale is ignored in capture mode
- **Font:** inherited from `src/config.zig` (Monaspace Argon @ 16px, resolved via system fontconfig). Not user-overridable without a code change; treated as pinned for local use.
If the compositor sizes the window differently, capture mode rejects the frame and fails with an explicit error.
**Flow:**
1. waystty starts normally (Wayland surface, Vulkan pipeline, PTY)
2. The child shell is replaced by `cat <script>` piped into the PTY
3. **Wait for visibility:** block until first `xdg_surface.configure` + first frame callback complete. If this doesn't happen within 3 seconds, exit with error (e.g. window was spawned hidden)
4. **Script playback:** the `cat` child streams the script bytes into the PTY; it exits at EOF
5. **Drain:** after child exits, keep polling the PTY read end until it returns no data for two consecutive 20ms poll cycles (drains any VT output still pending in the kernel buffer)
6. **Settle:** run the event loop for one additional 50ms tick to let the VT parser process everything
7. **Render final frame to offscreen image:** one synchronous render pass targeting the offscreen `VkImage`, with a fence we wait on
8. **Readback:** `vkCmdCopyImageToBuffer` from the offscreen image (in `TRANSFER_SRC_OPTIMAL` layout) to a host-visible staging buffer, second fence-wait
9. **Write PNG:** raw BGRA pixel data → PNG (BGRA→RGBA swap, sRGB-encoded bytes; PNG gets an sRGB chunk). Alpha channel is forced to 255 (opaque) — compositor surface uses `opaque_bit_khr` composite-alpha so source alpha is meaningless
10. **Dump timings:** frame timing stats to stderr (reusing existing benchmark output)
11. Exit cleanly
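The drain step (5) is the subtle part of the sequence above. A minimal sketch in Python (the real implementation lives in `src/main.zig`'s event loop; the function name and buffer size here are illustrative):

```python
import os
import select

def drain_pty(master_fd: int, poll_ms: int = 20, quiet_needed: int = 2) -> bytes:
    """Read pending VT output until `quiet_needed` consecutive polls see no data."""
    buf = bytearray()
    quiet = 0
    while quiet < quiet_needed:
        ready, _, _ = select.select([master_fd], [], [], poll_ms / 1000.0)
        if not ready:
            quiet += 1          # one empty 20ms poll cycle
            continue
        chunk = os.read(master_fd, 4096)
        if not chunk:           # EOF: writer side closed, nothing more can arrive
            break
        buf.extend(chunk)
        quiet = 0               # data arrived, restart the quiet count
    return bytes(buf)
```

The two-consecutive-empty-polls rule tolerates the kernel delivering PTY output in bursts after the child has already exited.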
**Code locations:**
- Arg parsing + visibility wait + quiescence sequencing + exit-after-capture: `src/main.zig`
- Offscreen image creation + readback + PNG write: `src/renderer.zig`
### 2. Golden Image Comparison
A standalone Zig tool that compares two PNG images and reports pass/fail.
**Comparison logic:**
- Load actual and reference PNGs (RGBA8)
- Hard fail if dimensions differ (means grid size or scale drifted — a structural change, not a threshold miss)
- **RMSE:** for each pixel, compute Euclidean distance across RGB channels (alpha ignored; capture forces opaque). Normalize each channel to [0,1] before distance, so `distance = sqrt((dR² + dG² + dB²) / 3)` per pixel. RMSE is the root-mean-square of those per-pixel distances across the whole image
- **Per-pixel max:** track the worst single-pixel distance (same normalization)
- **Pass criteria (both must hold):**
- RMSE ≤ `WAYSTTY_TEST_RMSE_MAX` (default 0.005, i.e. 0.5%)
- Max per-pixel distance ≤ `WAYSTTY_TEST_PIXEL_MAX` (default 0.125, i.e. 32/255)
- Defaults are tuned empirically during initial reference generation — re-running on the same machine without code changes should pass with margin. If it doesn't, thresholds get loosened before shipping
- On failure: write a diff image and report both values
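The comparison math can be sketched in Python (the shipping tool is Zig; the function name is illustrative, and the defaults mirror `WAYSTTY_TEST_RMSE_MAX` / `WAYSTTY_TEST_PIXEL_MAX`):

```python
import math

def compare_images(actual: bytes, reference: bytes,
                   rmse_max: float = 0.005, pixel_max: float = 0.125):
    """Both buffers are same-dimension RGBA8; alpha is ignored (capture forces opaque)."""
    assert len(actual) == len(reference), "dimension mismatch is a hard fail upstream"
    sum_sq, worst_sq = 0.0, 0.0
    n = len(actual) // 4
    for i in range(0, len(actual), 4):
        # squared per-pixel distance, each RGB channel normalized to [0, 1]
        d_sq = sum(((actual[i + c] - reference[i + c]) / 255.0) ** 2
                   for c in range(3)) / 3.0
        sum_sq += d_sq
        worst_sq = max(worst_sq, d_sq)
    rmse = math.sqrt(sum_sq / n)          # root-mean-square of per-pixel distances
    worst = math.sqrt(worst_sq)           # worst single pixel
    return rmse <= rmse_max and worst <= pixel_max, rmse, worst
```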
**Diff image:** side-by-side layout (`[actual | reference | delta]`), where `delta` is a grayscale heatmap of per-pixel RGB distance (bright = divergent). Easier to eyeball than a pure overlay.
**Failure output:**
```
FAIL: tests/golden/reference/bold_colors.png
  RMSE: 1.2% (max 0.5%)
  worst pixel: 18.0% (max 12.5%)
  diff: tests/golden/output/bold_colors.diff.png
  actual: tests/golden/output/bold_colors.png
```
**Why Zig, not ImageMagick/Python:**
- Zero external deps — project is pure Zig
- Same build system — `zig build test-render` builds and runs everything
**PNG implementation:** vendored minimal RGBA8 encoder/decoder (~300 lines). zigimg would work, but its breadth is overkill for a format we use in exactly one byte layout. Vendoring the code into `src/png.zig` keeps the dep graph small.
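To show roughly what that vendored encoder has to cover, here is a minimal sketch in Python (the Zig version would use the same chunk layout; the fixed filter-0 choice and function name are illustrative):

```python
import struct
import zlib

def encode_png_rgba(width: int, height: int, pixels: bytes) -> bytes:
    """Minimal PNG encode: 8-bit RGBA, filter 0 on every row, no interlace."""
    def chunk(tag: bytes, data: bytes) -> bytes:
        return (struct.pack(">I", len(data)) + tag + data
                + struct.pack(">I", zlib.crc32(tag + data)))

    ihdr = struct.pack(">IIBBBBB", width, height,
                       8,         # bit depth
                       6,         # color type: RGBA
                       0, 0, 0)   # compression, filter method, interlace
    stride = width * 4
    raw = b"".join(b"\x00" + pixels[y * stride:(y + 1) * stride]  # filter byte 0 per row
                   for y in range(height))
    return (b"\x89PNG\r\n\x1a\n"
            + chunk(b"IHDR", ihdr)
            + chunk(b"sRGB", b"\x00")            # rendering intent: perceptual
            + chunk(b"IDAT", zlib.compress(raw))
            + chunk(b"IEND", b""))
```

Decode is the mirror image plus unfiltering, which is why ~300 lines covers both directions for this one pixel layout.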
**Code locations:**
- Image comparison executable: `src/tools/imgdiff.zig` (new build target in `build.zig`)
- PNG encode/decode: `src/png.zig` (shared by renderer capture + imgdiff)
### 3. VT Test Scripts
Curated escape sequences, one per rendering feature:
| Script | Exercises |
|--------|-----------|
| `basic_ascii.vt` | Full printable ASCII range, default colors |
| `bold_colors.vt` | Bold, dim, italic, underline + 16 ANSI colors |
| `256_colors.vt` | 256-color palette grid |
| `truecolor.vt` | 24-bit RGB gradients |
| `box_drawing.vt` | Box-drawing and block element characters |
| `cursor_movement.vt` | Cursor positioning, clear, scroll regions |
| `reverse_video.vt` | Reverse video, hidden, strikethrough attributes |
Start with 3–4 scripts initially (`basic_ascii`, `bold_colors`, `box_drawing`), add more as features land.
**Script format convention:**
- Raw bytes, no encoding (scripts are fed verbatim through the PTY)
- Use `\r\n` line endings (PTY defaults make this the common case)
- End each script with cursor-home (`\x1b[H`) so final cursor position is deterministic
- No trailing "capture marker" needed — the capture sequence (child exit → drain → settle → render) handles quiescence
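A hypothetical generator for a `bold_colors`-style script, following the conventions above (the actual `.vt` files are hand-curated raw bytes; this generator and its layout are illustrative only):

```python
def bold_colors_script() -> bytes:
    """Hypothetical generator for a bold_colors-style .vt script."""
    out = bytearray(b"\x1b[2J\x1b[H")  # clear screen, cursor home
    for fg in range(16):
        # SGR 30-37 for the first 8 ANSI colors, 90-97 for the bright set
        code = 30 + fg if fg < 8 else 90 + (fg - 8)
        out += b"\x1b[1;%dm color%02d \x1b[0m" % (code, fg)
        if fg % 4 == 3:
            out += b"\r\n"  # explicit CRLF, per the convention above
    out += b"\x1b[H"  # end on cursor-home so the final cursor position is deterministic
    return bytes(out)
```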
**Directory layout:**
```
tests/golden/
├── scripts/          # VT input files
│   ├── basic_ascii.vt
│   ├── bold_colors.vt
│   └── ...
├── reference/        # Approved golden PNGs (checked into git)
│   ├── basic_ascii.png
│   ├── bold_colors.png
│   └── ...
└── output/           # Generated by test run (gitignored)
    ├── basic_ascii.png
    ├── basic_ascii.diff.png
    └── ...
```
### 4. Performance Regression Detection
Leverages the existing FrameTimingRing. The `--capture` mode gets timing data for free, but the capture workload is too short for stable p99s. Perf regression uses the existing `WAYSTTY_BENCH=1` scripted workload (`src/main.zig:~216`).
**Baseline file comparison:**
- `make bench-baseline` runs the benchmark workload, writes stats to `tests/bench/baseline.json`
- `make bench-check` runs the same workload, compares against baseline
**Baseline schema:**
```json
{
  "workload_sha": "sha256 of the exact bench shell script",
  "zig_version": "0.15.0",
  "waystty_sha": "git HEAD at capture time",
  "frame_count": 256,
  "sections": {
    "snapshot":        {"min": 8,   "avg": 98,  "p99": 21,  "max": 266},
    "row_rebuild":     {"min": 109, "avg": 761, "p99": 603, "max": 1572},
    "atlas_upload":    {"min": 0,   "avg": 0,   "p99": 0,   "max": 0},
    "instance_upload": {"min": 13,  "avg": 51,  "p99": 19,  "max": 122},
    "gpu_submit":      {"min": 30,  "avg": 73,  "p99": 90,  "max": 100}
  }
}
```
All timings in microseconds.
**Regression detection:**
- Each section's p99 is compared independently
- p99 must not increase by more than `WAYSTTY_BENCH_REGRESSION_PCT` (default 20%)
- Any single section exceeding the threshold triggers a failure, even if other sections improved
- On `workload_sha` mismatch: warn loudly (not fail) — means the bench script changed, baseline should be regenerated
- On `frame_count` mismatch of more than 20%: warn (statistical power differs enough to make the comparison unreliable)
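The per-section check reduces to a small loop; a Python sketch (the shipping check is Zig, and the function name is illustrative; the threshold mirrors `WAYSTTY_BENCH_REGRESSION_PCT`):

```python
def check_regressions(baseline: dict, current: dict, pct_max: float = 20.0) -> list:
    """Compare p99 per section; any section past the threshold is a failure."""
    failures = []
    for name, base in baseline["sections"].items():
        base_p99 = base["p99"]
        cur_p99 = current["sections"][name]["p99"]
        # a zero baseline (e.g. atlas_upload) can't express a percentage increase
        delta_pct = ((cur_p99 - base_p99) / base_p99 * 100.0) if base_p99 else 0.0
        verdict = "REGRESSION" if delta_pct > pct_max else "OK"
        print(f"bench: {name} p99 {cur_p99}us (baseline {base_p99}us) "
              f"{delta_pct:+.1f}% {verdict}")
        if verdict == "REGRESSION":
            failures.append(name)
    return failures
```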
**Why p99:** Averages hide spikes. A change adding occasional 5ms stalls to `gpu_submit` won't move the average but will tank perceived smoothness.
**Output:**
```
bench: snapshot p99 21us (baseline 19us) +10.5% OK
bench: row_rebuild p99 603us (baseline 580us) +3.9% OK
bench: gpu_submit p99 290us (baseline 90us) +222.2% REGRESSION
```
**Baseline management:**
- `baseline.json` is checked into git — the "known good" perf profile for this machine
- After intentional perf changes, `make bench-baseline` to update
- Machine-specific by nature — local-only tool, so this is fine
### 5. Makefile Targets
| Target | Action |
|--------|--------|
| `make test-render` | Run all `.vt` scripts via `--capture`, diff against goldens. Continues on failure; summarizes at end; non-zero exit if any failed. Orchestrated by a Zig tool (`zig build test-render`), not a shell loop, for consistency with the rest of the build. |
| `make golden-update` | Run all captures, copy `output/` to `reference/`. Used after visual verification of intentional changes. |
| `make bench-baseline` | Save current perf profile to `tests/bench/baseline.json` |
| `make bench-check` | Run benchmark workload, compare against baseline, flag regressions |
### 6. Error Handling
Capture mode is a developer tool; failure mode clarity matters. Explicit failures (stderr + non-zero exit):
| Condition | Exit code / message |
|-----------|---------------------|
| Script file not found | `capture: script not found: <path>` |
| Output directory unwritable | `capture: cannot write output: <path>: <errno>` |
| Window never becomes visible within 3s | `capture: window not visible after 3s (compositor hidden window?)` |
| Compositor sized window wrong | `capture: window size mismatch; expected 80x24 cells, got NxM` |
| Vulkan readback fence times out | `capture: GPU readback timed out (10s)` |
| PNG write fails | `capture: png encode failed: <reason>` |
All errors exit with status ≥ 2 (reserving 1 for generic failure). The orchestrator (`zig build test-render`) treats any non-zero exit from `--capture` as a test failure with the error message surfaced.
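The orchestrator's continue-on-failure semantics can be modeled in Python (the real orchestrator is the Zig `test-render` tool; the binary paths and helper name here are assumptions for illustration):

```python
import subprocess
from pathlib import Path

def test_render(waystty: str = "./zig-out/bin/waystty",
                imgdiff: str = "./zig-out/bin/imgdiff",
                root: str = "tests/golden") -> int:
    """Run every .vt through --capture, diff against its golden; continue on failure."""
    failed = []
    for script in sorted(Path(root, "scripts").glob("*.vt")):
        name = script.stem
        out_png = str(Path(root, "output", f"{name}.png"))
        ref_png = str(Path(root, "reference", f"{name}.png"))
        capture = subprocess.run([waystty, "--capture", str(script), out_png])
        if capture.returncode != 0:   # any non-zero --capture exit is a test failure
            failed.append(name)
            continue
        if subprocess.run([imgdiff, out_png, ref_png]).returncode != 0:
            failed.append(name)
    print(f"test-render: {len(failed)} failed" if failed else "test-render: all passed")
    return 1 if failed else 0
```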
### 7. Dependencies
- No new external dependencies
- Vendored minimal PNG encoder/decoder at `src/png.zig`
## Workflow
1. Make a rendering change
2. `make test-render` — see if anything visually regressed
3. If a test fails, inspect the diff image in `tests/golden/output/*.diff.png`
4. If the change is intentional, `make golden-update` to approve new baselines
5. `make bench-check` — see if anything got slower
6. If perf changed intentionally, `make bench-baseline` to update
## Implementation Order
1. Offscreen render target + Vulkan readback + minimal PNG encode (`src/png.zig` + `src/renderer.zig`)
2. `--capture` CLI mode with full sequencing (visibility wait → playback → drain → settle → render → readback → write) (`src/main.zig`)
3. End-to-end smoke: write one trivial `.vt`, run `--capture`, eyeball the PNG
4. Image comparison tool (`src/tools/imgdiff.zig` + PNG decode)
5. Initial VT test scripts (`basic_ascii`, `bold_colors`, `box_drawing`)
6. Test orchestrator + Makefile targets (`test-render`, `golden-update`)
7. Generate and commit initial golden reference images; tune thresholds if needed
8. Benchmark baseline infrastructure (`bench-baseline`, `bench-check`) + initial `baseline.json`
## Open Housekeeping
The working tree has stray test binaries at repo root (`test_io`, `test_io2`, `test_io3`, `test_sig`, `test_timer`) that should be gitignored or relocated before `tests/` lands, to keep the test namespace clean.