# Vulkan Bounded Waits Design

## Goal

Stop waystty from hanging when the Vulkan driver drops a fence signal. Replace every unbounded `vkWaitForFences` / `vkAcquireNextImageKHR` / `vkDeviceWaitIdle` / `vkQueueWaitIdle` call with bounded variants that return a recoverable error on timeout. On timeout, the main loop logs the event, skips the frame, and retries on the next iteration. The window stays responsive — keystrokes, resize, and exit continue to work — instead of wedging on the last-presented frame.

Tracks issue `ab6c92f0`. Closes `793f491a` as a duplicate.

## Background

Two freezes were observed on the same daily-driver session (2026-04-18). Both wedged the main thread inside the NVIDIA 595.58.03 driver at:

```
main.runTerminal
  renderer.Context.drawCells
    vk.DeviceWrapper.waitForFences(..., timeout=18446744073709551615)
      libnvidia-glcore internal poll() ← wedged
```

The driver appears to drop a kernel syncobj / sync-fd notification for `in_flight_fence`. The GPU work likely completes; the userspace poll never wakes. With `timeout=UINT64_MAX`, the main loop blocks permanently — no input dispatch, no resize, no exit.

The fix is straightforward: bounded timeouts on every wait, classified errors, frame-skip on timeout. This is not a recovery design — there is no GPU-resource rebuild, no `VK_ERROR_DEVICE_LOST` handling, no exit-after-N-failures. We have not observed `DEVICE_LOST` and we are not designing for it. Those remain open in the ticket as future scope.

## Non-goals

- GPU-resource rebuild on timeout (swapchain, fences, pipeline, atlas, instance buffer).
- `VK_ERROR_DEVICE_LOST` handling. Treated as fatal if it ever fires; not observed.
- Recreating the `VkDevice`. Not observed.
- Exit-after-N-consecutive-timeouts policy. Deferred — if the driver wedges persistently, the main loop will keep retrying with backoff. The user can `Ctrl+C` or close the window because the loop is unblocked between waits.
- Auditing every existing `deviceWaitIdle` / `queueWaitIdle` for whether it should remain unbounded. The grep gate forces every site to declare intent (bounded or shutdown), but the per-site policy review is filed as a follow-up ticket.

## Architecture

A new module `src/vk_sync.zig` exposes the bounded primitives. `Context` and `main.zig` call into this module instead of `self.vkd.waitForFences` etc. directly. A grep gate in the test target fails CI if any source file calls the underlying `vkd` methods outside `vk_sync.zig` itself.

The design separates three concerns:

1. **The bounded helpers.** Five free functions, taking `vkd`, the `device` handle, and the operation-specific args. Stateless, easy to unit test.
2. **The fence-state invariant.** A documented contract about the state of `in_flight_fence` after each error return from `drawCells` / `drawClear` / `renderToOffscreen`, enforced by the order of operations in those functions.
3. **The timeout observability layer.** A counter + rate-limited `std.log.warn` call so persistent timeouts surface without spamming the log.

## Module 1: `src/vk_sync.zig`

Five public functions plus one error set (two errors) and two timeout constants.

```zig
pub const fence_wait_timeout_ns: u64 = 2_000_000_000;   // 2 s
pub const acquire_timeout_ns:    u64 = 100_000_000;     // 100 ms

pub const SyncError = error{ VkWaitTimeout, VkAcquireTimeout };

/// Bounded fence wait. On timeout, returns error.VkWaitTimeout without
/// touching the fence. The fence remains in whatever state it was — caller
/// can safely retry the wait on the next iteration.
pub fn waitFenceBounded(vkd: vk.DeviceWrapper, device: vk.Device, fence: vk.Fence) !void;

/// Bounded image acquire. Returns the acquired image_index on success.
/// Folds VK_SUBOPTIMAL_KHR into error.OutOfDateKHR (matches existing caller
/// semantics, which already collapse the two via swapchainNeedsRebuild).
/// Returns error.VkAcquireTimeout on timeout.
pub fn acquireImageBounded(
    vkd: vk.DeviceWrapper,
    device: vk.Device,
    swapchain: vk.SwapchainKHR,
    semaphore: vk.Semaphore,
) !u32;

/// Bounded device idle. For uses where blocking forever is wrong (recovery
/// arms, mid-frame resync, etc.). Returns error.VkWaitTimeout on timeout.
pub fn waitIdleBounded(vkd: vk.DeviceWrapper, device: vk.Device, timeout_ns: u64) !void;

/// Unbounded device idle, named to make intent obvious at the call site.
/// Use only for shutdown drains where blocking forever is acceptable.
/// The grep gate exempts this function so it is the canonical way to write
/// an intentional unbounded wait.
pub fn waitIdleForShutdown(vkd: vk.DeviceWrapper, device: vk.Device) void;

/// Bounded queue idle. Same shape as waitIdleBounded.
pub fn queueWaitIdleBounded(vkd: vk.DeviceWrapper, queue: vk.Queue, timeout_ns: u64) !void;
```

Why functions and not a wrapper struct: see "Approach: helpers, not dispatch wrapper" below.

`waitIdleForShutdown` returns `void` (not `!void`) because its possible failures during shutdown (`VK_ERROR_DEVICE_LOST`, out-of-memory) are unactionable — log internally and return.
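A sketch of the two hot-path helpers, assuming vulkan-zig-style bindings (the exact wrapper signatures, `Result` tag names, and the `acquireNextImageKHR` return shape are assumptions, not verified against the codebase):

```zig
pub fn waitFenceBounded(vkd: vk.DeviceWrapper, device: vk.Device, fence: vk.Fence) !void {
    // waitForFences reports timeout as a success-class result code, not an
    // error, so it is mapped to a SyncError here.
    const result = try vkd.waitForFences(device, 1, @ptrCast(&fence), vk.TRUE, fence_wait_timeout_ns);
    if (result == .timeout) return error.VkWaitTimeout;
}

pub fn acquireImageBounded(
    vkd: vk.DeviceWrapper,
    device: vk.Device,
    swapchain: vk.SwapchainKHR,
    semaphore: vk.Semaphore,
) !u32 {
    const acq = try vkd.acquireNextImageKHR(device, swapchain, acquire_timeout_ns, semaphore, .null_handle);
    return switch (acq.result) {
        .success => acq.image_index,
        .timeout, .not_ready => error.VkAcquireTimeout,
        .suboptimal_khr => error.OutOfDateKHR, // fold, matching existing caller semantics
        else => error.VkAcquireTimeout,
    };
}
```

`waitIdleBounded` is less direct: `vkDeviceWaitIdle` itself takes no timeout parameter, so a bounded variant has to be built from primitives that do — e.g. submitting an empty, fence-signaling batch to each queue and waiting on those fences with `timeout_ns`. That choice is left to the implementation plan.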

## Module 2: Caller updates

Ten call sites are migrated: all seven `waitForFences` and all three `acquireNextImageKHR` calls in waystty source today, taken from a fresh grep (the original ticket missed three of them):

| Site | File | Lines | New helper |
|---|---|---|---|
| `drawClear` fence wait | `src/renderer.zig` | 1275 | `waitFenceBounded` |
| `drawClear` acquire | `src/renderer.zig` | 1279 | `acquireImageBounded` |
| `uploadAtlasRegion` fence | `src/renderer.zig` | 1478 | `waitFenceBounded` |
| `drawCells` fence wait | `src/renderer.zig` | 1710 | `waitFenceBounded` |
| `drawCells` acquire | `src/renderer.zig` | 1718 | `acquireImageBounded` |
| `renderToOffscreen` first wait | `src/renderer.zig` | 1818 | `waitFenceBounded` |
| `renderToOffscreen` capture wait | `src/renderer.zig` | 1824 | `waitFenceBounded` |
| `renderToOffscreen` post-submit wait | `src/renderer.zig` | 1941 | `waitFenceBounded` |
| `drawTextCoverageCompareFrame` fence | `src/main.zig` | 1364 | `waitFenceBounded` |
| `drawTextCoverageCompareFrame` acquire | `src/main.zig` | 1373 | `acquireImageBounded` |

The 13 `deviceWaitIdle` and 1 `queueWaitIdle` sites (14 total) are migrated mechanically to `waitIdleForShutdown` (preserving existing unbounded behavior). Choosing the *named* shutdown helper documents intent at the call site without changing semantics. A follow-up ticket (filed as part of this work) audits whether any of those should become `waitIdleBounded` instead — the recovery arm at `main.zig:681` is the most suspicious site.
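For concreteness, a before/after sketch of one migration (receiver and field names are assumptions):

```zig
// Before (drawCells, unbounded):
_ = try self.vkd.waitForFences(self.device, 1, @ptrCast(&self.in_flight_fence), vk.TRUE, std.math.maxInt(u64));

// After (bounded; error.VkWaitTimeout propagates to the main loop's timeout arm):
try vk_sync.waitFenceBounded(self.vkd, self.device, self.in_flight_fence);
```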

## Module 3: Fence-state invariant + reorder fix

Today, in both `drawClear` and `drawCells`:

```zig
try waitForFences(in_flight_fence, UINT64_MAX);   // line 1275 / 1710
try resetFences(in_flight_fence);                 // line 1276 / 1711
const acquire = try acquireNextImageKHR(...);     // line 1282 / 1721
```

`resetFences` runs *before* the acquire. With newly-possible acquire timeouts, this introduces a deadlock: if acquire returns `error.VkAcquireTimeout`, the fence is left unsignaled with no submit pending. Every subsequent fence wait then times out forever — the GPU has nothing in flight that will signal it.

Fix: move `resetFences` to *after* a successful acquire and before the submit.

```zig
try vk_sync.waitFenceBounded(...);
const image_index = try vk_sync.acquireImageBounded(...);
try vkd.resetFences(...);            // moved here
// ... submit, present
```

This is sync-safe because `in_flight_fence` tracks submit completion (not acquire); acquire→submit ordering is handled by `image_available` semaphore.

After the reorder, the fence-state invariant per error return is documented at the top of `drawCells` and `drawClear`. **Contract:** the fence is *only* touched (reset, re-signaled) on the post-acquire failure paths — see "Post-acquire failure: fence re-signal" below. On `VkWaitTimeout` and `VkAcquireTimeout`, no GPU state is touched.

| Error returned by `drawCells` / `drawClear` | `in_flight_fence` state |
|---|---|
| `error.VkWaitTimeout` (from `waitFenceBounded`) | Whatever it was: signaled, or unsignaled with prior-frame submit still in flight. Next iteration's wait is safe. |
| `error.VkAcquireTimeout` (from `acquireImageBounded`) | Same as before this frame — reset has not run. Safe. |
| `error.OutOfDateKHR` post-acquire, pre-submit | See "Post-acquire failure" below — fence re-signaled before return. |
| `error.OutOfDateKHR` post-submit (from present) | Signaled by GPU on submit completion. Caller's existing `deviceWaitIdle` + rebuild drains it. |
| Success | Signaled by GPU when submit completes; reset at the top of the next frame. |

### Post-acquire failure: fence re-signal

If `acquireImageBounded` succeeds and `resetFences` runs, but a subsequent operation fails before `queueSubmit` (the command-buffer recording calls between `resetFences` at line 1711 and `queueSubmit` at line 1773 in current `drawCells` — `resetCommandBuffer`, `beginCommandBuffer`, `cmdBeginRenderPass`, `recordDrawCommands`, `cmdEndRenderPass`, `endCommandBuffer`), the fence is unsignaled with no submit pending. To preserve the invariant, `drawCells` / `drawClear` re-signal the fence before returning. Concretely: in the `errdefer` path between `resetFences` and `queueSubmit`, submit a no-op (empty submit info with `signal_fence = in_flight_fence`) to put the fence back in the signaled state. Implementation note for the plan: this is one helper invocation, not five lines per site.
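A sketch of that helper, assuming it lives in `vk_sync.zig` (`vk.SubmitInfo` field names and defaults are assumptions):

```zig
/// Re-signal an unsignaled fence by submitting an empty batch that signals it.
/// Called on the post-acquire, pre-submit error path so the next frame's
/// waitFenceBounded does not time out forever.
pub fn resignalFence(vkd: vk.DeviceWrapper, queue: vk.Queue, fence: vk.Fence) void {
    const submit = vk.SubmitInfo{
        .wait_semaphore_count = 0,
        .command_buffer_count = 0,
        .signal_semaphore_count = 0,
    };
    vkd.queueSubmit(queue, 1, @ptrCast(&submit), fence) catch |err| {
        // Unactionable at this point; the bounded waits keep the loop live either way.
        std.log.warn("resignalFence: empty submit failed: {}", .{err});
    };
}
```

One call-site caveat worth carrying into the plan: a bare `errdefer` cannot be cancelled, so scoping the re-signal to only the window between `resetFences` and `queueSubmit` (and not post-submit error paths) needs a flag or block-scoped error handling.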

**`vkResetFences` failure**: per the Vulkan spec, the only failure mode for `vkResetFences` is `VK_ERROR_OUT_OF_DEVICE_MEMORY` — extremely rare. If this happens, the fence is in an undefined state and an image has been acquired with no submit pending. Treat as fatal: propagate the error from `drawCells` / `drawClear`, do not attempt re-signal (the device is in a degraded state and a no-op submit is unlikely to succeed). The main loop will see this as a non-recoverable error and exit.

`renderToOffscreen` has a related but different issue: it waits on `in_flight_fence` *before* `uploadInstances` to prevent host-overwrites of `instance_memory` while the GPU is still reading. With timeouts, the wait can return early. The function must propagate `error.VkWaitTimeout` *before* calling `uploadInstances`. Order is preserved (`waitFenceBounded` → check error → `uploadInstances`); no re-signal needed since `renderToOffscreen` doesn't reset `in_flight_fence`.

## Module 4: Atlas upload timeout

`uploadAtlasRegion` waits on `atlas_transfer_fence` before recording the upload command buffer. On timeout, the function returns `error.VkWaitTimeout` without modifying any GPU state.

The atlas region's CPU-side dirty flag is left set automatically by the existing call structure: in `main.zig:591–605`, `atlas.dirty = false` runs *after* `try ctx.uploadAtlasRegion(...)`, so the `try` propagates the timeout error before the dirty flag is cleared. (`font.Atlas.dirty` is just a `bool` — there is no explicit retry mechanism in the Atlas struct itself; the retry relies entirely on the caller's `try`-propagates-before-side-effects ordering.) On the next render iteration, the upload retries. If the underlying driver wedge persists, `uploadAtlasRegion` will keep timing out and `drawCells` will keep skipping frames; this is the same outcome as a `drawCells` fence timeout.

Caller (in `main.zig`) treats `error.VkWaitTimeout` from atlas upload identically to `drawCells` timeout: log via `logVkTimeout`, skip the render iteration, mark frame dirty.

## Module 5: Logging

Single function in `src/vk_sync.zig`:

```zig
var vk_timeout_count: std.atomic.Value(u64) = .init(0);
var last_log_ns: std.atomic.Value(i128) = .init(0);

const TimeoutKind = enum { fence, acquire, atlas };

pub fn logVkTimeout(src: std.builtin.SourceLocation, kind: TimeoutKind) void {
    const n = vk_timeout_count.fetchAdd(1, .monotonic) + 1;
    const now = std.time.nanoTimestamp();
    const last = last_log_ns.load(.monotonic);
    if (n == 1 or (now - last) > 5 * std.time.ns_per_s) {
        last_log_ns.store(now, .monotonic);
        std.log.warn("vk timeout #{} ({s}) at {s}:{d} — driver may be wedged",
            .{ n, @tagName(kind), src.file, src.line });
    }
}
```

First occurrence logs immediately. Subsequent occurrences log at most once per 5 seconds with a running count. This makes a one-off flake visible (one line, no spam) and a persistent wedge observable (one line every 5 s with a growing counter), without flooding stderr in either case.

The 5-second window is a heuristic — short enough that "is it still happening?" is answerable within a few seconds, long enough that a 60Hz wedge produces ~1 log line per 5 s instead of ~300.

## Module 6: Backoff on persistent timeout

In `main.zig`, the timeout arm tracks `consecutive_vk_timeouts: u32`:

```zig
error.VkWaitTimeout, error.VkAcquireTimeout => {
    vk_sync.logVkTimeout(@src(), .fence); // the real code passes .acquire for the acquire arm
    consecutive_vk_timeouts += 1;
    const backoff_us: u64 = @min(@as(u64, consecutive_vk_timeouts) * 5_000, 100_000); // cap 100 ms; widened to avoid u32 overflow
    std.time.sleep(backoff_us * std.time.ns_per_us);
    render_pending = true;
    continue;
},
// On any successful drawCells:
consecutive_vk_timeouts = 0;
```

5 ms first miss, 10 ms second, … capped at 100 ms after ~20 consecutive misses. Prevents a tight retry loop from burning a CPU core when the GPU is genuinely wedged. No exit, no recovery — just don't spin.

**Frame-loop interaction**: the `std.time.sleep` runs inside the timeout-arm `continue` path, before the next iteration's `waitFrame` call. It does not interact with `frame_loop.zig`'s `pending_token` / `armed` state — the sleep simply delays re-entry into the main loop, after which the normal frame-loop gating runs as usual. No `forceArm` needed; the sleep is orthogonal.

## Approach: helpers, not dispatch wrapper

We considered three consolidation strategies:

- **(α) Inline pattern.** Same code repeated at each site. Rejected — original ticket missed three sites; a structural fix is needed.
- **(β) Free-function helpers + grep gate.** Adopted.
- **(γ) Dispatch-table wrapper.** Wrap `vk.DeviceWrapper` in a struct that hides timeout-accepting methods. Rejected after research.

Research into production Vulkan projects found no precedent for γ. wgpu (`wgpu-hal/src/vulkan/device.rs`) and Dawn (`QueueVk.cpp`) both expose helper-style abstractions with timeouts as first-class arguments, but neither hides `UINT64_MAX` at the type level — `Option<Duration>::None` and `Nanoseconds` newtypes still resolve to a raw `u64`. Vulkan-Hpp, the canonical "type-safe" C++ wrapper, declined to add a `std::chrono::duration` overload for `waitForFences` at all. There is no clang-tidy check, no validation-layer rule, and no Zig-binding precedent for wrapping the dispatch table.

β + grep-gate matches every production Vulkan project surveyed and avoids being a category of one. The grep gate covers the same gap γ would have prevented, at a fraction of the abstraction cost.

The one ergonomic borrow from γ is the `waitIdleForShutdown` named helper. This puts intent at the call site without requiring a comment, and replaces what would otherwise be a `// vk-unbounded-ok: shutdown` exemption convention.

## Module 7: Grep gate

A shell script `tests/check_unbounded_vk.sh` runs as part of `zig build test`. It uses `rg --multiline --pcre2` (multi-line is required — `main.zig:1364` proves the existing `waitForFences` call spans 7 lines) to flag any of these identifier patterns outside the exempt files:

- `\bvkd\.waitForFences\b`
- `\bvkd\.acquireNextImageKHR\b`
- `\bvkd\.deviceWaitIdle\b`
- `\bvkd\.queueWaitIdle\b`

**Exempt files:**

- `src/vk_sync.zig` — the helper module itself (contains inline `test` blocks per codebase convention).

The script exits non-zero if any match is found in non-exempt files, printing each match with file:line. No `// vk-unbounded-ok:` comment exemption — every legitimate unbounded wait goes through `waitIdleForShutdown` (already exempt as the canonical path). If a future case truly needs a raw call, it should add a method to `vk_sync.zig`, not bypass the gate.

The choice of shell script over a Zig AST walker is deliberate: four identifier patterns, one rg invocation, no parsing complexity, and the implementation is ~10 lines.
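A self-contained sketch of the gate's core logic, demonstrated against a throwaway tree so it runs anywhere (the real script uses `rg --multiline --pcre2` per above; plain `grep -E` stands in here, and the file names are fabricated for the demo):

```shell
#!/usr/bin/env sh
# Sketch of tests/check_unbounded_vk.sh: flag raw vkd waits outside src/vk_sync.zig.
set -eu
tmp=$(mktemp -d)
mkdir -p "$tmp/src"
printf 'try self.vkd.waitForFences(dev);\n' > "$tmp/src/renderer.zig"   # should be flagged
printf 'vkd.waitForFences(dev, timeout);\n' > "$tmp/src/vk_sync.zig"    # exempt

matches=$(grep -rnE 'vkd\.(waitForFences|acquireNextImageKHR|deviceWaitIdle|queueWaitIdle)' \
    "$tmp/src" --exclude=vk_sync.zig || true)

if [ -n "$matches" ]; then
    echo "unbounded Vulkan wait outside src/vk_sync.zig:"
    echo "$matches"
    status=1
else
    status=0
fi
rm -rf "$tmp"
echo "gate status: $status"
```

Running the sketch prints the `renderer.zig` hit and `gate status: 1`; the real script would `exit "$status"` so the test step fails.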

## Testing

Three layers:

**Unit tests — inline `test` blocks in `src/vk_sync.zig`** (following the existing codebase convention; all other Zig modules put `test "..."` blocks inline in the source file rather than in a separate `tests/` directory).

The Vulkan-touching tests require a live `vk.DeviceWrapper`, which isn't available in pure unit tests. To keep the helpers testable without a Vulkan context, the testable surface is the non-Vulkan logic:

- `logVkTimeout` emits exactly one log line when called 100 times in a tight loop (rate-limit working) — pure logic, atomics + `std.log.warn`.
- `logVkTimeout` emits a second log line if called again after 5+ seconds (can be unit-tested with a mockable clock, or by refactoring the 5s threshold into a parameter for the test and passing a small duration).
- Constants (`fence_wait_timeout_ns`, `acquire_timeout_ns`) are public and have the expected values.
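A sketch of what the rate-limit test could look like once the decision logic is factored out (`shouldLog` is a hypothetical extraction of the condition inside `logVkTimeout`, not existing code):

```zig
const std = @import("std");

/// Hypothetical pure core of logVkTimeout's rate limit: log on the first
/// occurrence, then at most once per window.
fn shouldLog(n: u64, now_ns: i128, last_ns: i128, window_ns: i128) bool {
    return n == 1 or (now_ns - last_ns) > window_ns;
}

test "rate limit: first hit logs, burst stays quiet, window reopens" {
    const w = 5 * std.time.ns_per_s;
    try std.testing.expect(shouldLog(1, 0, 0, w));
    try std.testing.expect(!shouldLog(2, 1_000, 0, w)); // inside the window
    try std.testing.expect(shouldLog(3, w + 1, 0, w));  // window elapsed
}
```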

The Vulkan-touching behavior (`waitFenceBounded` returning `VkWaitTimeout` on a never-signaled fence, `acquireImageBounded` folding `suboptimal_khr` into `OutOfDateKHR`) is verified by inspection of the bindings' documented return values plus integration testing against the live renderer (see below). A full Vulkan-context test harness is out of scope for v1.

**Integration test (manual or scripted):** simulate the wedge by submitting a fence that intentionally never signals (e.g., wait on a semaphore that's never signaled), then run a few frames of `drawCells`. Verify: no hang, log lines appear with backoff, main loop continues, Wayland input dispatch keeps working. This is hard to fully automate without a Vulkan mock layer; it can be scripted as a sanity check the human runs by hand.

**Grep gate:** the test described in Module 7. Runs on every `zig build test`.

## Implementation order

A plan-stage detail, but worth flagging: the reorder fix (Module 3) and the helper introduction (Module 1) should land in separate commits. The reorder is a correctness fix that makes sense in isolation; the helpers stack on top.

Suggested commit order:

1. Add `src/vk_sync.zig` with helpers + tests. Wire it into `build.zig` (see "Build wiring" below). No callers changed yet.
2. Reorder `resetFences` to after acquire in `drawClear` and `drawCells`. No new errors yet — keep `UINT64_MAX` for one commit. Add the post-acquire `errdefer` re-signal helper.
3. Migrate the 10 timeout sites to bounded helpers. Add the timeout error arms in `main.zig`.
4. Migrate the 14 `deviceWaitIdle` / `queueWaitIdle` sites to `waitIdleForShutdown`.
5. Add the grep gate test.
6. Add logging + backoff in `main.zig`.

### Build wiring

`src/vk_sync.zig` is imported by both `src/renderer.zig` and `src/main.zig`. It needs to be a separate module in `build.zig`, similar to how `cell_instance` is wired (build.zig:311). Concretely: create `vk_sync_mod` with `addImport("vulkan", vulkan_module)`; have `renderer_mod`, `exe_mod`, `main_test_mod`, `renderer_test_mod`, and `capture_mod` each `addImport("vk_sync", vk_sync_mod)`. Also create a `vk_sync_test_mod` (same root_source_file, separate module instance for the inline tests) and wire it to the `test` step via `b.addTest({.root_module = vk_sync_test_mod})`, matching the pattern used by `png_tests` (build.zig:308).

The grep gate `tests/check_unbounded_vk.sh` is invoked from `build.zig`'s `test` step via `b.addSystemCommand(&.{ "tests/check_unbounded_vk.sh" })` and added as a step dependency.
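A sketch of the `build.zig` additions, using the module names from the description above (`vulkan_module`, `renderer_mod`, `exe_mod`, `test_step` are assumptions and may not match the real file exactly):

```zig
// New module, importable from renderer and main.
const vk_sync_mod = b.createModule(.{
    .root_source_file = b.path("src/vk_sync.zig"),
});
vk_sync_mod.addImport("vulkan", vulkan_module);
renderer_mod.addImport("vk_sync", vk_sync_mod);
exe_mod.addImport("vk_sync", vk_sync_mod);

// Separate module instance for the inline tests, mirroring png_tests.
const vk_sync_test_mod = b.createModule(.{
    .root_source_file = b.path("src/vk_sync.zig"),
});
vk_sync_test_mod.addImport("vulkan", vulkan_module);
const vk_sync_tests = b.addTest(.{ .root_module = vk_sync_test_mod });
test_step.dependOn(&b.addRunArtifact(vk_sync_tests).step);

// Grep gate as a system command on the same step.
const gate = b.addSystemCommand(&.{"tests/check_unbounded_vk.sh"});
test_step.dependOn(&gate.step);
```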

## Ticket cleanup

- **Close `793f491a`** as duplicate of `ab6c92f0` (older by ~1 minute, identical body).
- **Open follow-up ticket** "Audit `deviceWaitIdle` / `queueWaitIdle` for unbounded blocks" — referenced by this design and seeded with the 14 sites that get mechanically migrated to `waitIdleForShutdown` in Module 2. Each site should be reviewed for whether the unbounded behavior is actually correct (most likely yes for `deinit`; the recovery arm at `main.zig:681` is the most suspicious site).
- Move `ab6c92f0` from `backlog` to `planning` after this spec lands; to `dev` when implementation starts.