docs/superpowers/plans/2026-04-18-vulkan-bounded-waits.md
# Vulkan Bounded Waits Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Stop waystty from hanging when the NVIDIA driver drops a Vulkan fence signal. Replace unbounded `waitForFences` / `acquireNextImageKHR` / `deviceWaitIdle` / `queueWaitIdle` calls with bounded variants that return recoverable errors, and add a grep gate so future code can't regress.
**Architecture:** A new module `src/vk_sync.zig` exposes five helpers (`waitFenceBounded`, `acquireImageBounded`, `waitIdleBounded`, `waitIdleForShutdown`, `queueWaitIdleBounded`) plus a rate-limited logger. 10 blocking-wait call sites migrate to the helpers; 14 `*WaitIdle` sites migrate mechanically to `waitIdleForShutdown`. A shell script grep gate in `tests/check_unbounded_vk.sh` fails CI if any source file bypasses the helpers. On timeout, the main loop logs via the rate-limited logger, applies exponential backoff (capped at 100 ms), and retries the frame on the next iteration.
**Tech Stack:** Zig 0.15+, vulkan-zig bindings (`vk.DeviceWrapper`), waystty's existing build.zig module graph. No new dependencies.
**Source spec:** `docs/superpowers/specs/2026-04-18-vulkan-bounded-waits-design.md`
**Tracks:** git-collab issue `ab6c92f0`. Closes `793f491a` as duplicate.
---
## Task 1: Create `vk_sync` module with helpers + inline tests
**Files:**
- Create: `src/vk_sync.zig`
- Modify: `build.zig:311-321` (add `vk_sync_mod` similar to `cell_instance_mod`)
- [ ] **Step 1: Create `src/vk_sync.zig` with constants, helpers, and inline tests**
Write the full file:
```zig
//! Bounded Vulkan synchronization primitives.
//!
//! Replaces unbounded vkWaitForFences / vkAcquireNextImageKHR / vkDeviceWaitIdle /
//! vkQueueWaitIdle calls. The helpers here are the ONLY path callers should use
//! for blocking Vulkan operations — the grep gate in tests/check_unbounded_vk.sh
//! enforces this at CI time.
//!
//! Motivation: NVIDIA driver 595 occasionally drops a fence signal, wedging
//! vkWaitForFences(UINT64_MAX) forever. See docs/superpowers/specs/
//! 2026-04-18-vulkan-bounded-waits-design.md for the full story.
const std = @import("std");
const vk = @import("vulkan");
pub const fence_wait_timeout_ns: u64 = 2_000_000_000; // 2s
pub const acquire_timeout_ns: u64 = 100_000_000; // 100ms
pub const SyncError = error{ VkWaitTimeout, VkAcquireTimeout };
/// Bounded fence wait. Returns error.VkWaitTimeout on timeout without touching
/// the fence. Caller may safely retry on the next iteration.
pub fn waitFenceBounded(
vkd: vk.DeviceWrapper,
device: vk.Device,
fence: vk.Fence,
) !void {
const result = try vkd.waitForFences(device, 1, @ptrCast(&fence), .true, fence_wait_timeout_ns);
if (result == .timeout) return error.VkWaitTimeout;
}
/// Bounded image acquire. Returns the acquired image_index on success.
/// Folds VK_SUBOPTIMAL_KHR into error.OutOfDateKHR (matches existing callers,
/// which already collapse the two via swapchainNeedsRebuild).
/// Returns error.VkAcquireTimeout on VK_TIMEOUT or VK_NOT_READY.
pub fn acquireImageBounded(
vkd: vk.DeviceWrapper,
device: vk.Device,
swapchain: vk.SwapchainKHR,
semaphore: vk.Semaphore,
) !u32 {
const acquire = vkd.acquireNextImageKHR(
device,
swapchain,
acquire_timeout_ns,
semaphore,
.null_handle,
) catch |err| switch (err) {
error.OutOfDateKHR => return error.OutOfDateKHR,
else => return err,
};
switch (acquire.result) {
.timeout, .not_ready => return error.VkAcquireTimeout,
.suboptimal_khr => return error.OutOfDateKHR,
.success => return acquire.image_index,
else => return acquire.image_index, // unexpected but non-error; trust the image_index
}
}
/// Bounded device-idle wait. For mid-flight resyncs where blocking forever
/// would be wrong. Returns error.VkWaitTimeout on timeout.
pub fn waitIdleBounded(vkd: vk.DeviceWrapper, device: vk.Device, timeout_ns: u64) !void {
// vkDeviceWaitIdle has no timeout parameter — we emulate by waiting on a
// newly-created fence submitted as a no-op, then waiting with our timeout.
// This is the minimum-cost approximation; for cases that need true idle,
// callers should use waitIdleForShutdown.
_ = timeout_ns;
_ = vkd;
_ = device;
@compileError("waitIdleBounded: not used in v1, left as a stub. Remove this compileError and implement the fence-based emulation if a caller appears.");
}
/// Unbounded device-idle wait, named to make shutdown-drain intent obvious
/// at the call site. Logs (but swallows) device-lost on shutdown since it is
/// unactionable.
pub fn waitIdleForShutdown(vkd: vk.DeviceWrapper, device: vk.Device) void {
vkd.deviceWaitIdle(device) catch |err| {
std.log.warn("waitIdleForShutdown: {s}", .{@errorName(err)});
};
}
/// Bounded queue-idle wait. Same shape as waitIdleBounded.
pub fn queueWaitIdleBounded(vkd: vk.DeviceWrapper, queue: vk.Queue, timeout_ns: u64) !void {
_ = queue;
_ = timeout_ns;
_ = vkd;
@compileError("queueWaitIdleBounded: not used in v1, left as a stub. Remove this compileError and implement if a caller appears.");
}
// --- logging ---
const TimeoutKind = enum { fence, acquire, atlas };
var vk_timeout_count: std.atomic.Value(u64) = .init(0);
var last_log_ns: std.atomic.Value(i128) = .init(0);
const log_window_ns: i128 = 5 * std.time.ns_per_s;
pub fn logVkTimeout(src: std.builtin.SourceLocation, kind: TimeoutKind) void {
const n = vk_timeout_count.fetchAdd(1, .monotonic) + 1;
const now = std.time.nanoTimestamp();
const last = last_log_ns.load(.monotonic);
if (n == 1 or (now - last) > log_window_ns) {
last_log_ns.store(now, .monotonic);
std.log.warn(
"vk timeout #{} ({s}) at {s}:{d} — driver may be wedged",
.{ n, @tagName(kind), src.file, src.line },
);
}
}
// --- test helpers (internal; exposed only for inline tests) ---
fn resetLogStateForTesting() void {
vk_timeout_count.store(0, .monotonic);
last_log_ns.store(0, .monotonic);
}
// --- tests ---
test "constants have expected values" {
try std.testing.expectEqual(@as(u64, 2_000_000_000), fence_wait_timeout_ns);
try std.testing.expectEqual(@as(u64, 100_000_000), acquire_timeout_ns);
}
test "logVkTimeout rate-limits to one line per window" {
// We can't easily capture std.log.warn output, but we can verify the
// counter and last_log_ns state transitions match the rate-limit logic.
resetLogStateForTesting();
// First call always logs.
logVkTimeout(@src(), .fence);
try std.testing.expectEqual(@as(u64, 1), vk_timeout_count.load(.monotonic));
const t1 = last_log_ns.load(.monotonic);
try std.testing.expect(t1 > 0);
// Immediate second call: counter increments, last_log_ns stays (within 5s window).
logVkTimeout(@src(), .fence);
try std.testing.expectEqual(@as(u64, 2), vk_timeout_count.load(.monotonic));
try std.testing.expectEqual(t1, last_log_ns.load(.monotonic));
// 100 more calls in tight loop: counter grows, last_log_ns still stays.
for (0..100) |_| logVkTimeout(@src(), .fence);
try std.testing.expectEqual(@as(u64, 102), vk_timeout_count.load(.monotonic));
try std.testing.expectEqual(t1, last_log_ns.load(.monotonic));
}
test "logVkTimeout re-fires after simulated window elapsed" {
resetLogStateForTesting();
logVkTimeout(@src(), .acquire);
const t1 = last_log_ns.load(.monotonic);
// Simulate window expiry by rewinding last_log_ns past the 5s threshold.
last_log_ns.store(t1 - 6 * std.time.ns_per_s, .monotonic);
logVkTimeout(@src(), .acquire);
const t2 = last_log_ns.load(.monotonic);
try std.testing.expect(t2 > t1 - 6 * std.time.ns_per_s);
try std.testing.expectEqual(@as(u64, 2), vk_timeout_count.load(.monotonic));
}
```
Write this to `/home/xanderle/code/rad/waystty/src/vk_sync.zig`.
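When Task 7 eventually implements the stubbed `waitIdleBounded`, the fence-based emulation its comment describes could take roughly this shape. This is an untested sketch, not v1 code: the name `waitIdleBoundedSketch` and the added `queue` parameter are illustrative, and (as the stub comment already notes) it drains one queue rather than the whole device.
```zig
/// Hypothetical Task 7 sketch; not part of v1. Submits an empty batch that
/// signals a fresh fence. Submission order guarantees the fence signals only
/// after all prior work on `queue` completes, which gives us a handle we can
/// wait on with a timeout. Approximates device-idle by draining one queue.
fn waitIdleBoundedSketch(
    vkd: vk.DeviceWrapper,
    device: vk.Device,
    queue: vk.Queue,
    timeout_ns: u64,
) !void {
    const fence = try vkd.createFence(device, &.{}, null);
    defer vkd.destroyFence(device, fence, null);
    const submit_info = vk.SubmitInfo{};
    try vkd.queueSubmit(queue, 1, @ptrCast(&submit_info), fence);
    const result = try vkd.waitForFences(device, 1, @ptrCast(&fence), .true, timeout_ns);
    if (result == .timeout) return error.VkWaitTimeout;
}
```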
- [ ] **Step 2: Wire module into `build.zig`**
After line 321 in `build.zig` (after the `cell_instance_mod` block, before the `// capture module` comment), insert:
```zig
// vk_sync module — bounded Vulkan synchronization helpers
const vk_sync_mod = b.createModule(.{
.root_source_file = b.path("src/vk_sync.zig"),
.target = target,
.optimize = optimize,
});
vk_sync_mod.addImport("vulkan", vulkan_module);
renderer_mod.addImport("vk_sync", vk_sync_mod);
renderer_test_mod.addImport("vk_sync", vk_sync_mod);
exe_mod.addImport("vk_sync", vk_sync_mod);
main_test_mod.addImport("vk_sync", vk_sync_mod);
const vk_sync_test_mod = b.createModule(.{
.root_source_file = b.path("src/vk_sync.zig"),
.target = target,
.optimize = optimize,
});
vk_sync_test_mod.addImport("vulkan", vulkan_module);
const vk_sync_tests = b.addTest(.{ .root_module = vk_sync_test_mod });
test_step.dependOn(&b.addRunArtifact(vk_sync_tests).step);
```
And extend the `capture_mod` block (currently at lines ~323-339) to also import `vk_sync`:
At the end of the `capture_mod` import chain (after `capture_mod.addImport("cell_instance", cell_instance_mod);`), add:
```zig
capture_mod.addImport("vk_sync", vk_sync_mod);
```
- [ ] **Step 3: Build and run tests**
Run: `cd /home/xanderle/code/rad/waystty && zig build test`
Expected: all existing tests pass, plus three new tests:
- `constants have expected values` — PASS
- `logVkTimeout rate-limits to one line per window` — PASS
- `logVkTimeout re-fires after simulated window elapsed` — PASS
The two `@compileError` stubs (`waitIdleBounded`, `queueWaitIdleBounded`) don't fire until something references them, so they don't block the build.
Adding the `vk_sync` import edge to the renderer/main modules before any caller references it is harmless: Zig analyzes imports lazily, so an unused `addImport` cannot fail the build. In this step, only the dedicated `vk_sync` test module actually exercises the new file.
- [ ] **Step 4: Commit**
```bash
cd /home/xanderle/code/rad/waystty
git add src/vk_sync.zig build.zig
git commit -m "$(cat <<'EOF'
Add vk_sync module with bounded Vulkan wait helpers
Introduces src/vk_sync.zig with waitFenceBounded, acquireImageBounded,
waitIdleForShutdown, and a rate-limited logVkTimeout. No callers
migrated yet — follow-up commits migrate the 10 timeout sites and 14
*WaitIdle sites, then land the grep gate.
Part of issue ab6c92f0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
---
## Task 2: Reorder `resetFences` + add errdefer re-signal in `drawClear` and `drawCells`
**Files:**
- Modify: `src/renderer.zig` — `drawClear` (1273-1342), `drawCells` (1694-1800)
**Why:** Currently `resetFences` runs before `acquireNextImageKHR`. When we introduce acquire timeouts (Task 3), a timed-out acquire would leave the fence unsignaled with no submit pending → next frame's fence wait times out forever. Fixing this now, before the timeouts land, keeps the two commits independently reviewable.
We also add an `errdefer` re-signal that covers the post-acquire / pre-submit failure window. On success: no-op. On failure in that window: submit a no-op to re-signal the fence.
**No new error types yet in this commit — we're keeping `UINT64_MAX` intact. This is purely a reorder + errdefer addition.**
- [ ] **Step 1: Add a private helper `resignalFence` on `Context` in `src/renderer.zig`**
Find the `pub fn drawClear` function at line ~1273. Immediately before it, add a private helper:
```zig
/// Submit an empty command batch that signals `fence`. Used as an
/// errdefer recovery path when acquire has succeeded, resetFences has
/// run, but we failed before queueSubmit — we need to put the fence
/// back in the signaled state so the next frame's wait succeeds.
fn resignalFence(self: *Context, fence: vk.Fence) void {
const submit_info = vk.SubmitInfo{};
_ = self.vkd.queueSubmit(
self.graphics_queue,
1,
@ptrCast(&submit_info),
fence,
) catch |err| {
std.log.warn("resignalFence: {s}", .{@errorName(err)});
};
}
```
- [ ] **Step 2: Reorder `drawClear` (renderer.zig:1273-1342)**
Replace lines 1275-1290 (the wait, reset, acquire chain) with:
```zig
// Wait for previous frame to finish
_ = try self.vkd.waitForFences(self.device, 1, @ptrCast(&self.in_flight_fence), .true, std.math.maxInt(u64));
// Acquire next image BEFORE reset, so an acquire failure leaves the
// fence in a safe state (signaled from the prior frame).
const acquire = self.vkd.acquireNextImageKHR(
self.device,
self.swapchain,
std.math.maxInt(u64),
self.image_available,
.null_handle,
) catch |err| switch (err) {
error.OutOfDateKHR => return error.OutOfDateKHR,
else => return err,
};
if (swapchainNeedsRebuild(acquire.result)) return error.OutOfDateKHR;
const image_index = acquire.image_index;
try self.vkd.resetFences(self.device, 1, @ptrCast(&self.in_flight_fence));
errdefer self.resignalFence(self.in_flight_fence);
```
The `errdefer` will fire on any error returned after this line and before `queueSubmit` succeeds. `queueSubmit` signals the fence on GPU completion, so once submit succeeds the fence is correctly signaled (or about to be), and the `errdefer` never runs because the function returns success.
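The reasoning above rests on a Zig scoping rule worth making explicit: an `errdefer` is armed only from its declaration point onward, and it never runs on a success return. A standalone illustration (not project code; assumes `const std = @import("std");`):
```zig
fn demo(fail_early: bool, fail_late: bool) !void {
    if (fail_early) return error.Early; // errdefer below is not yet armed: no recovery runs
    errdefer std.log.warn("recovery ran", .{});
    if (fail_late) return error.Late; // errdefer is armed: recovery runs before the error propagates
    // success return: the errdefer never runs
}
```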
- [ ] **Step 3: Reorder `drawCells` (renderer.zig:1709-1733)**
In `drawCells`, find the block starting at line 1709 ("Wait for previous frame to finish") through line 1733 (the acquire timing-out block). Replace with:
```zig
// Wait for previous frame to finish
_ = try self.vkd.waitForFences(self.device, 1, @ptrCast(&self.in_flight_fence), .true, std.math.maxInt(u64));
if (timing_out) |t| {
t.wait_fences_us = readTimer(&timer);
timer.reset();
}
// Acquire next image BEFORE reset, so an acquire failure leaves the
// fence in a safe state (signaled from the prior frame).
const acquire = self.vkd.acquireNextImageKHR(
self.device,
self.swapchain,
std.math.maxInt(u64),
self.image_available,
.null_handle,
) catch |err| switch (err) {
error.OutOfDateKHR => return error.OutOfDateKHR,
else => return err,
};
if (swapchainNeedsRebuild(acquire.result)) return error.OutOfDateKHR;
const image_index = acquire.image_index;
if (timing_out) |t| {
t.acquire_us = readTimer(&timer);
timer.reset();
}
try self.vkd.resetFences(self.device, 1, @ptrCast(&self.in_flight_fence));
errdefer self.resignalFence(self.in_flight_fence);
```
- [ ] **Step 4: Build and run tests**
```bash
cd /home/xanderle/code/rad/waystty && zig build test
```
Expected: all tests pass. The reorder does not change external behavior because the existing `UINT64_MAX` timeout means no path through the new code can fail in a way the errdefer would catch during normal operation.
- [ ] **Step 5: Smoke-test the binary**
```bash
cd /home/xanderle/code/rad/waystty && zig build && ./zig-out/bin/waystty
```
Type some characters, resize the window, and close. Expected: normal behavior, no crash or visual glitch.
Exit with Ctrl+D or by closing the window.
- [ ] **Step 6: Commit**
```bash
cd /home/xanderle/code/rad/waystty
git add src/renderer.zig
git commit -m "$(cat <<'EOF'
renderer: reorder resetFences after acquire, add errdefer re-signal
Preparatory refactor for bounded acquire timeouts. When acquireNextImageKHR
gains a finite timeout (next commit), the existing ordering (reset → acquire)
would leave in_flight_fence unsignaled with no submit pending on a timeout,
deadlocking future waits. Reorder to (acquire → reset) and add a private
resignalFence helper that the errdefer path uses to cover the tiny
post-reset / pre-submit failure window.
Part of issue ab6c92f0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
---
## Task 3: Migrate 10 wait/acquire call sites to `vk_sync` helpers + handle timeouts in `main.zig`
**Files:**
- Modify: `src/renderer.zig` — 7 sites
- Modify: `src/main.zig` — 3 sites + new timeout error arm
- [ ] **Step 1: Import `vk_sync` in `src/renderer.zig`**
Near the top of `src/renderer.zig`, alongside the other imports (search for `const vk = @import("vulkan");`), add:
```zig
const vk_sync = @import("vk_sync");
```
- [ ] **Step 2: Migrate `drawClear` waits (renderer.zig:1275, 1279)**
In `drawClear` (lines 1273-1342), replace the two blocking calls we reordered in Task 2.
Replace:
```zig
// Wait for previous frame to finish
_ = try self.vkd.waitForFences(self.device, 1, @ptrCast(&self.in_flight_fence), .true, std.math.maxInt(u64));
// Acquire next image BEFORE reset, so an acquire failure leaves the
// fence in a safe state (signaled from the prior frame).
const acquire = self.vkd.acquireNextImageKHR(
self.device,
self.swapchain,
std.math.maxInt(u64),
self.image_available,
.null_handle,
) catch |err| switch (err) {
error.OutOfDateKHR => return error.OutOfDateKHR,
else => return err,
};
if (swapchainNeedsRebuild(acquire.result)) return error.OutOfDateKHR;
const image_index = acquire.image_index;
```
With:
```zig
try vk_sync.waitFenceBounded(self.vkd, self.device, self.in_flight_fence);
const image_index = try vk_sync.acquireImageBounded(self.vkd, self.device, self.swapchain, self.image_available);
```
(`swapchainNeedsRebuild` folding is now done inside `acquireImageBounded`, so the post-call check is gone.)
- [ ] **Step 3: Migrate `uploadAtlasRegion` wait + add errdefer re-signal (renderer.zig:1478-1479)**
Find lines 1478-1479:
```zig
_ = try self.vkd.waitForFences(self.device, 1, @ptrCast(&self.atlas_transfer_fence), .true, std.math.maxInt(u64));
try self.vkd.resetFences(self.device, 1, @ptrCast(&self.atlas_transfer_fence));
```
Replace with:
```zig
try vk_sync.waitFenceBounded(self.vkd, self.device, self.atlas_transfer_fence);
try self.vkd.resetFences(self.device, 1, @ptrCast(&self.atlas_transfer_fence));
errdefer self.resignalFence(self.atlas_transfer_fence);
```
The `errdefer` covers the same post-reset / pre-submit failure window as in `drawCells` / `drawClear` (Task 2). Between line 1479 and the `queueSubmit` at line ~1574, several calls can fail (`mapMemory`, `resetCommandBuffer`, `beginCommandBuffer`, `endCommandBuffer`, `queueSubmit` itself). Without the re-signal, a single such failure would leave `atlas_transfer_fence` unsignaled forever, and every subsequent atlas upload would time out — permanently breaking glyph rendering for the session. The re-signal restores the invariant so the next upload behaves normally.
On wait timeout: we return before the reset runs, so the fence stays in its prior-submit-completed state. The re-signal `errdefer` only fires after reset.
- [ ] **Step 4: Migrate `drawCells` waits (renderer.zig:1710, 1718)**
Same pattern as `drawClear`. Replace:
```zig
// Wait for previous frame to finish
_ = try self.vkd.waitForFences(self.device, 1, @ptrCast(&self.in_flight_fence), .true, std.math.maxInt(u64));
if (timing_out) |t| {
t.wait_fences_us = readTimer(&timer);
timer.reset();
}
// Acquire next image BEFORE reset, so an acquire failure leaves the
// fence in a safe state (signaled from the prior frame).
const acquire = self.vkd.acquireNextImageKHR(
self.device,
self.swapchain,
std.math.maxInt(u64),
self.image_available,
.null_handle,
) catch |err| switch (err) {
error.OutOfDateKHR => return error.OutOfDateKHR,
else => return err,
};
if (swapchainNeedsRebuild(acquire.result)) return error.OutOfDateKHR;
const image_index = acquire.image_index;
if (timing_out) |t| {
t.acquire_us = readTimer(&timer);
timer.reset();
}
```
With:
```zig
try vk_sync.waitFenceBounded(self.vkd, self.device, self.in_flight_fence);
if (timing_out) |t| {
t.wait_fences_us = readTimer(&timer);
timer.reset();
}
const image_index = try vk_sync.acquireImageBounded(self.vkd, self.device, self.swapchain, self.image_available);
if (timing_out) |t| {
t.acquire_us = readTimer(&timer);
timer.reset();
}
```
- [ ] **Step 5: Migrate `renderToOffscreen` waits + add errdefer re-signal for capture_fence (renderer.zig:1818, 1824-1825, 1941)**
Find line 1818:
```zig
_ = try self.vkd.waitForFences(self.device, 1, @ptrCast(&self.in_flight_fence), .true, std.math.maxInt(u64));
```
Replace with:
```zig
try vk_sync.waitFenceBounded(self.vkd, self.device, self.in_flight_fence);
```
**This is the critical ordering requirement from spec Module 3**: the wait-before-uploadInstances must propagate the timeout error before `uploadInstances` (line ~1821) mutates shared state. Since we use `try`, propagation happens before the upload — verify by visual inspection that `uploadInstances` is called on a line AFTER the `try vk_sync.waitFenceBounded`.
Find lines 1824-1825 (wait + reset for capture_fence):
```zig
_ = try self.vkd.waitForFences(self.device, 1, @ptrCast(&self.capture_fence), .true, std.math.maxInt(u64));
try self.vkd.resetFences(self.device, 1, @ptrCast(&self.capture_fence));
```
Replace with:
```zig
try vk_sync.waitFenceBounded(self.vkd, self.device, self.capture_fence);
try self.vkd.resetFences(self.device, 1, @ptrCast(&self.capture_fence));
errdefer self.resignalFence(self.capture_fence);
```
Same rationale as uploadAtlasRegion: reset → record → submit has several failure points, and re-signaling on failure preserves the fence invariant. The capture path is not exercised in everyday use, but the pattern should stay uniform across all wait→reset→submit chains.
Find line 1941:
```zig
_ = try self.vkd.waitForFences(self.device, 1, @ptrCast(&self.capture_fence), .true, std.math.maxInt(u64));
```
Replace with:
```zig
try vk_sync.waitFenceBounded(self.vkd, self.device, self.capture_fence);
```
No re-signal needed here — this is a post-submit wait (waiting for the capture to complete so we can read back the image), not part of a wait→reset→submit chain.
- [ ] **Step 6: Import `vk_sync` in `src/main.zig`**
Near the top of `src/main.zig`, alongside `const vk = @import("vulkan");` (line 9), add:
```zig
const vk_sync = @import("vk_sync");
```
- [ ] **Step 7: Migrate `drawTextCoverageCompareFrame` waits (main.zig:1364, 1373)**
Read lines 1360-1390 of `src/main.zig` first to confirm the current shape.
Find the 7-line call at line 1364:
```zig
_ = try ctx.vkd.waitForFences(
ctx.device,
1,
@ptrCast(&ctx.in_flight_fence),
.true,
std.math.maxInt(u64),
);
```
Replace with:
```zig
try vk_sync.waitFenceBounded(ctx.vkd, ctx.device, ctx.in_flight_fence);
```
Find the acquire at line 1373 (multi-line call — read ~1373-1383 to see it, then replace the full call + any post-check with):
```zig
const image_index = try vk_sync.acquireImageBounded(ctx.vkd, ctx.device, ctx.swapchain, ctx.image_available);
```
If the existing code has a `resetFences` before the acquire, move it to after (same pattern as drawClear/drawCells in Task 2). If this is a bench/smoke path that doesn't reset the fence between frames (i.e. a single-shot call), leave the reset where it was — the invariant only matters across multiple frames.
- [ ] **Step 8: Add timeout error arm in `main.zig` `runTerminal` render loop (main.zig:~679)**
Find the `drawCells` call in `runTerminal` and its error switch at lines 673-690:
```zig
ctx.drawCells(
render_cache.total_instance_count,
.{ @floatFromInt(cell_w), @floatFromInt(cell_h) },
default_bg,
baseline_coverage,
if (is_bench) &submit_timing else null,
) catch |err| switch (err) {
error.OutOfDateKHR => {
_ = try ctx.vkd.deviceWaitIdle(ctx.device);
const buf_w = window.width * @as(u32, @intCast(geom.buffer_scale));
const buf_h = window.height * @as(u32, @intCast(geom.buffer_scale));
try ctx.recreateSwapchain(buf_w, buf_h);
frame_loop.forceArm();
render_pending = true;
continue;
},
else => return err,
};
```
Add a `VkWaitTimeout`/`VkAcquireTimeout` arm. Replace the switch with:
```zig
ctx.drawCells(
render_cache.total_instance_count,
.{ @floatFromInt(cell_w), @floatFromInt(cell_h) },
default_bg,
baseline_coverage,
if (is_bench) &submit_timing else null,
) catch |err| switch (err) {
error.OutOfDateKHR => {
_ = try ctx.vkd.deviceWaitIdle(ctx.device);
const buf_w = window.width * @as(u32, @intCast(geom.buffer_scale));
const buf_h = window.height * @as(u32, @intCast(geom.buffer_scale));
try ctx.recreateSwapchain(buf_w, buf_h);
frame_loop.forceArm();
render_pending = true;
continue;
},
error.VkWaitTimeout, error.VkAcquireTimeout => {
vk_sync.logVkTimeout(@src(), .fence);
render_pending = true;
continue;
},
else => return err,
};
```
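The Architecture note promises exponential backoff capped at 100 ms, but the arm above retries immediately, which will spin at full rate if the driver stays wedged. A hedged sketch of how the arm could grow a backoff (the `vk_backoff_ns` variable is hypothetical and should be reset to 0 after any successful `drawCells`):
```zig
// Hypothetical: declared once alongside runTerminal's other loop state.
var vk_backoff_ns: u64 = 0;

// Inside the timeout arm, before `continue`:
error.VkWaitTimeout, error.VkAcquireTimeout => {
    vk_sync.logVkTimeout(@src(), .fence);
    // Start at 1 ms, double per consecutive timeout, cap at 100 ms.
    vk_backoff_ns = @min(
        if (vk_backoff_ns == 0) std.time.ns_per_ms else vk_backoff_ns * 2,
        100 * std.time.ns_per_ms,
    );
    std.Thread.sleep(vk_backoff_ns);
    render_pending = true;
    continue;
},
```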
- [ ] **Step 9: Handle atlas upload timeout in the atlas upload caller (main.zig:591-606)**
Find the atlas upload block:
```zig
if (atlas.dirty) {
const y_start = atlas.last_uploaded_y;
const y_end = atlas.cursor_y + atlas.row_height;
if (y_start < y_end) {
try ctx.uploadAtlasRegion(
atlas.pixels,
y_start,
y_end,
atlas.needs_full_upload,
);
atlas.last_uploaded_y = atlas.cursor_y;
atlas.needs_full_upload = false;
render_cache.layout_dirty = true;
}
atlas.dirty = false;
}
```
Replace the `try ctx.uploadAtlasRegion(...)` with a `catch` that handles `VkWaitTimeout`:
```zig
if (atlas.dirty) {
const y_start = atlas.last_uploaded_y;
const y_end = atlas.cursor_y + atlas.row_height;
if (y_start < y_end) {
ctx.uploadAtlasRegion(
atlas.pixels,
y_start,
y_end,
atlas.needs_full_upload,
) catch |err| switch (err) {
error.VkWaitTimeout => {
vk_sync.logVkTimeout(@src(), .atlas);
render_pending = true;
continue;
},
else => return err,
};
atlas.last_uploaded_y = atlas.cursor_y;
atlas.needs_full_upload = false;
render_cache.layout_dirty = true;
}
atlas.dirty = false;
}
```
The `continue` re-enters the render loop. `atlas.dirty` stays `true` because the `atlas.dirty = false` line is below the failing path.
- [ ] **Step 10: Build and run tests**
```bash
cd /home/xanderle/code/rad/waystty && zig build test
```
Expected: all tests pass.
- [ ] **Step 11: Smoke-test the binary**
```bash
cd /home/xanderle/code/rad/waystty && zig build && ./zig-out/bin/waystty
```
Type characters, resize, type more, close. Expected: no visible behavior change.
- [ ] **Step 12: Commit**
```bash
cd /home/xanderle/code/rad/waystty
git add src/renderer.zig src/main.zig
git commit -m "$(cat <<'EOF'
Migrate 10 Vulkan wait/acquire sites to bounded helpers
Replaces unbounded waitForFences/acquireNextImageKHR calls in drawClear,
drawCells, uploadAtlasRegion, renderToOffscreen (three waits), and
drawTextCoverageCompareFrame with vk_sync.waitFenceBounded /
acquireImageBounded. Adds VkWaitTimeout/VkAcquireTimeout error arms in
main.zig that log via vk_sync.logVkTimeout, mark the frame dirty, and
retry on the next loop iteration. Atlas upload timeouts propagate the
dirty flag via the existing pre-assign guard.
Part of issue ab6c92f0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
---
## Task 4: Migrate 14 `*WaitIdle` sites to `waitIdleForShutdown`
**Files:**
- Modify: `src/renderer.zig` — `deviceWaitIdle` at lines 1136, 1262; `queueWaitIdle` at line 1458
- Modify: `src/main.zig` — `deviceWaitIdle` at lines 424, 462, 479, 681, 717, 2501, 2536, 2551, 2637, 2647, 3082
14 sites total: 13 `deviceWaitIdle` + 1 `queueWaitIdle`. All are currently unbounded; migrating to `waitIdleForShutdown` (which wraps `deviceWaitIdle`) preserves behavior while putting intent at the call site.
The `queueWaitIdle` at renderer.zig:1458 currently uses a queue handle, not a device handle. Since we only have a `waitIdleForShutdown` for devices (not queues — `queueWaitIdleBounded` was stubbed as unused), migrate this one to use the device-wide form instead (`waitIdleForShutdown(self.vkd, self.device)`). This is a behavior widening — waiting on the device is stricter than waiting on one queue, but this code path is a shutdown/deinit-ish drain and the wider wait is safe. Verify by reading the surrounding context.
- [ ] **Step 1: Read the `queueWaitIdle` site for safety check**
Read `/home/xanderle/code/rad/waystty/src/renderer.zig` lines 1445-1475.
Confirm that the `queueWaitIdle` at line 1458 is in a teardown/drain context where widening to `deviceWaitIdle` is acceptable. If it's in a hot path, stop and raise — the plan may need adjustment.
If it's in a clearly-shutdown context (e.g., followed by destroy calls), continue.
- [ ] **Step 2: Migrate `deviceWaitIdle` sites in `src/renderer.zig`**
Find line 1136:
```zig
_ = self.vkd.deviceWaitIdle(self.device) catch {};
```
Replace with:
```zig
vk_sync.waitIdleForShutdown(self.vkd, self.device);
```
Find line 1262:
```zig
_ = try self.vkd.deviceWaitIdle(self.device);
```
Replace with:
```zig
vk_sync.waitIdleForShutdown(self.vkd, self.device);
```
(Note: `waitIdleForShutdown` returns `void`, not `!void`, and logs errors internally. The `try` goes away; any error is swallowed with a log line. This matches the existing `catch {}` behavior at line 1136, and changes line 1262 from propagating a deinit error to logging it; the latter is more appropriate during teardown.)
- [ ] **Step 3: Migrate `queueWaitIdle` site in `src/renderer.zig`**
Find line 1458:
```zig
try self.vkd.queueWaitIdle(self.graphics_queue);
```
Replace with:
```zig
vk_sync.waitIdleForShutdown(self.vkd, self.device);
```
Again, dropping the `try` — errors are swallowed.
- [ ] **Step 4: Migrate all `deviceWaitIdle` sites in `src/main.zig`**
Per the spec, **all 11 sites migrate mechanically** to `waitIdleForShutdown`. The follow-up ticket (Task 7) audits which of these should actually become bounded later; for now, the named helper preserves the existing unbounded behavior while documenting intent at the call site and satisfying the grep gate.
Run this grep to list the current sites (line numbers may have shifted during earlier tasks):
```bash
cd /home/xanderle/code/rad/waystty && grep -n "ctx.vkd.deviceWaitIdle" src/main.zig
```
Expected: 11 matches (originally at 424, 462, 479, 681, 717, 2501, 2536, 2551, 2637, 2647, 3082 — possibly shifted by ±a few).
For each line, replace:
```zig
_ = try ctx.vkd.deviceWaitIdle(ctx.device);
```
With:
```zig
vk_sync.waitIdleForShutdown(ctx.vkd, ctx.device);
```
(Preserve the original leading indentation. The pattern is identical on every site per the pre-plan grep output — one-liner assignment discarded via `_ =` with a `try`.)
`sed` one-shot, since the pattern is uniform and all sites migrate:
```bash
cd /home/xanderle/code/rad/waystty
sed -i 's|_ = try ctx\.vkd\.deviceWaitIdle(ctx\.device);|vk_sync.waitIdleForShutdown(ctx.vkd, ctx.device);|g' src/main.zig
```
After the migration, verify no raw calls remain:
```bash
grep -n "ctx.vkd.deviceWaitIdle" src/main.zig
```
Expected: zero results.
**Note on the recovery arms (former lines 424, 462, 479, 681)**: These are mid-flight paths (scale change, resize, OutOfDateKHR), not shutdown drains. Migrating to `waitIdleForShutdown` keeps them unbounded — if the driver wedges here, we still hang. This is a known limitation tracked by the follow-up ticket opened in Task 7; the task there is to implement `waitIdleBounded` (currently stubbed) and migrate these four specific sites to it. For v1, the behavior-preserving migration is intentional.
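Before touching `src/main.zig`, it is cheap to dry-run the substitution on a throwaway file and confirm it rewrites the call while preserving indentation (the temp file and its single line are illustrative):
```shell
tmp=$(mktemp)
printf '        _ = try ctx.vkd.deviceWaitIdle(ctx.device);\n' > "$tmp"
sed -i 's|_ = try ctx\.vkd\.deviceWaitIdle(ctx\.device);|vk_sync.waitIdleForShutdown(ctx.vkd, ctx.device);|g' "$tmp"
cat "$tmp"   # prints:         vk_sync.waitIdleForShutdown(ctx.vkd, ctx.device);
rm -f "$tmp"
```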
- [ ] **Step 5: Build and run tests**
```bash
cd /home/xanderle/code/rad/waystty && zig build test
```
Expected: all tests pass. The build may fail if dropping a `try` changes a surrounding function's inferred error set so it no longer returns an error union, which breaks any caller that still applies `try` to it. If so, that site genuinely depended on the `!void`: re-examine it, and either keep the original unbounded call there (the grep gate in Task 5 will flag it for follow-up) or adjust the callers.
- [ ] **Step 6: Smoke-test the binary**
```bash
cd /home/xanderle/code/rad/waystty && zig build && ./zig-out/bin/waystty
```
Expected: no visible change.
Resize the window multiple times (exercises the `OutOfDateKHR` recovery path, which now drains via `waitIdleForShutdown` at line ~681). Expected: resize works normally.
- [ ] **Step 7: Commit**
```bash
cd /home/xanderle/code/rad/waystty
git add src/renderer.zig src/main.zig
git commit -m "$(cat <<'EOF'
Migrate 14 *WaitIdle sites to vk_sync.waitIdleForShutdown
Mechanically migrates 13 deviceWaitIdle + 1 queueWaitIdle sites to
the named waitIdleForShutdown helper. Preserves existing unbounded
behavior while documenting intent at the call site and satisfying
the upcoming grep gate.
Four of these sites are mid-flight recovery paths (scale change,
resize, OutOfDateKHR) rather than shutdown drains; those are flagged
in a follow-up ticket for migration to a bounded variant once
waitIdleBounded is implemented.
Part of issue ab6c92f0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
---
## Task 5: Add grep gate shell script + wire to build.zig test step
**Files:**
- Create: `tests/check_unbounded_vk.sh`
- Modify: `build.zig` (add the script to the test step)
The grep gate fails CI if any source file outside `src/vk_sync.zig` calls `vkd.waitForFences`, `vkd.acquireNextImageKHR`, `vkd.deviceWaitIdle`, or `vkd.queueWaitIdle` directly. After Task 4 migrates all 14 sites to `waitIdleForShutdown` (which wraps `deviceWaitIdle` inside `vk_sync.zig`), the gate will pass with zero violations. No allowlist mechanism is needed.
- [ ] **Step 1: Create `tests/check_unbounded_vk.sh`**
Write the full script:
```bash
#!/usr/bin/env bash
# Grep gate: fail if any source file outside src/vk_sync.zig calls
# Vulkan blocking primitives directly. All such calls must go through
# src/vk_sync.zig helpers (waitFenceBounded, acquireImageBounded,
# waitIdleForShutdown, etc.).
set -euo pipefail
REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
cd "$REPO_ROOT"
PATTERNS=(
  'vkd\.waitForFences'
  'vkd\.acquireNextImageKHR'
  'vkd\.deviceWaitIdle'
  'vkd\.queueWaitIdle'
)
# Find all .zig files in src/ except vk_sync.zig
mapfile -t files < <(find src -name '*.zig' ! -path 'src/vk_sync.zig' | sort)
violations=0
violation_lines=()
for file in "${files[@]}"; do
  for pat in "${PATTERNS[@]}"; do
    while IFS= read -r hit; do
      violation_lines+=("$file:$hit")
      violations=$((violations + 1))
    done < <(grep -nE "$pat" "$file" || true)
  done
done
if [[ $violations -gt 0 ]]; then
  echo "ERROR: $violations unbounded Vulkan wait call(s) found outside src/vk_sync.zig:" >&2
  for line in "${violation_lines[@]}"; do
    echo "  $line" >&2
  done
  echo "" >&2
  echo "Use src/vk_sync.zig helpers instead:" >&2
  echo "  waitFenceBounded    — replace vkd.waitForFences" >&2
  echo "  acquireImageBounded — replace vkd.acquireNextImageKHR" >&2
  echo "  waitIdleForShutdown — replace vkd.deviceWaitIdle / queueWaitIdle" >&2
  exit 1
fi
echo "vk grep gate: ok (${#files[@]} files, no violations)"
```
Then make it executable:
```bash
chmod +x /home/xanderle/code/rad/waystty/tests/check_unbounded_vk.sh
```
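The gate's matching loop can be sanity-checked in isolation before wiring it into the build. A minimal sketch against a throwaway tree (the scratch files below are hypothetical, not waystty sources):

```shell
#!/usr/bin/env bash
set -euo pipefail
tmp=$(mktemp -d)
mkdir -p "$tmp/src"
# Allowed: the helper module itself may mention the raw call.
echo '// wraps vkd.waitForFences' > "$tmp/src/vk_sync.zig"
# Violation: a raw blocking call outside vk_sync.zig.
echo 'self.vkd.waitForFences(1, &fence, vk.TRUE, ns);' > "$tmp/src/renderer.zig"
cd "$tmp"
violations=0
# Same file selection the real gate uses: every .zig under src/ except vk_sync.zig.
mapfile -t files < <(find src -name '*.zig' ! -path 'src/vk_sync.zig' | sort)
for file in "${files[@]}"; do
  while IFS= read -r hit; do
    violations=$((violations + 1))
  done < <(grep -nE 'vkd\.waitForFences' "$file" || true)
done
echo "violations: $violations"
```

Only `src/renderer.zig` is scanned, so the count comes out as 1; excluding `src/vk_sync.zig` at the `find` level is what makes an allowlist mechanism unnecessary.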
- [ ] **Step 2: Wire into `build.zig` test step**
Near the bottom of the `build()` function in `build.zig`, after all the other `test_step.dependOn(...)` calls (search for the last `test_step.dependOn`), add:
```zig
const check_unbounded_vk = b.addSystemCommand(&.{ "tests/check_unbounded_vk.sh" });
test_step.dependOn(&check_unbounded_vk.step);
```
- [ ] **Step 3: Run the grep gate in isolation**
```bash
cd /home/xanderle/code/rad/waystty && ./tests/check_unbounded_vk.sh
```
Expected output:
```
vk grep gate: ok (N files, no violations)
```
Where N is the count of `.zig` files in `src/` minus 1 (for `vk_sync.zig`).
If it fails, the message lists the violating sites. Any match indicates a raw Vulkan wait call that Task 3 or Task 4 missed — migrate it to the appropriate helper.
- [ ] **Step 4: Run the full test suite**
```bash
cd /home/xanderle/code/rad/waystty && zig build test
```
Expected: all tests pass, grep gate reports "vk grep gate: ok".
- [ ] **Step 5: Verify the gate catches a regression**
Add this line to any function body in `src/renderer.zig` (any `.zig` file outside `src/vk_sync.zig` works):
```zig
_ = self.vkd.waitForFences; // REGRESSION TEST — DELETE
```
Run `zig build test`. Expected: build fails or grep gate reports one violation pointing at that line.
Remove the regression line. Run `zig build test` again. Expected: passes.
- [ ] **Step 6: Commit**
```bash
cd /home/xanderle/code/rad/waystty
git add tests/check_unbounded_vk.sh build.zig
git commit -m "$(cat <<'EOF'
Add grep gate that forbids unbounded Vulkan waits outside vk_sync
tests/check_unbounded_vk.sh scans src/ for direct calls to
vkd.waitForFences / acquireNextImageKHR / deviceWaitIdle /
queueWaitIdle and fails CI if any appear outside src/vk_sync.zig.
Wired into zig build test.
Part of issue ab6c92f0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
---
## Task 6: Add backoff to the timeout error arm
**Files:**
- Modify: `src/main.zig` — the `error.VkWaitTimeout, error.VkAcquireTimeout` arm in `runTerminal`
- [ ] **Step 1: Add a `consecutive_vk_timeouts` counter**
In `src/main.zig` `runTerminal`, find the render loop start. Locate the existing loop-state variables (e.g. `var render_pending: bool = false;` near the top of `runTerminal`). Add alongside them:
```zig
var consecutive_vk_timeouts: u32 = 0;
```
- [ ] **Step 2: Modify the timeout arm to apply backoff and reset on success**
Find the block added in Task 3 Step 8:
```zig
error.VkWaitTimeout, error.VkAcquireTimeout => {
    vk_sync.logVkTimeout(@src(), .fence);
    render_pending = true;
    continue;
},
```
Replace with:
```zig
error.VkWaitTimeout, error.VkAcquireTimeout => {
    vk_sync.logVkTimeout(@src(), .fence);
    consecutive_vk_timeouts +|= 1;
    const backoff_us: u64 = @min(@as(u64, consecutive_vk_timeouts) * 5_000, 100_000);
    std.Thread.sleep(backoff_us * std.time.ns_per_us);
    render_pending = true;
    continue;
},
```
(`+|=` is saturating-add, avoiding overflow if somehow a billion timeouts accumulate.)
Also update the atlas timeout arm from Task 3 Step 9 to participate in the same backoff:
```zig
error.VkWaitTimeout => {
    vk_sync.logVkTimeout(@src(), .atlas);
    consecutive_vk_timeouts +|= 1;
    const backoff_us: u64 = @min(@as(u64, consecutive_vk_timeouts) * 5_000, 100_000);
    std.Thread.sleep(backoff_us * std.time.ns_per_us);
    render_pending = true;
    continue;
},
```
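The backoff grows by 5 ms per consecutive timeout and saturates at 100 ms from the 20th timeout onward. The progression can be checked with shell arithmetic mirroring the Zig expression (illustrative only):

```shell
# Mirrors @min(@as(u64, consecutive_vk_timeouts) * 5_000, 100_000), in microseconds.
for n in 1 2 3 10 19 20 21 50; do
  v=$((n * 5000))
  backoff_us=$(( v < 100000 ? v : 100000 ))
  printf 'timeout #%d -> %d us\n' "$n" "$backoff_us"
done
# 5000, 10000, 15000, 50000, 95000, 100000, 100000, 100000
```

A wedged driver therefore costs at most ~10 retry wakeups per second once the cap is reached, while a one-off dropped signal retries after only 5 ms.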
- [ ] **Step 3: Reset the counter on success**
Find where a successful `drawCells` frame completes. In `runTerminal`, the successful path falls through the `catch` switch and eventually reaches `frame_ring.push(frame_timing);` or similar (around line 698). Reset the counter on the line immediately after the `};` that closes the `drawCells` switch:
```zig
consecutive_vk_timeouts = 0;
```
In context:
```zig
}) catch |err| switch (err) {
    // ... existing arms ...
    error.VkWaitTimeout, error.VkAcquireTimeout => {
        // ...
        continue;
    },
    else => return err,
};
consecutive_vk_timeouts = 0; // <-- add this
frame_timing.gpu_submit_us = usFromTimer(&section_timer);
```
(Verify the exact location by reading main.zig lines 689-700.)
- [ ] **Step 4: Build and run tests**
```bash
cd /home/xanderle/code/rad/waystty && zig build test
```
Expected: all tests pass.
- [ ] **Step 5: Smoke-test**
```bash
cd /home/xanderle/code/rad/waystty && zig build && ./zig-out/bin/waystty
```
Type, resize, close. Expected: no visible behavior change (since we never trigger the timeout path in normal operation).
- [ ] **Step 6: Commit**
```bash
cd /home/xanderle/code/rad/waystty
git add src/main.zig
git commit -m "$(cat <<'EOF'
main: add linear backoff on Vulkan timeout retries
If the driver is genuinely wedged, the timeout-retry loop would spin
at full speed. Track consecutive VkWaitTimeout / VkAcquireTimeout
events and sleep 5ms * N (capped at 100ms) before retrying. Counter
resets on any successful frame.
Closes issue ab6c92f0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
---
## Task 7: Ticket bookkeeping
**Files:**
- No source files modified.
Close duplicate ticket, open follow-up, move `ab6c92f0` to review state once patch is filed.
- [ ] **Step 1: Sync collab state**
```bash
cd /home/xanderle/code/rad/waystty && git-collab sync
```
Expected: "Sync complete." No errors.
- [ ] **Step 2: Close the duplicate ticket**
```bash
cd /home/xanderle/code/rad/waystty && git-collab issue close 793f491a --reason "[claude 2026-04-18] Duplicate of ab6c92f0 (older by ~1 minute, identical body). Closing this one; work tracked on ab6c92f0."
git-collab sync
```
- [ ] **Step 3: Open the follow-up ticket for the mid-flight deviceWaitIdle audit**
```bash
cd /home/xanderle/code/rad/waystty
NEW=$(git-collab issue open \
--title "Audit mid-flight waitIdleForShutdown sites for bounded semantics" \
--body "[claude 2026-04-18] Opened as follow-up to ab6c92f0 (Vulkan bounded waits).
The bounded-waits work migrated all 14 deviceWaitIdle / queueWaitIdle
sites to vk_sync.waitIdleForShutdown (which wraps vkd.deviceWaitIdle
with an unbounded wait and swallows errors to a log line). This
preserved existing behavior while satisfying the grep gate.
However, four of those sites are mid-flight recovery paths — not
shutdown drains — where the driver could wedge exactly the way
ab6c92f0 documented:
- src/main.zig (scale_pending arm) — before rebuildFaceForScale
- src/main.zig (resize arm, grid-changed branch) — before recreateSwapchain
- src/main.zig (resize arm, grid-unchanged branch) — before recreateSwapchain
- src/main.zig (OutOfDateKHR arm in drawCells catch) — before recreateSwapchain
All four wait unbounded while the driver may be wedged. If the hang
mode from ab6c92f0 fires during a resize or scale change, we'll freeze
again with the same symptoms.
Scope of this ticket:
1. Implement vk_sync.waitIdleBounded (currently a @compileError stub).
The fence-based emulation: submit a no-op to the graphics queue,
wait on its fence with a timeout. Return error.VkWaitTimeout on
timeout.
2. Migrate the four mid-flight sites above from waitIdleForShutdown to
waitIdleBounded (keep the other 10 shutdown/init sites as-is).
3. In each caller, handle the new timeout error: log, skip the recovery
step, retry next iteration. The existing recreateSwapchain path is
already robust to being called repeatedly.
Priority: low. The NVIDIA 595 driver flake observed on 2026-04-18 was
a fence-wait wedge, not a resize/scale wedge. Resize/scale paths have
not been observed to hang. File so the work isn't lost." \
| awk '{print $NF}')
git-collab issue label "$NEW" backlog
git-collab sync
echo "Opened follow-up: $NEW"
```
- [ ] **Step 4: Move ab6c92f0 to review state**
At this point, all the implementation work is committed on `main` (or a branch). The issue should transition `backlog → planning → dev → review`.
```bash
cd /home/xanderle/code/rad/waystty
git-collab issue unlabel ab6c92f0 backlog
git-collab issue label ab6c92f0 review
git-collab issue comment ab6c92f0 --body "[claude 2026-04-18] Implementation complete. Spec: docs/superpowers/specs/2026-04-18-vulkan-bounded-waits-design.md. Plan: docs/superpowers/plans/2026-04-18-vulkan-bounded-waits.md. Six commits on main (or feature branch; see git log). Follow-up for the OutOfDateKHR recovery arm is the new ticket opened by this work."
git-collab sync
```
(If work is on a feature branch, file a patch with `git-collab patch create --fixes ab6c92f0 ...` instead of directly labeling `review`. See the git-collab skill docs.)
---
## Self-Review Checklist
After all tasks complete, verify:
- [ ] **Spec coverage.** Every module in the spec is implemented somewhere:
- Module 1 (vk_sync.zig helpers) — Task 1
- Module 2 (caller updates, 10 sites) — Task 3
- Module 3 (fence-state invariant + reorder) — Task 2
- Module 4 (atlas timeout policy) — Task 3 Step 9
- Module 5 (logging) — Task 1 (logger) + Task 3 (call sites)
- Module 6 (backoff) — Task 6
- Module 7 (grep gate) — Task 5
- [ ] **Type consistency.** The helpers are `vk_sync.waitFenceBounded`, `vk_sync.acquireImageBounded`, `vk_sync.waitIdleForShutdown`, `vk_sync.logVkTimeout` — same names used in every task.
- [ ] **No placeholders.** No TBDs, no "implement later" in any task. Every code snippet is the code the engineer types.
- [ ] **`consecutive_vk_timeouts` is reset on success, not only at function start.** Verify Step 3 of Task 6 places the reset after the drawCells switch returns successfully.
- [ ] **Atlas path uses the same `consecutive_vk_timeouts` counter as the draw path.** Task 6 Step 2 updates both arms to `+|= 1`.
- [ ] **Grep gate passes with zero violations after Task 4 migration.** No allowlist needed; every raw `vkd.*` call is now inside `src/vk_sync.zig`.
- [ ] **All six commits follow the spec's suggested order.** Task 1 → Task 2 → Task 3 → Task 4 → Task 5 → Task 6 → (Task 7 is collab bookkeeping, not a source commit).
---
## Rollback
If any commit breaks the build or causes regressions:
```bash
cd /home/xanderle/code/rad/waystty
git log --oneline -10
# identify the offending commit SHA
git revert <SHA>
```
The plan is ordered so each commit is independently revertable: Task 1 (module only, no callers) is always safe to revert. Task 2 (reorder) is safe to revert before Task 3. Task 3 and later can be reverted individually as long as later tasks are reverted first (LIFO).
If the whole thing needs to come out at once: revert the six commits in reverse order, or use `git reset --hard <pre-work-SHA>` if the work is on a branch and nothing depends on it yet (confirm with the user before hard-reset).
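If the whole stack comes out via reverts, the LIFO order can be expressed as a single range revert: given a range, `git revert` applies the reverts newest-first. A self-contained sketch in a scratch repository (file and commit names are hypothetical):

```shell
set -euo pipefail
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email dev@example.com
git config user.name dev
echo base > f.txt && git add f.txt && git commit -qm 'base'
echo one >> f.txt && git commit -qam 'task 1'
echo two >> f.txt && git commit -qam 'task 2'
# Revert both task commits, newest first, in one command:
git revert --no-edit HEAD~2..HEAD >/dev/null
cat f.txt   # back to just "base"
```

Each revert lands as its own commit, so history stays auditable; `git reset --hard` remains the option when the work is on a private branch.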