a73x

fc9f9849

Add performance benchmarking and incremental atlas upload specs and plans

a73x   2026-04-10 10:17

Design specs and implementation plans for:
- Per-section frame timing with ring buffer, SIGUSR1 stats dump,
  WAYSTTY_BENCH workload, and perf/flamegraph Makefile targets
- Incremental atlas upload with ASCII precompute, dirty-region
  tracking, persistent staging buffer, and fence-based sync

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

diff --git a/docs/superpowers/plans/2026-04-10-incremental-atlas-upload-implementation.md b/docs/superpowers/plans/2026-04-10-incremental-atlas-upload-implementation.md
new file mode 100644
index 0000000..82878de
--- /dev/null
+++ b/docs/superpowers/plans/2026-04-10-incremental-atlas-upload-implementation.md
@@ -0,0 +1,483 @@
# Incremental Atlas Upload Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Reduce atlas upload cost from ~1.7ms to near-zero by precomputing ASCII glyphs at startup and uploading only dirty atlas rows incrementally.

**Architecture:** Add `last_uploaded_y` and `needs_full_upload` tracking fields to the Atlas struct in `font.zig`. Add `uploadAtlasRegion` to `renderer.zig` with a persistent staging buffer, content-preserving layout transitions, and a dedicated transfer fence. Wire the precompute loop and incremental upload into `main.zig`.

**Tech Stack:** Zig 0.15, Vulkan host-visible staging buffers, image layout transitions, fence synchronization.

---

## File Structure

- Modify: `src/font.zig`
  - Add `last_uploaded_y: u32` and `needs_full_upload: bool` to `Atlas`
  - Update `init()` and `reset()` to set these fields
- Modify: `src/renderer.zig`
  - Add persistent staging buffer + dedicated transfer command buffer + transfer fence to `Context`
  - Add `uploadAtlasRegion(pixels, y_start, y_end, full)` method
  - Keep existing `uploadAtlas` as full-upload convenience wrapper
- Modify: `src/main.zig`
  - Add ASCII precompute loop at startup
  - Replace render-loop atlas upload with incremental path

### Task 1: Add dirty-region tracking fields to Atlas with tests

**Files:**
- Modify: `src/font.zig`
- Test: `src/font.zig`

- [ ] **Step 1: Write the failing tests**

Add at the bottom of `src/font.zig`, after the existing test blocks:

```zig
test "Atlas dirty tracking fields initialized correctly" {
    var atlas = try Atlas.init(std.testing.allocator, 256, 256);
    defer atlas.deinit();

    try std.testing.expectEqual(@as(u32, 0), atlas.last_uploaded_y);
    try std.testing.expect(atlas.needs_full_upload);
}

test "Atlas dirty region covers new glyphs" {
    var atlas = try Atlas.init(std.testing.allocator, 256, 256);
    defer atlas.deinit();

    // After init, cursor_y=0, row_height=1 (for the white pixel)
    const y_start = atlas.last_uploaded_y;
    const y_end = atlas.cursor_y + atlas.row_height;
    try std.testing.expectEqual(@as(u32, 0), y_start);
    try std.testing.expect(y_end > 0);
}

test "Atlas reset restores dirty tracking fields" {
    var atlas = try Atlas.init(std.testing.allocator, 256, 256);
    defer atlas.deinit();

    // Simulate having uploaded some region
    atlas.last_uploaded_y = 50;
    atlas.needs_full_upload = false;

    atlas.reset();

    try std.testing.expectEqual(@as(u32, 0), atlas.last_uploaded_y);
    try std.testing.expect(atlas.needs_full_upload);
}
```

- [ ] **Step 2: Run test to verify it fails**

Run: `zig build test 2>&1 | head -20`
Expected: FAIL — `last_uploaded_y` field does not exist.

- [ ] **Step 3: Add the fields to Atlas**

In `src/font.zig`, add to the `Atlas` struct fields (after `dirty: bool`):

```zig
    last_uploaded_y: u32,
    needs_full_upload: bool,
```

In `Atlas.init` (the return struct literal), add:

```zig
            .last_uploaded_y = 0,
            .needs_full_upload = true,
```

In `Atlas.reset`, add at the end (after `self.dirty = true;`):

```zig
        self.last_uploaded_y = 0;
        self.needs_full_upload = true;
```

- [ ] **Step 4: Run test to verify it passes**

Run: `zig build test 2>&1 | tail -5`
Expected: PASS

- [ ] **Step 5: Commit**

```bash
git add src/font.zig
git commit -m "Add dirty-region tracking fields to Atlas"
```

### Task 2: Add persistent staging buffer and transfer fence to renderer

**Files:**
- Modify: `src/renderer.zig`

- [ ] **Step 1: Add fields to Context struct**

In `src/renderer.zig`, add three new fields to the `Context` struct after `atlas_height: u32`:

```zig
    // Persistent atlas staging buffer (reused across frames)
    atlas_staging_buffer: vk.Buffer,
    atlas_staging_memory: vk.DeviceMemory,
    // Dedicated transfer command buffer + fence
    atlas_transfer_cb: vk.CommandBuffer,
    atlas_transfer_fence: vk.Fence,
```

- [ ] **Step 2: Allocate resources in Context.init**

In `Context.init`, after the atlas sampler creation and before the descriptor set update (around line 910), add:

```zig
        // --- Atlas staging buffer (persistent, reused across frames) ---
        const atlas_staging_size: vk.DeviceSize = @as(vk.DeviceSize, atlas_width) * atlas_height;
        const atlas_staging = try createHostVisibleBuffer(vki, pd_info.physical, vkd, device, atlas_staging_size, .{ .transfer_src_bit = true });
        errdefer {
            vkd.destroyBuffer(device, atlas_staging.buffer, null);
            vkd.freeMemory(device, atlas_staging.memory, null);
        }

        // --- Dedicated atlas transfer command buffer ---
        var atlas_transfer_cb: vk.CommandBuffer = undefined;
        try vkd.allocateCommandBuffers(device, &vk.CommandBufferAllocateInfo{
            .command_pool = command_pool,
            .level = .primary,
            .command_buffer_count = 1,
        }, @ptrCast(&atlas_transfer_cb));

        // --- Atlas transfer fence (starts signaled so first wait is a no-op) ---
        const atlas_transfer_fence = try vkd.createFence(device, &vk.FenceCreateInfo{
            .flags = .{ .signaled_bit = true },
        }, null);
        errdefer vkd.destroyFence(device, atlas_transfer_fence, null);
```

- [ ] **Step 3: Add new fields to the return struct**

In the return struct literal in `Context.init`, add after `.atlas_height = atlas_height`:

```zig
            .atlas_staging_buffer = atlas_staging.buffer,
            .atlas_staging_memory = atlas_staging.memory,
            .atlas_transfer_cb = atlas_transfer_cb,
            .atlas_transfer_fence = atlas_transfer_fence,
```

- [ ] **Step 4: Free resources in Context.deinit**

In `Context.deinit`, add after the atlas memory free (after `self.vkd.freeMemory(self.device, self.atlas_memory, null);`):

```zig
        self.vkd.destroyBuffer(self.device, self.atlas_staging_buffer, null);
        self.vkd.freeMemory(self.device, self.atlas_staging_memory, null);
        self.vkd.destroyFence(self.device, self.atlas_transfer_fence, null);
```

- [ ] **Step 5: Verify it compiles**

Run: `zig build 2>&1 | tail -5`
Expected: BUILD SUCCESS

- [ ] **Step 6: Run tests**

Run: `zig build test 2>&1 | tail -5`
Expected: PASS

- [ ] **Step 7: Commit**

```bash
git add src/renderer.zig
git commit -m "Add persistent staging buffer and transfer fence to renderer"
```

### Task 3: Implement uploadAtlasRegion

**Files:**
- Modify: `src/renderer.zig`

- [ ] **Step 1: Add the uploadAtlasRegion method**

Add after the existing `uploadAtlas` method in `Context`:

```zig
    /// Upload a horizontal band of the atlas (y_start..y_end) to the GPU.
    /// Uses the persistent staging buffer and dedicated transfer command buffer.
    /// If `full` is true, transitions from UNDEFINED (for initial/reset uploads).
    /// Otherwise transitions from SHADER_READ_ONLY (preserves existing data).
    pub fn uploadAtlasRegion(
        self: *Context,
        pixels: []const u8,
        y_start: u32,
        y_end: u32,
        full: bool,
    ) !void {
        if (y_start >= y_end) return;

        const byte_offset: usize = @as(usize, y_start) * self.atlas_width;
        const byte_len: usize = @as(usize, y_end - y_start) * self.atlas_width;

        // Wait for any prior atlas transfer to finish before reusing staging buffer
        _ = try self.vkd.waitForFences(self.device, 1, @ptrCast(&self.atlas_transfer_fence), .true, std.math.maxInt(u64));
        try self.vkd.resetFences(self.device, 1, @ptrCast(&self.atlas_transfer_fence));

        // Copy dirty band into staging buffer
        const mapped = try self.vkd.mapMemory(self.device, self.atlas_staging_memory, 0, @intCast(byte_len), .{});
        const dst: [*]u8 = @ptrCast(mapped.?); // mapMemory yields ?*anyopaque; unwrap before slicing
        @memcpy(dst[0..byte_len], pixels[byte_offset .. byte_offset + byte_len]);
        self.vkd.unmapMemory(self.device, self.atlas_staging_memory);

        // Record transfer command
        try self.vkd.resetCommandBuffer(self.atlas_transfer_cb, .{});
        try self.vkd.beginCommandBuffer(self.atlas_transfer_cb, &vk.CommandBufferBeginInfo{
            .flags = .{ .one_time_submit_bit = true },
        });

        // Barrier: old_layout -> TRANSFER_DST
        const old_layout: vk.ImageLayout = if (full) .undefined else .shader_read_only_optimal;
        const barrier_to_transfer = vk.ImageMemoryBarrier{
            .src_access_mask = if (full) @as(vk.AccessFlags, .{}) else .{ .shader_read_bit = true },
            .dst_access_mask = .{ .transfer_write_bit = true },
            .old_layout = old_layout,
            .new_layout = .transfer_dst_optimal,
            .src_queue_family_index = vk.QUEUE_FAMILY_IGNORED,
            .dst_queue_family_index = vk.QUEUE_FAMILY_IGNORED,
            .image = self.atlas_image,
            .subresource_range = .{
                .aspect_mask = .{ .color_bit = true },
                .base_mip_level = 0,
                .level_count = 1,
                .base_array_layer = 0,
                .layer_count = 1,
            },
        };
        const src_stage: vk.PipelineStageFlags = if (full) .{ .top_of_pipe_bit = true } else .{ .fragment_shader_bit = true };
        self.vkd.cmdPipelineBarrier(
            self.atlas_transfer_cb,
            src_stage,
            .{ .transfer_bit = true },
            .{},
            0, null,
            0, null,
            1, @ptrCast(&barrier_to_transfer),
        );

        // Copy staging buffer -> image (dirty band only)
        const region = vk.BufferImageCopy{
            .buffer_offset = 0,
            .buffer_row_length = 0,
            .buffer_image_height = 0,
            .image_subresource = .{
                .aspect_mask = .{ .color_bit = true },
                .mip_level = 0,
                .base_array_layer = 0,
                .layer_count = 1,
            },
            .image_offset = .{ .x = 0, .y = @intCast(y_start), .z = 0 },
            .image_extent = .{ .width = self.atlas_width, .height = y_end - y_start, .depth = 1 },
        };
        self.vkd.cmdCopyBufferToImage(
            self.atlas_transfer_cb,
            self.atlas_staging_buffer,
            self.atlas_image,
            .transfer_dst_optimal,
            1,
            @ptrCast(&region),
        );

        // Barrier: TRANSFER_DST -> SHADER_READ_ONLY
        const barrier_to_shader = vk.ImageMemoryBarrier{
            .src_access_mask = .{ .transfer_write_bit = true },
            .dst_access_mask = .{ .shader_read_bit = true },
            .old_layout = .transfer_dst_optimal,
            .new_layout = .shader_read_only_optimal,
            .src_queue_family_index = vk.QUEUE_FAMILY_IGNORED,
            .dst_queue_family_index = vk.QUEUE_FAMILY_IGNORED,
            .image = self.atlas_image,
            .subresource_range = .{
                .aspect_mask = .{ .color_bit = true },
                .base_mip_level = 0,
                .level_count = 1,
                .base_array_layer = 0,
                .layer_count = 1,
            },
        };
        self.vkd.cmdPipelineBarrier(
            self.atlas_transfer_cb,
            .{ .transfer_bit = true },
            .{ .fragment_shader_bit = true },
            .{},
            0, null,
            0, null,
            1, @ptrCast(&barrier_to_shader),
        );

        try self.vkd.endCommandBuffer(self.atlas_transfer_cb);

        // Submit with dedicated fence (no queueWaitIdle)
        try self.vkd.queueSubmit(self.graphics_queue, 1, @ptrCast(&vk.SubmitInfo{
            .command_buffer_count = 1,
            .p_command_buffers = @ptrCast(&self.atlas_transfer_cb),
        }), self.atlas_transfer_fence);
    }
```

- [ ] **Step 2: Verify it compiles**

Run: `zig build 2>&1 | tail -5`
Expected: BUILD SUCCESS

- [ ] **Step 3: Run tests**

Run: `zig build test 2>&1 | tail -5`
Expected: PASS

- [ ] **Step 4: Commit**

```bash
git add src/renderer.zig
git commit -m "Implement uploadAtlasRegion with incremental uploads"
```

### Task 4: Add ASCII precompute and wire incremental upload into main.zig

**Files:**
- Modify: `src/main.zig`

- [ ] **Step 1: Add ASCII precompute at startup**

In `src/main.zig`, replace the block at lines 171-172:

```zig
    // Upload empty atlas first (so descriptor set is valid)
    try ctx.uploadAtlas(atlas.pixels);
```

With:

```zig
    // Precompute printable ASCII glyphs (32-126) into atlas
    for (32..127) |cp| {
        _ = atlas.getOrInsert(&face, @intCast(cp)) catch |err| switch (err) {
            error.AtlasFull => break,
            else => return err,
        };
    }
    // Upload warm atlas (full upload — descriptor set needs valid data)
    try ctx.uploadAtlas(atlas.pixels);
    atlas.last_uploaded_y = atlas.cursor_y;
    atlas.needs_full_upload = false;
    atlas.dirty = false;
```

- [ ] **Step 2: Replace the render-loop atlas upload**

In `src/main.zig`, replace the atlas upload block (lines 477-482):

```zig
        // Re-upload atlas if new glyphs were added
        if (atlas.dirty) {
            try ctx.uploadAtlas(atlas.pixels);
            atlas.dirty = false;
            render_cache.layout_dirty = true;
        }
```

With:

```zig
        // Re-upload atlas if new glyphs were added (incremental)
        if (atlas.dirty) {
            const y_start = atlas.last_uploaded_y;
            const y_end = atlas.cursor_y + atlas.row_height;
            if (y_start < y_end) {
                try ctx.uploadAtlasRegion(
                    atlas.pixels,
                    y_start,
                    y_end,
                    atlas.needs_full_upload,
                );
                atlas.last_uploaded_y = atlas.cursor_y;
                atlas.needs_full_upload = false;
                render_cache.layout_dirty = true;
            }
            atlas.dirty = false;
        }
```

- [ ] **Step 3: Verify it compiles**

Run: `zig build 2>&1 | tail -5`
Expected: BUILD SUCCESS

- [ ] **Step 4: Run tests**

Run: `zig build test 2>&1 | tail -5`
Expected: PASS

- [ ] **Step 5: Commit**

```bash
git add src/main.zig
git commit -m "Wire ASCII precompute and incremental atlas upload"
```

### Task 5: Full verification

**Files:**
- Test: `src/font.zig`, `src/renderer.zig`, `src/main.zig`

- [ ] **Step 1: Run the full test suite**

Run: `zig build test`
Expected: PASS

- [ ] **Step 2: Manual smoke test — normal run**

Run: `zig build run`
Expected:
- Terminal opens and shows text correctly (precomputed ASCII atlas).
- Typing normal text works. Cursor renders.
- Exit dumps frame timing stats (assumes the performance-benchmarking plan's instrumentation is in place) — atlas_upload should be near 0 for most frames.

- [ ] **Step 3: Manual smoke test — Unicode character**

Run inside terminal: `echo "★ ← → ★"`
Expected: Characters render correctly (incremental upload fires for the first time these codepoints appear).

- [ ] **Step 4: Manual smoke test — bench comparison**

Run: `make bench`
Expected:
- atlas_upload avg should drop significantly from the baseline ~1700us.
- Steady-state frames should show atlas_upload near 0.

- [ ] **Step 5: Commit if any fixups were needed**

```bash
git add src/font.zig src/renderer.zig src/main.zig
git commit -m "Fix verification issues for incremental atlas upload"
```

## Self-Review

- **Spec coverage:**
  - `last_uploaded_y` + `needs_full_upload` fields: Task 1
  - `reset()` sets both fields: Task 1
  - Persistent staging buffer: Task 2
  - Transfer fence (starts signaled): Task 2
  - `uploadAtlasRegion` with partial copy: Task 3
  - Layout transition: `UNDEFINED` vs `SHADER_READ_ONLY` based on `full` flag: Task 3
  - Post-copy barrier back to `SHADER_READ_ONLY`: Task 3
  - Fence wait before reusing staging buffer: Task 3
  - No `queueWaitIdle`: Task 3
  - ASCII precompute (32-126): Task 4
  - Render-loop incremental wiring with `y_start < y_end` guard: Task 4
  - `last_uploaded_y = cursor_y` (not `cursor_y + row_height`): Task 4
  - Bench comparison: Task 5
- **Placeholder scan:** No TBD/TODO markers. All code blocks are complete.
- **Type consistency:**
  - `Atlas.last_uploaded_y` and `Atlas.needs_full_upload` defined in Task 1, used in Task 4
  - `Context.atlas_staging_buffer`, `atlas_staging_memory`, `atlas_transfer_cb`, `atlas_transfer_fence` defined in Task 2, used in Task 3
  - `uploadAtlasRegion(pixels, y_start, y_end, full)` defined in Task 3, called in Task 4
  - Existing `uploadAtlas` kept unchanged — used for initial full upload in Task 4
diff --git a/docs/superpowers/plans/2026-04-10-performance-benchmarking-implementation.md b/docs/superpowers/plans/2026-04-10-performance-benchmarking-implementation.md
new file mode 100644
index 0000000..fcc5b99
--- /dev/null
+++ b/docs/superpowers/plans/2026-04-10-performance-benchmarking-implementation.md
@@ -0,0 +1,711 @@
# Performance Benchmarking Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Add per-section frame timing instrumentation, a reproducible bench workload, and a perf/flamegraph target so we can measure responsiveness before and after fixing known bottlenecks.

**Architecture:** A 256-entry ring buffer of `FrameTiming` structs records microsecond timings for five render-loop sections. Stats are dumped to stderr on SIGUSR1 and clean exit. A `WAYSTTY_BENCH=1` env var swaps the user's shell for a fixed workload script. A `Makefile` provides `bench` and `profile` targets.

**Tech Stack:** Zig 0.15, `std.time.Timer`, `std.posix.sigaction`, `perf record`, `flamegraph.pl`/`stackcollapse-perf.pl`
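
The workload is only constrained above to be fixed and reproducible. As a sketch of what `WAYSTTY_BENCH=1` might launch — the contents here are an assumption, not part of the spec — the essentials are deterministic scrolling, SGR attribute churn, and at least one non-ASCII glyph so the atlas-upload path is exercised:

```shell
# Hypothetical WAYSTTY_BENCH workload (illustrative only, not the spec).
bench_workload() {
    i=0
    while [ "$i" -lt 500 ]; do
        # Cycle foreground colors and emit a non-ASCII glyph to force
        # scrolling, attribute changes, and atlas inserts.
        printf '\033[3%dm line %04d ★\033[0m\n' "$((i % 8))" "$i"
        i=$((i + 1))
    done
}
bench_workload | tail -1
```

Any deterministic mix works; what matters for before/after comparisons is that the same byte stream hits the terminal on every run.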

---

## File Structure

- Modify: `src/main.zig`
  - `FrameTiming` struct, `FrameTimingRing` ring buffer, `computeStats` helper, `formatStats` printer
  - SIGUSR1 signal handler that sets an atomic flag
  - Section timers wrapping each render-loop phase
  - `WAYSTTY_BENCH` env var check in the shell-selection block
  - Stats dump on clean exit
- Create: `Makefile`
  - `bench` target: build + run with `WAYSTTY_BENCH=1`, extract stats from stderr
  - `profile` target: build ReleaseSafe + `perf record` + flamegraph generation
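
The two targets described above could be sketched as follows. The target names come from this plan; the binary path, perf flags, and flamegraph script names are assumptions to be adjusted (recipes must be tab-indented):

```make
# Sketch of the planned Makefile targets (paths and flags are assumptions).
bench:
	zig build
	WAYSTTY_BENCH=1 ./zig-out/bin/waystty 2>bench.log || true
	grep -A 12 'frame timing' bench.log

profile:
	zig build -Doptimize=ReleaseSafe
	perf record -F 999 -g -o perf.data -- ./zig-out/bin/waystty
	perf script -i perf.data | stackcollapse-perf.pl | flamegraph.pl > flame.svg
```

`stackcollapse-perf.pl` and `flamegraph.pl` are assumed to be on `PATH` (from Brendan Gregg's FlameGraph repo); ReleaseSafe keeps stack frames meaningful while staying close to release performance.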

### Task 1: Add FrameTiming struct and ring buffer with tests

**Files:**
- Modify: `src/main.zig`
- Test: `src/main.zig`

- [ ] **Step 1: Write the failing tests**

Add at the bottom of `src/main.zig`, after the existing test blocks:

```zig
test "FrameTiming.total sums all sections" {
    const ft: FrameTiming = .{
        .snapshot_us = 10,
        .row_rebuild_us = 20,
        .atlas_upload_us = 30,
        .instance_upload_us = 40,
        .gpu_submit_us = 50,
    };
    try std.testing.expectEqual(@as(u32, 150), ft.total());
}

test "FrameTimingRing records and wraps correctly" {
    var ring = FrameTimingRing{};
    try std.testing.expectEqual(@as(usize, 0), ring.count);

    ring.push(.{ .snapshot_us = 1, .row_rebuild_us = 2, .atlas_upload_us = 3, .instance_upload_us = 4, .gpu_submit_us = 5 });
    try std.testing.expectEqual(@as(usize, 1), ring.count);
    try std.testing.expectEqual(@as(u32, 1), ring.entries[0].snapshot_us);

    // Fill to capacity
    for (1..FrameTimingRing.capacity) |i| {
        ring.push(.{ .snapshot_us = @intCast(i + 1), .row_rebuild_us = 0, .atlas_upload_us = 0, .instance_upload_us = 0, .gpu_submit_us = 0 });
    }
    try std.testing.expectEqual(FrameTimingRing.capacity, ring.count);

    // One more wraps around — overwrites entries[0], head advances to 1
    ring.push(.{ .snapshot_us = 999, .row_rebuild_us = 0, .atlas_upload_us = 0, .instance_upload_us = 0, .gpu_submit_us = 0 });
    try std.testing.expectEqual(FrameTimingRing.capacity, ring.count);
    // Newest entry is at (head + capacity - 1) % capacity = 0
    try std.testing.expectEqual(@as(u32, 999), ring.entries[0].snapshot_us);
    // head has advanced past the overwritten slot
    try std.testing.expectEqual(@as(usize, 1), ring.head);
}

test "FrameTimingRing.orderedSlice returns entries in insertion order after wrap" {
    var ring = FrameTimingRing{};
    // Push capacity + 3 entries so the ring wraps
    for (0..FrameTimingRing.capacity + 3) |i| {
        ring.push(.{ .snapshot_us = @intCast(i), .row_rebuild_us = 0, .atlas_upload_us = 0, .instance_upload_us = 0, .gpu_submit_us = 0 });
    }
    var buf: [FrameTimingRing.capacity]FrameTiming = undefined;
    const ordered = ring.orderedSlice(&buf);
    try std.testing.expectEqual(FrameTimingRing.capacity, ordered.len);
    // First entry should be the 4th pushed (index 3), last should be capacity+2
    try std.testing.expectEqual(@as(u32, 3), ordered[0].snapshot_us);
    try std.testing.expectEqual(@as(u32, FrameTimingRing.capacity + 2), ordered[ordered.len - 1].snapshot_us);
}
```

- [ ] **Step 2: Run test to verify it fails**

Run: `zig build test 2>&1 | head -20`
Expected: FAIL with `FrameTiming` not found.

- [ ] **Step 3: Implement FrameTiming and FrameTimingRing**

Add above the test blocks in `src/main.zig`:

```zig
const FrameTiming = struct {
    snapshot_us: u32 = 0,
    row_rebuild_us: u32 = 0,
    atlas_upload_us: u32 = 0,
    instance_upload_us: u32 = 0,
    gpu_submit_us: u32 = 0,

    fn total(self: FrameTiming) u32 {
        return self.snapshot_us +
            self.row_rebuild_us +
            self.atlas_upload_us +
            self.instance_upload_us +
            self.gpu_submit_us;
    }
};

const FrameTimingRing = struct {
    const capacity = 256;

    entries: [capacity]FrameTiming = [_]FrameTiming{.{}} ** capacity,
    head: usize = 0,
    count: usize = 0,

    fn push(self: *FrameTimingRing, timing: FrameTiming) void {
        const idx = if (self.count < capacity) self.count else self.head;
        self.entries[idx] = timing;
        if (self.count < capacity) {
            self.count += 1;
        } else {
            self.head = (self.head + 1) % capacity;
        }
    }

    /// Return a slice of valid entries in insertion order.
    /// Caller must provide a scratch buffer of `capacity` entries.
    fn orderedSlice(self: *const FrameTimingRing, buf: *[capacity]FrameTiming) []const FrameTiming {
        if (self.count < capacity) {
            return self.entries[0..self.count];
        }
        // Ring has wrapped — copy from head..end then 0..head
        const tail_len = capacity - self.head;
        @memcpy(buf[0..tail_len], self.entries[self.head..capacity]);
        @memcpy(buf[tail_len..capacity], self.entries[0..self.head]);
        return buf[0..capacity];
    }
};
```

- [ ] **Step 4: Run test to verify it passes**

Run: `zig build test 2>&1 | tail -5`
Expected: PASS

- [ ] **Step 5: Commit**

```bash
git add src/main.zig
git commit -m "Add FrameTiming struct and ring buffer"
```

### Task 2: Add stats computation and formatting with tests

**Files:**
- Modify: `src/main.zig`
- Test: `src/main.zig`

- [ ] **Step 1: Write the failing tests**

Add after the Task 1 tests in `src/main.zig`:

```zig
test "FrameTimingStats computes min/avg/p99/max correctly" {
    var ring = FrameTimingRing{};
    // Push 100 frames with snapshot_us = 1..100
    for (0..100) |i| {
        ring.push(.{
            .snapshot_us = @intCast(i + 1),
            .row_rebuild_us = 0,
            .atlas_upload_us = 0,
            .instance_upload_us = 0,
            .gpu_submit_us = 0,
        });
    }
    const stats = computeFrameStats(&ring);
    try std.testing.expectEqual(@as(u32, 1), stats.snapshot.min);
    try std.testing.expectEqual(@as(u32, 100), stats.snapshot.max);
    try std.testing.expectEqual(@as(u32, 50), stats.snapshot.avg);
    // p99 of 1..100 = value at index 98 (0-based) = 99
    try std.testing.expectEqual(@as(u32, 99), stats.snapshot.p99);
    try std.testing.expectEqual(@as(usize, 100), stats.frame_count);
}

test "FrameTimingStats handles empty ring" {
    var ring = FrameTimingRing{};
    const stats = computeFrameStats(&ring);
    try std.testing.expectEqual(@as(usize, 0), stats.frame_count);
    try std.testing.expectEqual(@as(u32, 0), stats.snapshot.min);
}
```

- [ ] **Step 2: Run test to verify it fails**

Run: `zig build test 2>&1 | head -20`
Expected: FAIL with `computeFrameStats` not found.

- [ ] **Step 3: Implement stats computation and formatting**

Add after the `FrameTimingRing` definition:

```zig
const SectionStats = struct {
    min: u32 = 0,
    avg: u32 = 0,
    p99: u32 = 0,
    max: u32 = 0,
};

const FrameTimingStats = struct {
    snapshot: SectionStats = .{},
    row_rebuild: SectionStats = .{},
    atlas_upload: SectionStats = .{},
    instance_upload: SectionStats = .{},
    gpu_submit: SectionStats = .{},
    total: SectionStats = .{},
    frame_count: usize = 0,
};

fn computeSectionStats(values: []u32) SectionStats {
    if (values.len == 0) return .{};
    std.mem.sort(u32, values, {}, std.sort.asc(u32));
    var sum: u64 = 0;
    for (values) |v| sum += v;
    const p99_idx = if (values.len <= 1) 0 else ((values.len - 1) * 99) / 100;
    return .{
        .min = values[0],
        .avg = @intCast(sum / values.len),
        .p99 = values[p99_idx],
        .max = values[values.len - 1],
    };
}

fn computeFrameStats(ring: *const FrameTimingRing) FrameTimingStats {
    if (ring.count == 0) return .{};

    var ordered_buf: [FrameTimingRing.capacity]FrameTiming = undefined;
    const entries = ring.orderedSlice(&ordered_buf);
    const n = entries.len;

    var snapshot_vals: [FrameTimingRing.capacity]u32 = undefined;
    var row_rebuild_vals: [FrameTimingRing.capacity]u32 = undefined;
    var atlas_upload_vals: [FrameTimingRing.capacity]u32 = undefined;
    var instance_upload_vals: [FrameTimingRing.capacity]u32 = undefined;
    var gpu_submit_vals: [FrameTimingRing.capacity]u32 = undefined;
    var total_vals: [FrameTimingRing.capacity]u32 = undefined;

    for (entries, 0..) |e, i| {
        snapshot_vals[i] = e.snapshot_us;
        row_rebuild_vals[i] = e.row_rebuild_us;
        atlas_upload_vals[i] = e.atlas_upload_us;
        instance_upload_vals[i] = e.instance_upload_us;
        gpu_submit_vals[i] = e.gpu_submit_us;
        total_vals[i] = e.total();
    }

    return .{
        .snapshot = computeSectionStats(snapshot_vals[0..n]),
        .row_rebuild = computeSectionStats(row_rebuild_vals[0..n]),
        .atlas_upload = computeSectionStats(atlas_upload_vals[0..n]),
        .instance_upload = computeSectionStats(instance_upload_vals[0..n]),
        .gpu_submit = computeSectionStats(gpu_submit_vals[0..n]),
        .total = computeSectionStats(total_vals[0..n]),
        .frame_count = n,
    };
}

fn printFrameStats(stats: FrameTimingStats) void {
    // std.debug.print writes to stderr and cannot fail; Zig 0.15's Writer
    // rework removed std.io.getStdErr(), so avoid the old writer API here.
    std.debug.print(
        \\
        \\=== waystty frame timing ({d} frames) ===
        \\{s:<20}{s:>6}{s:>6}{s:>6}{s:>6}  (us)
        \\{s:<20}{d:>6}{d:>6}{d:>6}{d:>6}
        \\{s:<20}{d:>6}{d:>6}{d:>6}{d:>6}
        \\{s:<20}{d:>6}{d:>6}{d:>6}{d:>6}
        \\{s:<20}{d:>6}{d:>6}{d:>6}{d:>6}
        \\{s:<20}{d:>6}{d:>6}{d:>6}{d:>6}
        \\----------------------------------------------------
        \\{s:<20}{d:>6}{d:>6}{d:>6}{d:>6}
        \\
    , .{
        stats.frame_count,
        "section",          "min",                    "avg",                    "p99",                    "max",
        "snapshot",         stats.snapshot.min,        stats.snapshot.avg,        stats.snapshot.p99,        stats.snapshot.max,
        "row_rebuild",      stats.row_rebuild.min,     stats.row_rebuild.avg,     stats.row_rebuild.p99,     stats.row_rebuild.max,
        "atlas_upload",     stats.atlas_upload.min,    stats.atlas_upload.avg,    stats.atlas_upload.p99,    stats.atlas_upload.max,
        "instance_upload",  stats.instance_upload.min, stats.instance_upload.avg, stats.instance_upload.p99, stats.instance_upload.max,
        "gpu_submit",       stats.gpu_submit.min,      stats.gpu_submit.avg,      stats.gpu_submit.p99,      stats.gpu_submit.max,
        "total",            stats.total.min,           stats.total.avg,           stats.total.p99,           stats.total.max,
    });
}
```

- [ ] **Step 4: Run test to verify it passes**

Run: `zig build test 2>&1 | tail -5`
Expected: PASS

- [ ] **Step 5: Commit**

```bash
git add src/main.zig
git commit -m "Add frame timing stats computation and formatting"
```

### Task 3: Add SIGUSR1 signal handler

**Files:**
- Modify: `src/main.zig`

- [ ] **Step 1: Add the signal flag and handler**

Add below the `FrameTimingRing` and stats code in `src/main.zig`:

```zig
var sigusr1_received: std.atomic.Value(bool) = std.atomic.Value(bool).init(false);

fn sigusr1Handler(_: c_int) callconv(.c) void {
    sigusr1_received.store(true, .release);
}

fn installSigusr1Handler() void {
    const act = std.posix.Sigaction{
        .handler = .{ .handler = sigusr1Handler },
        .mask = std.posix.sigemptyset(),
        .flags = std.posix.SA.RESTART, // Sigaction.flags is a c_uint bitmask, not a packed struct
    };
    std.posix.sigaction(std.posix.SIG.USR1, &act, null);
}
```

- [ ] **Step 2: Wire into runTerminal**

In `runTerminal`, right before the `// === main loop ===` comment (line 205), add:

```zig
    // === frame timing ===
    var frame_ring = FrameTimingRing{};
    installSigusr1Handler();
```

Inside the main loop, right after `clearConsumedDirtyFlags` (line 534), add:

```zig
        // Check for SIGUSR1 stats dump request
        if (sigusr1_received.swap(false, .acq_rel)) {
            printFrameStats(computeFrameStats(&frame_ring));
        }
```

Right after the main loop (after the `while` block ends, before `_ = try ctx.vkd.deviceWaitIdle`), add:

```zig
    // Dump timing stats on exit
    printFrameStats(computeFrameStats(&frame_ring));
```

- [ ] **Step 3: Verify it compiles**

Run: `zig build 2>&1 | tail -5`
Expected: BUILD SUCCESS (no test run needed — signal handling is not unit-testable)

- [ ] **Step 4: Commit**

```bash
git add src/main.zig
git commit -m "Add SIGUSR1 handler for frame timing stats dump"
```
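
With the handler installed, a running instance can be asked for a stats dump from another terminal, e.g. `kill -USR1 "$(pgrep waystty)"` (binary name assumed). The trap-and-dump mechanism itself can be sketched in plain shell — a background process installs a USR1 handler, the parent signals it, and the handler writes its "stats" before exiting:

```shell
# Minimal sketch of the signal-driven dump pattern (plain shell, not waystty).
out=$(mktemp)
( trap 'echo "=== stats dump ===" >> "$out"; exit 0' USR1
  while :; do sleep 0.1; done ) &
pid=$!
sleep 0.3             # give the child time to install the trap
kill -USR1 "$pid"
wait "$pid" 2>/dev/null || true
cat "$out"            # the handler's output lands in $out
```

The atomic-flag variant in the Zig code above follows the same shape, except the handler only sets a flag and the main loop does the actual printing, which keeps the signal handler async-signal-safe.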

### Task 4: Wire section timers into the render loop

**Files:**
- Modify: `src/main.zig`

This task wraps each render-loop section with `std.time.Timer` and pushes a `FrameTiming` entry after each rendered frame.

- [ ] **Step 1: Add timer helper**

Add near the other helper functions in `src/main.zig`:

```zig
fn usFromTimer(timer: std.time.Timer) u32 {
    // Timer.read takes *Timer, so work on a mutable copy.
    var t = timer;
    const ns = t.read();
    const us = ns / std.time.ns_per_us;
    return std.math.cast(u32, us) orelse std.math.maxInt(u32);
}
```

- [ ] **Step 2: Instrument the render loop**

In `runTerminal`, instrument the render section: the existing code between `// === render ===` (line 357) and `clearConsumedDirtyFlags` (line 534) gets wrapped with timers. Declare `var frame_timing: FrameTiming = .{};` before `// === render ===`, then time each section:

**snapshot section** — wrap `try term.snapshot();` (line 359):

```zig
        var frame_timing: FrameTiming = .{};

        // === render ===
        const previous_cursor = term.render_state.cursor;
        var section_timer = std.time.Timer.start() catch unreachable;
        try term.snapshot();
        frame_timing.snapshot_us = usFromTimer(section_timer);
```

**row_rebuild section** — wrap the dirty-row rebuild loop (the `var rows_rebuilt` through cursor rebuild blocks):

```zig
        section_timer = std.time.Timer.start() catch unreachable;
```

Right before `// Re-upload atlas if new glyphs were added` (line 452):

```zig
        frame_timing.row_rebuild_us = usFromTimer(section_timer);
```

**atlas_upload section** — wrap the atlas upload block:

```zig
        section_timer = std.time.Timer.start() catch unreachable;
        // Re-upload atlas if new glyphs were added
        if (atlas.dirty) {
            try ctx.uploadAtlas(atlas.pixels);
            atlas.dirty = false;
            render_cache.layout_dirty = true;
        }
        frame_timing.atlas_upload_us = usFromTimer(section_timer);
```

**instance_upload section** — wrap the upload plan + upload blocks:

```zig
        section_timer = std.time.Timer.start() catch unreachable;
```

Right before `const baseline_coverage = renderer.coverageVariantParams(.baseline);` (line 517):

```zig
        frame_timing.instance_upload_us = usFromTimer(section_timer);
```

**gpu_submit section** — wrap `ctx.drawCells(...)`:

```zig
        section_timer = std.time.Timer.start() catch unreachable;
        const baseline_coverage = renderer.coverageVariantParams(.baseline);
        ctx.drawCells(
            render_cache.total_instance_count,
            .{ @floatFromInt(cell_w), @floatFromInt(cell_h) },
            default_bg,
            baseline_coverage,
        ) catch |err| switch (err) {
            error.OutOfDateKHR => {
                _ = try ctx.vkd.deviceWaitIdle(ctx.device);
                const buf_w = window.width * @as(u32, @intCast(geom.buffer_scale));
                const buf_h = window.height * @as(u32, @intCast(geom.buffer_scale));
                try ctx.recreateSwapchain(buf_w, buf_h);
                render_pending = true;
                continue;
            },
            else => return err,
        };
        frame_timing.gpu_submit_us = usFromTimer(section_timer);
```

**Push timing entry** — right after the gpu_submit timer read, before `clearConsumedDirtyFlags`:

```zig
        frame_ring.push(frame_timing);
```

- [ ] **Step 3: Verify it compiles**

Run: `zig build 2>&1 | tail -5`
Expected: BUILD SUCCESS

- [ ] **Step 4: Run tests to verify nothing broke**

Run: `zig build test 2>&1 | tail -5`
Expected: PASS

- [ ] **Step 5: Commit**

```bash
git add src/main.zig
git commit -m "Instrument render loop with per-section frame timers"
```

### Task 5: Add WAYSTTY_BENCH shell override

**Files:**
- Modify: `src/main.zig`

- [ ] **Step 1: Replace the shell selection block**

In `runTerminal`, the current shell selection code (lines 190-194) is:

```zig
    const shell: [:0]const u8 = blk: {
        const shell_env = std.posix.getenv("SHELL") orelse "/bin/sh";
        break :blk try alloc.dupeZ(u8, shell_env);
    };
    defer alloc.free(shell);
```

Replace it with:

```zig
    const shell: [:0]const u8 = blk: {
        if (std.posix.getenv("WAYSTTY_BENCH") != null) {
            break :blk try alloc.dupeZ(u8, "/bin/sh");
        }
        const shell_env = std.posix.getenv("SHELL") orelse "/bin/sh";
        break :blk try alloc.dupeZ(u8, shell_env);
    };
    defer alloc.free(shell);

    const bench_script: ?[:0]const u8 = if (std.posix.getenv("WAYSTTY_BENCH") != null)
        "echo warmup; sleep 0.2; seq 1 50000; find /usr/lib -name '*.so' 2>/dev/null | head -500; yes 'hello world' | head -2000; exit 0"
    else
        null;
```

- [ ] **Step 2: Pass bench script as shell arg when set**

Replace the `pty.Pty.spawn` call (line 196) with:

```zig
    var p = try pty.Pty.spawn(.{
        .cols = cols,
        .rows = rows,
        .shell = shell,
        .shell_args = if (bench_script) |script| &.{ "-c", script } else null,
    });
```

- [ ] **Step 3: Update pty.zig to accept shell_args**

In `src/pty.zig`, modify the `SpawnOptions` struct (line 18) to add `shell_args`:

```zig
    pub const SpawnOptions = struct {
        cols: u16,
        rows: u16,
        shell: [:0]const u8,
        shell_args: ?[]const [:0]const u8 = null,
    };
```

In the `spawn` function, replace the `argv` construction (line 40) with:

```zig
            if (opts.shell_args) |args| {
                std.debug.assert(args.len < 15); // argv[0] = shell, must fit in 16-slot buffer
                var argv_buf: [16:null]?[*:0]const u8 = .{null} ** 16;
                argv_buf[0] = opts.shell.ptr;
                for (args, 1..) |arg, i| {
                    argv_buf[i] = arg.ptr;
                }
                std.posix.execveZ(opts.shell.ptr, &argv_buf, std.c.environ) catch {};
            } else {
                var argv = [_:null]?[*:0]const u8{ opts.shell.ptr, null };
                std.posix.execveZ(opts.shell.ptr, &argv, std.c.environ) catch {};
            }
```

- [ ] **Step 4: Verify it compiles**

Run: `zig build 2>&1 | tail -5`
Expected: BUILD SUCCESS

- [ ] **Step 5: Run tests**

Run: `zig build test 2>&1 | tail -5`
Expected: PASS

- [ ] **Step 6: Commit**

```bash
git add src/main.zig src/pty.zig
git commit -m "Add WAYSTTY_BENCH env var for bench workload"
```

### Task 6: Create Makefile with bench and profile targets

**Files:**
- Create: `Makefile`

- [ ] **Step 1: Create the Makefile**

Create `Makefile` in the project root:

```makefile
ZIG ?= zig
FLAMEGRAPH ?= flamegraph.pl
STACKCOLLAPSE ?= stackcollapse-perf.pl

.PHONY: build run test bench profile clean

build:
	$(ZIG) build

run: build
	$(ZIG) build run

test:
	$(ZIG) build test

zig-out/bin/waystty: $(wildcard src/*.zig) $(wildcard shaders/*)
	$(ZIG) build

bench: zig-out/bin/waystty
	WAYSTTY_BENCH=1 ./zig-out/bin/waystty 2>bench.log || true
	@echo "--- frame timing ---"
	@grep -A 12 "waystty frame timing" bench.log || echo "(no timing data found)"

profile:
	$(ZIG) build -Doptimize=ReleaseSafe
	perf record -g -F 999 --no-inherit -o perf.data -- \
		sh -c 'WAYSTTY_BENCH=1 ./zig-out/bin/waystty 2>bench.log'
	perf script -i perf.data \
		| $(STACKCOLLAPSE) \
		| $(FLAMEGRAPH) > flamegraph.svg
	@echo "--- frame timing ---"
	@grep -A 12 "waystty frame timing" bench.log || echo "(no timing data found)"
	xdg-open flamegraph.svg

clean:
	rm -rf zig-out .zig-cache perf.data bench.log flamegraph.svg
```

- [ ] **Step 2: Verify bench target syntax**

Run: `make -n bench`
Expected: prints the commands that would run (dry run), no syntax errors.

- [ ] **Step 3: Verify profile target syntax**

Run: `make -n profile`
Expected: prints the commands that would run (dry run), no syntax errors.

- [ ] **Step 4: Commit**

```bash
git add Makefile
git commit -m "Add Makefile with bench and profile targets"
```

### Task 7: Full verification

**Files:**
- Test: `src/main.zig`, `src/pty.zig`

- [ ] **Step 1: Run the full test suite**

Run: `zig build test`
Expected: PASS

- [ ] **Step 2: Manual smoke test — normal run**

Run: `zig build run`
Expected:
- Terminal opens and works normally.
- On Ctrl+D / exit, frame timing stats print to stderr.

- [ ] **Step 3: Manual smoke test — SIGUSR1**

In one terminal: `zig build run`
In another terminal: `kill -USR1 $(pgrep waystty)`
Expected: frame timing stats print to stderr of the running waystty.

- [ ] **Step 4: Manual smoke test — bench**

Run: `make bench`
Expected:
- waystty opens, runs the bench workloads, exits.
- `bench.log` contains frame timing stats.
- Stats are printed to the console.

- [ ] **Step 5: Commit if any fixups were needed**

```bash
git add src/main.zig src/pty.zig Makefile
git commit -m "Fix verification issues for performance benchmarking"
```

## Self-Review

- **Spec coverage:**
  - Ring buffer: Task 1
  - Stats computation (min/avg/p99/max): Task 2
  - SIGUSR1 trigger: Task 3
  - Section timers: Task 4
  - WAYSTTY_BENCH shell override: Task 5
  - Makefile bench target: Task 6
  - Makefile profile target: Task 6
  - Clean exit stats dump: Task 3
- **Placeholder scan:** No TBD/TODO markers. All code blocks are complete.
- **Type consistency:**
  - `FrameTiming` defined in Task 1, used in Tasks 2-4
  - `FrameTimingRing` defined in Task 1, used in Tasks 2-4
  - `computeFrameStats` defined in Task 2, called in Task 3
  - `printFrameStats` defined in Task 2, called in Task 3
  - `usFromTimer` defined in Task 4, used in Task 4
  - `SpawnOptions.shell_args` added in Task 5, used in Task 5
  - `sigusr1_received` and `installSigusr1Handler` defined in Task 3, used in Tasks 3-4
diff --git a/docs/superpowers/specs/2026-04-10-incremental-atlas-upload-design.md b/docs/superpowers/specs/2026-04-10-incremental-atlas-upload-design.md
new file mode 100644
index 0000000..99183c6
--- /dev/null
+++ b/docs/superpowers/specs/2026-04-10-incremental-atlas-upload-design.md
@@ -0,0 +1,145 @@
# Incremental Atlas Upload Design

## Goal

Reduce atlas upload cost from full-texture re-upload (~1.7ms avg, 3.6ms peak) to near-zero for steady-state frames by uploading only new glyph rows and precomputing the common ASCII set at startup.

## Current Problem

Every time a new glyph is rasterized into the atlas, `uploadAtlas` re-uploads the entire atlas texture (1024x1024 = 1MB at 1x, 2048x2048 = 4MB at 2x) through a freshly allocated staging buffer, transitions the image layout from `UNDEFINED` (discarding the existing GPU contents), and calls `queueWaitIdle` (a CPU stall). Bench data shows this is 61% of average frame time.

## Two Complementary Changes

### 1. Atlas precomputation

Rasterize printable ASCII (codepoints 32–126, 95 characters) into the atlas at startup, before the first frame renders. Do a single full upload of the warm atlas. This eliminates the cold-start spike entirely — most terminal content uses only these characters.
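
A minimal sketch of the startup loop. The glyph-insertion helper name is an assumption — `face.rasterizeGlyph` is a placeholder for whatever entry point `font.zig` actually exposes:

```zig
// Warm the atlas with printable ASCII before the first frame.
// `face.rasterizeGlyph` is a hypothetical name — substitute the real
// glyph-insertion helper from font.zig.
var cp: u21 = 32;
while (cp <= 126) : (cp += 1) {
    _ = try face.rasterizeGlyph(&atlas, cp);
}
try ctx.uploadAtlas(atlas.pixels); // single full upload of the warm atlas
atlas.dirty = false;
```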

### 2. Incremental upload

For glyphs added after startup (Unicode, CJK, symbols), upload only the new rows instead of the entire texture.

## Dirty-Region Tracking

Add two fields to the `Atlas` struct:
- `last_uploaded_y: u32` — initialized to 0 by `init()` and `reset()`. Tracks how far down the atlas the GPU copy is known-good.
- `needs_full_upload: bool` — set to `true` by `init()` and `reset()`; cleared after a full upload completes.

The dirty region is always a horizontal band spanning the full atlas width:
- `y_start` = `last_uploaded_y`
- `y_end` = `cursor_y + row_height`

After a successful upload, set `last_uploaded_y = cursor_y` (NOT `cursor_y + row_height`). This ensures the in-progress row is always re-uploaded on the next frame if new glyphs are added to it at new X positions. The cost of re-uploading one row (~20KB for a 20px row in a 1024-wide atlas) is negligible.

Once the packing cursor wraps to a new row, `cursor_y` advances past the previously uploaded row, and those rows are never re-uploaded again.

On `reset()` (DPI/scale change), set `last_uploaded_y = 0` and `needs_full_upload = true`.

If `y_start == y_end`, skip the upload and clear `atlas.dirty` — no pixels actually changed.
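
The tracking rules can be exercised as a small table-top test (pure arithmetic, no GPU needed; this mirrors the unit tests listed in the Testing section):

```zig
const std = @import("std");

test "dirty band advances only when the packer wraps" {
    const row_height: u32 = 20;
    var last_uploaded_y: u32 = 0;
    var cursor_y: u32 = 0;

    // frame 1: glyphs land on row 0 — band is 0..20
    try std.testing.expectEqual(@as(u32, 20), cursor_y + row_height);
    last_uploaded_y = cursor_y; // stays 0: in-progress row re-uploads next time

    // frame 2: packer wraps to the next row — band is 0..40
    cursor_y = 20;
    try std.testing.expectEqual(@as(u32, 0), last_uploaded_y);
    try std.testing.expectEqual(@as(u32, 40), cursor_y + row_height);
    last_uploaded_y = cursor_y; // rows 0..20 are now final
    try std.testing.expectEqual(@as(u32, 20), last_uploaded_y);
}
```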

## Renderer Changes

Replace `uploadAtlas(pixels)` with `uploadAtlasRegion(pixels, y_start, y_end, full)`:

### Persistent staging buffer

Allocate once at `Context.init`, sized to hold the full atlas (1024x1024 = 1MB, fixed regardless of DPI). Reuse across frames. Free at `Context.deinit`. No per-frame alloc/free.

### Partial staging copy

Only copy the dirty band of pixels into the staging buffer. Byte range: `y_start * atlas_width` to `y_end * atlas_width`.
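
At one byte per pixel (the 1024x1024 = 1MB sizing implies a single-channel atlas), a row is `atlas_width` bytes and the dirty band is one contiguous slice. A sketch, where `staging_mapped` is a placeholder name for the persistently mapped staging pointer:

```zig
// Copy the dirty band at the same offset it occupies in the atlas, so
// BufferImageCopy.buffer_offset can simply reuse `offset`.
const offset: usize = y_start * atlas_width;
const len: usize = (y_end - y_start) * atlas_width;
@memcpy(staging_mapped[offset .. offset + len], pixels[offset .. offset + len]);
```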

### Layout transition preserves contents

- Incremental upload: `SHADER_READ_ONLY_OPTIMAL → TRANSFER_DST_OPTIMAL` (preserves existing GPU data)
- Full upload (after reset or initial): `UNDEFINED → TRANSFER_DST_OPTIMAL` (discards, no preservation needed)

The `needs_full_upload` flag controls which transition is used.
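
In vulkan-zig-style enum names (an assumption based on the `ctx.vkd` usage elsewhere in this plan), the flag boils down to a single conditional:

```zig
// `full` is true after init()/reset(): the GPU contents are stale and may
// be discarded. Otherwise preserve what is already resident on the GPU.
const old_layout: vk.ImageLayout =
    if (full) .@"undefined" else .shader_read_only_optimal;
// old_layout feeds the pre-copy ImageMemoryBarrier; the post-copy barrier
// is always transfer_dst_optimal -> shader_read_only_optimal.
```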

### Post-copy barrier

After the `BufferImageCopy`, transition back: `TRANSFER_DST_OPTIMAL → SHADER_READ_ONLY_OPTIMAL`. This is required for both full and incremental uploads (same as the existing code).

### Partial image copy

The `BufferImageCopy` region targets only the dirty rows:
- `image_offset = { .x = 0, .y = y_start, .z = 0 }`
- `image_extent = { .width = atlas_width, .height = y_end - y_start, .depth = 1 }`

### Remove queueWaitIdle

Replace with a dedicated transfer fence. At the start of `uploadAtlasRegion`, if a prior transfer fence is unsignaled, wait on it before writing to the staging buffer or re-recording the command buffer. This prevents corruption if two uploads happen in consecutive frames. Submit the transfer command with the fence attached; the GPU signals it when the copy completes.

This is still a win over `queueWaitIdle` because the fence only waits on the single transfer command, not the entire graphics queue.
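
A sketch of the fence discipline. The `vkd` method names and signatures are assumptions based on vulkan-zig-style bindings, and `transfer_fence`/`submit_info` are hypothetical field names:

```zig
// Before touching the staging buffer or re-recording the command buffer,
// make sure the previous transfer has retired, then reset the fence.
_ = try ctx.vkd.waitForFences(ctx.device, 1, @ptrCast(&ctx.transfer_fence), vk.TRUE, std.math.maxInt(u64));
try ctx.vkd.resetFences(ctx.device, 1, @ptrCast(&ctx.transfer_fence));

// ... copy the dirty band into staging, record the transfer commands ...

// Submit with the fence attached; the GPU signals it when the copy is done.
try ctx.vkd.queueSubmit(ctx.graphics_queue, 1, @ptrCast(&submit_info), ctx.transfer_fence);
```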

## Caller-Side Wiring (main.zig)

### Startup precompute

After `Atlas.init` and before the main loop, rasterize codepoints 32–126 into the atlas, then do a single full upload via the existing `uploadAtlas` path.

### Render loop

Replace:
```zig
if (atlas.dirty) {
    try ctx.uploadAtlas(atlas.pixels);
    atlas.dirty = false;
    render_cache.layout_dirty = true;
}
```

With:
```zig
if (atlas.dirty) {
    const y_start = atlas.last_uploaded_y;
    const y_end = atlas.cursor_y + atlas.row_height;
    if (y_start < y_end) {
        try ctx.uploadAtlasRegion(
            atlas.pixels,
            y_start,
            y_end,
            atlas.needs_full_upload,
        );
        atlas.last_uploaded_y = atlas.cursor_y;
        atlas.needs_full_upload = false;
        render_cache.layout_dirty = true;
    }
    atlas.dirty = false;
}
```

## Files Changed

- `src/font.zig` — add `last_uploaded_y` and `needs_full_upload` fields to `Atlas`, reset them in `reset()`
- `src/renderer.zig` — add persistent staging buffer, `uploadAtlasRegion` method, dedicated transfer fence and command buffer
- `src/main.zig` — startup precompute loop, render-loop wiring change

## Testing

### Unit tests (font.zig)

- `last_uploaded_y` starts at 0 and `needs_full_upload` starts `true` after `init()`
- After inserting a glyph, dirty region is `0..cursor_y + row_height`
- After `reset()`, `last_uploaded_y` resets to 0 and `needs_full_upload` is `true`

### Unit tests (renderer.zig)

- `uploadAtlasRegion` byte offset/length calculation: `y_start * width` to `y_end * width`
- Full-upload flag selects `UNDEFINED` vs `SHADER_READ_ONLY` as the old layout

### Manual smoke tests

- Startup shows text correctly (precomputed atlas works)
- Typing a rare Unicode character (`echo "★"`) renders correctly (incremental upload works)
- DPI change still works (full re-upload after reset)
- `make bench` shows `atlas_upload` dropping from ~1700us to near-zero in steady state

## Future Consideration

Precomputing box-drawing (U+2500–U+257F) and block element (U+2580–U+259F) characters would improve first-render for TUI apps like tmux, htop, and tree. Not needed for this phase — the incremental upload handles them correctly on first appearance.

## Non-Goals

- Atlas resizing (atlas is fixed at 1024x1024 regardless of DPI, returns `AtlasFull` error if exhausted)
- Double-buffered atlas images (overkill for a terminal)
- Async transfer queue (single queue is sufficient)
diff --git a/docs/superpowers/specs/2026-04-10-performance-benchmarking-design.md b/docs/superpowers/specs/2026-04-10-performance-benchmarking-design.md
new file mode 100644
index 0000000..c86deec
--- /dev/null
+++ b/docs/superpowers/specs/2026-04-10-performance-benchmarking-design.md
@@ -0,0 +1,140 @@
# Performance Benchmarking Design

## Goal

Establish a reproducible performance baseline for waystty before tackling known bottlenecks. The primary metric is responsiveness under real workloads — not synthetic throughput scores.

## Non-goals

- vtebench integration (rewards batching, doesn't measure latency)
- tracy GPU profiling (GPU draw cost is negligible for a terminal; CPU-side bottlenecks dominate)
- Input-to-display latency measurement (out of scope for this phase)

## Known bottlenecks (to be measured, then fixed)

1. Atlas full re-upload on any new glyph — entire atlas through staging buffer + `queueWaitIdle` stall
2. Instance buffer map/unmap on every frame — host-visible memory can stay persistently mapped
3. Atlas staging buffer allocated/freed on every upload — should be persistent
4. Atlas image layout transitions from `UNDEFINED` — should go `SHADER_READ_ONLY → TRANSFER_DST → SHADER_READ_ONLY` for incremental updates

## Module 1: Frame timing ring buffer

### Instrumented sections

Five sections timed with `std.time.Timer` on every rendered frame:

| Section | What it covers |
|---|---|
| `snapshot` | `term.snapshot()` |
| `row_rebuild` | refresh planning + dirty-row rebuild + cursor rebuild |
| `atlas_upload` | `ctx.uploadAtlas(...)` — zero when atlas is not dirty |
| `instance_upload` | `uploadInstances` / `uploadInstanceRange` |
| `gpu_submit` | fence wait + image acquire + command record + submit + present. Note: the fence wait blocks on the *previous* frame's GPU work, so this section includes GPU execution time of frame N-1. This is correct for latency measurement (actual wall-clock cost of this phase). |

Idle frames (no render) are not recorded.

### Data structure

256-entry ring buffer of `FrameTiming` structs in `src/main.zig`. All fields are `u32` microseconds, so the ring is 256 × 5 × 4 bytes = 5KB. Always compiled in — timer reads are negligible overhead.

```zig
const FrameTiming = struct {
    snapshot_us: u32 = 0,
    row_rebuild_us: u32 = 0,
    atlas_upload_us: u32 = 0,
    instance_upload_us: u32 = 0,
    gpu_submit_us: u32 = 0,
};
```
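
A minimal ring sketch consistent with the struct above (the actual Task 1 implementation may differ in details):

```zig
const FrameTimingRing = struct {
    entries: [256]FrameTiming = undefined,
    head: usize = 0, // next write slot
    len: usize = 0, // valid entries, saturates at 256

    fn push(self: *FrameTimingRing, t: FrameTiming) void {
        self.entries[self.head] = t;
        self.head = (self.head + 1) % self.entries.len;
        if (self.len < self.entries.len) self.len += 1;
    }
};
```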

### Stats output

Triggered on SIGUSR1 and on clean exit. Prints to stderr:

```
=== waystty frame timing (243 frames) ===
section          min    avg    p99    max  (µs)
snapshot           2      4     15     89
row_rebuild        1     12    124    890
atlas_upload       0    180   5200   8100
instance_upload    1      6     24     71
gpu_submit         3      8     35    210
─────────────────────────────────────────
total              9    210   5400   8800
```

p99 computed by sorting a copy of the 256 values per section.
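
The percentile step sketched in Zig — copy, sort ascending, index at the nearest rank (off-by-one conventions for percentiles vary; this is one reasonable choice):

```zig
const std = @import("std");

// Nearest-rank p99 over one section's recorded samples (up to 256 entries).
fn p99(values: []const u32) u32 {
    std.debug.assert(values.len > 0 and values.len <= 256);
    var buf: [256]u32 = undefined;
    @memcpy(buf[0..values.len], values);
    std.mem.sort(u32, buf[0..values.len], {}, std.sort.asc(u32));
    return buf[(values.len * 99) / 100]; // index 253 when the ring is full
}
```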

## Module 2: Bench workload

### Mechanism

When `WAYSTTY_BENCH=1` env var is set at startup, spawn `sh -c '<bench script>'` instead of `$SHELL`. Stats are dumped to stderr on exit (clean shell exit triggers the normal exit path).

### Workloads

```sh
echo warmup; sleep 0.2;
seq 1 50000;
find /usr/lib -name '*.so' 2>/dev/null | head -500;
yes 'hello world' | head -2000;
exit 0
```

- `echo warmup; sleep 0.2` — lets the atlas rasterize common ASCII before timing real workloads
- `seq` — burst of short sequential lines, tests frame batching and row rebuild
- `find` — irregular line lengths, mixed output cadence
- `yes` — high-frequency identical lines, tests the low-change-rate path

### Makefile target

```makefile
.PHONY: bench
bench: zig-out/bin/waystty
	WAYSTTY_BENCH=1 ./zig-out/bin/waystty 2>bench.log
	@echo "--- frame timing ---"
	@grep -A 12 "waystty frame timing" bench.log
```

## Module 3: perf + flamegraph

### Build mode

`ReleaseSafe` — keeps debug symbols and frame pointers. `ReleaseFast` may omit frame pointers, producing useless perf stacks.

### Makefile target

```makefile
FLAMEGRAPH ?= flamegraph.pl
STACKCOLLAPSE ?= stackcollapse-perf.pl

.PHONY: profile
profile:
	zig build -Doptimize=ReleaseSafe
	perf record -g -F 999 --no-inherit -o perf.data -- \
		sh -c 'WAYSTTY_BENCH=1 ./zig-out/bin/waystty 2>bench.log'
	perf script -i perf.data \
		| $(STACKCOLLAPSE) \
		| $(FLAMEGRAPH) > flamegraph.svg
	@echo "--- frame timing ---"
	@grep -A 12 "waystty frame timing" bench.log
	xdg-open flamegraph.svg
```

`FLAMEGRAPH` and `STACKCOLLAPSE` default to scripts in `PATH` (available via `flamegraph` package on Arch), overridable: `make profile FLAMEGRAPH=~/FlameGraph/flamegraph.pl`.

### Prerequisites

- `flamegraph` package (provides `flamegraph.pl` and `stackcollapse-perf.pl`)
- `perf` with `CAP_PERFMON` or `/proc/sys/kernel/perf_event_paranoid <= 1`

## Files changed

- `src/main.zig` — ring buffer, section timers, SIGUSR1 handler, `WAYSTTY_BENCH` env check
- `Makefile` — `bench` and `profile` targets

## Testing

- Run `make bench`, verify stats appear in bench.log
- Send SIGUSR1 to a running waystty, verify stats print to stderr
- Run `make profile`, verify flamegraph.svg opens and shows waystty frames