fc9f9849
Add performance benchmarking and incremental atlas upload specs and plans
a73x 2026-04-10 10:17
Design specs and implementation plans for:
- Per-section frame timing with ring buffer, SIGUSR1 stats dump, WAYSTTY_BENCH workload, and perf/flamegraph Makefile targets
- Incremental atlas upload with ASCII precompute, dirty-region tracking, persistent staging buffer, and fence-based sync

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
diff --git a/docs/superpowers/plans/2026-04-10-incremental-atlas-upload-implementation.md b/docs/superpowers/plans/2026-04-10-incremental-atlas-upload-implementation.md
new file mode 100644
index 0000000..82878de
--- /dev/null
+++ b/docs/superpowers/plans/2026-04-10-incremental-atlas-upload-implementation.md
@@ -0,0 +1,483 @@
# Incremental Atlas Upload Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Reduce atlas upload cost from ~1.7ms to near-zero by precomputing ASCII glyphs at startup and uploading only dirty atlas rows incrementally.

**Architecture:** Add `last_uploaded_y` and `needs_full_upload` tracking fields to the Atlas struct in `font.zig`. Add `uploadAtlasRegion` to `renderer.zig` with a persistent staging buffer, content-preserving layout transitions, and a dedicated transfer fence. Wire the precompute loop and incremental upload into `main.zig`.

**Tech Stack:** Zig 0.15, Vulkan host-visible staging buffers, image layout transitions, fence synchronization.
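The dirty-band bookkeeping this plan builds on is simple enough to model outside Zig. The sketch below (Python for brevity, illustrative only — the plan implements the real logic in `uploadAtlasRegion`) shows how a band's byte offset and length follow from `last_uploaded_y` and the atlas cursor, assuming a one-byte-per-pixel atlas:

```python
# Model of the dirty-band math described above. Assumes an R8 atlas
# (one byte per pixel), so one atlas row is ATLAS_WIDTH bytes.
ATLAS_WIDTH = 1024

def dirty_band(last_uploaded_y: int, cursor_y: int, row_height: int):
    """Return (byte_offset, byte_len) for the band that needs uploading,
    or None when nothing new was rasterized."""
    y_start = last_uploaded_y
    y_end = cursor_y + row_height  # current row may still be filling
    if y_start >= y_end:
        return None
    return (y_start * ATLAS_WIDTH, (y_end - y_start) * ATLAS_WIDTH)

# Warm upload after ASCII precompute, glyph rows occupying y=0..96:
print(dirty_band(0, 80, 16))   # (0, 98304) — 96 rows
# Steady state: one new glyph lands in the row at y=96..112:
print(dirty_band(96, 96, 16))  # (98304, 16384) — 16 rows, not the full 1 MiB
```

The second call is the whole point of the change: a lone new glyph costs a 16 KiB copy instead of a full-texture re-upload.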
--- ## File Structure - Modify: `src/font.zig` - Add `last_uploaded_y: u32` and `needs_full_upload: bool` to `Atlas` - Update `init()` and `reset()` to set these fields - Modify: `src/renderer.zig` - Add persistent staging buffer + dedicated transfer command buffer + transfer fence to `Context` - Add `uploadAtlasRegion(pixels, y_start, y_end, full)` method - Keep existing `uploadAtlas` as full-upload convenience wrapper - Modify: `src/main.zig` - Add ASCII precompute loop at startup - Replace render-loop atlas upload with incremental path ### Task 1: Add dirty-region tracking fields to Atlas with tests **Files:** - Modify: `src/font.zig` - Test: `src/font.zig` - [ ] **Step 1: Write the failing tests** Add at the bottom of `src/font.zig`, after the existing test blocks: ```zig test "Atlas dirty tracking fields initialized correctly" { var atlas = try Atlas.init(std.testing.allocator, 256, 256); defer atlas.deinit(); try std.testing.expectEqual(@as(u32, 0), atlas.last_uploaded_y); try std.testing.expect(atlas.needs_full_upload); } test "Atlas dirty region covers new glyphs" { var atlas = try Atlas.init(std.testing.allocator, 256, 256); defer atlas.deinit(); // After init, cursor_y=0, row_height=1 (for the white pixel) const y_start = atlas.last_uploaded_y; const y_end = atlas.cursor_y + atlas.row_height; try std.testing.expectEqual(@as(u32, 0), y_start); try std.testing.expect(y_end > 0); } test "Atlas reset restores dirty tracking fields" { var atlas = try Atlas.init(std.testing.allocator, 256, 256); defer atlas.deinit(); // Simulate having uploaded some region atlas.last_uploaded_y = 50; atlas.needs_full_upload = false; atlas.reset(); try std.testing.expectEqual(@as(u32, 0), atlas.last_uploaded_y); try std.testing.expect(atlas.needs_full_upload); } ``` - [ ] **Step 2: Run test to verify it fails** Run: `zig build test 2>&1 | head -20` Expected: FAIL — `last_uploaded_y` field does not exist. 
- [ ] **Step 3: Add the fields to Atlas** In `src/font.zig`, add to the `Atlas` struct fields (after `dirty: bool`): ```zig last_uploaded_y: u32, needs_full_upload: bool, ``` In `Atlas.init` (the return struct literal), add: ```zig .last_uploaded_y = 0, .needs_full_upload = true, ``` In `Atlas.reset`, add at the end (after `self.dirty = true;`): ```zig self.last_uploaded_y = 0; self.needs_full_upload = true; ``` - [ ] **Step 4: Run test to verify it passes** Run: `zig build test 2>&1 | tail -5` Expected: PASS - [ ] **Step 5: Commit** ```bash git add src/font.zig git commit -m "Add dirty-region tracking fields to Atlas" ``` ### Task 2: Add persistent staging buffer and transfer fence to renderer **Files:** - Modify: `src/renderer.zig` - [ ] **Step 1: Add fields to Context struct** In `src/renderer.zig`, add three new fields to the `Context` struct after `atlas_height: u32`: ```zig // Persistent atlas staging buffer (reused across frames) atlas_staging_buffer: vk.Buffer, atlas_staging_memory: vk.DeviceMemory, // Dedicated transfer command buffer + fence atlas_transfer_cb: vk.CommandBuffer, atlas_transfer_fence: vk.Fence, ``` - [ ] **Step 2: Allocate resources in Context.init** In `Context.init`, after the atlas sampler creation and before the descriptor set update (around line 910), add: ```zig // --- Atlas staging buffer (persistent, reused across frames) --- const atlas_staging_size: vk.DeviceSize = @as(vk.DeviceSize, atlas_width) * atlas_height; const atlas_staging = try createHostVisibleBuffer(vki, pd_info.physical, vkd, device, atlas_staging_size, .{ .transfer_src_bit = true }); errdefer { vkd.destroyBuffer(device, atlas_staging.buffer, null); vkd.freeMemory(device, atlas_staging.memory, null); } // --- Dedicated atlas transfer command buffer --- var atlas_transfer_cb: vk.CommandBuffer = undefined; try vkd.allocateCommandBuffers(device, &vk.CommandBufferAllocateInfo{ .command_pool = command_pool, .level = .primary, .command_buffer_count = 1, }, 
@ptrCast(&atlas_transfer_cb)); // --- Atlas transfer fence (starts signaled so first wait is a no-op) --- const atlas_transfer_fence = try vkd.createFence(device, &vk.FenceCreateInfo{ .flags = .{ .signaled_bit = true }, }, null); errdefer vkd.destroyFence(device, atlas_transfer_fence, null); ``` - [ ] **Step 3: Add new fields to the return struct** In the return struct literal in `Context.init`, add after `.atlas_height = atlas_height`: ```zig .atlas_staging_buffer = atlas_staging.buffer, .atlas_staging_memory = atlas_staging.memory, .atlas_transfer_cb = atlas_transfer_cb, .atlas_transfer_fence = atlas_transfer_fence, ``` - [ ] **Step 4: Free resources in Context.deinit** In `Context.deinit`, add after the atlas memory free (after `self.vkd.freeMemory(self.device, self.atlas_memory, null);`): ```zig self.vkd.destroyBuffer(self.device, self.atlas_staging_buffer, null); self.vkd.freeMemory(self.device, self.atlas_staging_memory, null); self.vkd.destroyFence(self.device, self.atlas_transfer_fence, null); ``` - [ ] **Step 5: Verify it compiles** Run: `zig build 2>&1 | tail -5` Expected: BUILD SUCCESS - [ ] **Step 6: Run tests** Run: `zig build test 2>&1 | tail -5` Expected: PASS - [ ] **Step 7: Commit** ```bash git add src/renderer.zig git commit -m "Add persistent staging buffer and transfer fence to renderer" ``` ### Task 3: Implement uploadAtlasRegion **Files:** - Modify: `src/renderer.zig` - [ ] **Step 1: Add the uploadAtlasRegion method** Add after the existing `uploadAtlas` method in `Context`: ```zig /// Upload a horizontal band of the atlas (y_start..y_end) to the GPU. /// Uses the persistent staging buffer and dedicated transfer command buffer. /// If `full` is true, transitions from UNDEFINED (for initial/reset uploads). /// Otherwise transitions from SHADER_READ_ONLY (preserves existing data). 
pub fn uploadAtlasRegion( self: *Context, pixels: []const u8, y_start: u32, y_end: u32, full: bool, ) !void { if (y_start >= y_end) return; const byte_offset: usize = @as(usize, y_start) * self.atlas_width; const byte_len: usize = @as(usize, y_end - y_start) * self.atlas_width; // Wait for any prior atlas transfer to finish before reusing staging buffer _ = try self.vkd.waitForFences(self.device, 1, @ptrCast(&self.atlas_transfer_fence), .true, std.math.maxInt(u64)); try self.vkd.resetFences(self.device, 1, @ptrCast(&self.atlas_transfer_fence)); // Copy dirty band into staging buffer const mapped = try self.vkd.mapMemory(self.device, self.atlas_staging_memory, 0, @intCast(byte_len), .{}); @memcpy(@as([*]u8, @ptrCast(mapped))[0..byte_len], pixels[byte_offset .. byte_offset + byte_len]); self.vkd.unmapMemory(self.device, self.atlas_staging_memory); // Record transfer command try self.vkd.resetCommandBuffer(self.atlas_transfer_cb, .{}); try self.vkd.beginCommandBuffer(self.atlas_transfer_cb, &vk.CommandBufferBeginInfo{ .flags = .{ .one_time_submit_bit = true }, }); // Barrier: old_layout -> TRANSFER_DST const old_layout: vk.ImageLayout = if (full) .undefined else .shader_read_only_optimal; const barrier_to_transfer = vk.ImageMemoryBarrier{ .src_access_mask = if (full) @as(vk.AccessFlags, .{}) else .{ .shader_read_bit = true }, .dst_access_mask = .{ .transfer_write_bit = true }, .old_layout = old_layout, .new_layout = .transfer_dst_optimal, .src_queue_family_index = vk.QUEUE_FAMILY_IGNORED, .dst_queue_family_index = vk.QUEUE_FAMILY_IGNORED, .image = self.atlas_image, .subresource_range = .{ .aspect_mask = .{ .color_bit = true }, .base_mip_level = 0, .level_count = 1, .base_array_layer = 0, .layer_count = 1, }, }; const src_stage: vk.PipelineStageFlags = if (full) .{ .top_of_pipe_bit = true } else .{ .fragment_shader_bit = true }; self.vkd.cmdPipelineBarrier( self.atlas_transfer_cb, src_stage, .{ .transfer_bit = true }, .{}, 0, null, 0, null, 1, 
@ptrCast(&barrier_to_transfer), ); // Copy staging buffer -> image (dirty band only) const region = vk.BufferImageCopy{ .buffer_offset = 0, .buffer_row_length = 0, .buffer_image_height = 0, .image_subresource = .{ .aspect_mask = .{ .color_bit = true }, .mip_level = 0, .base_array_layer = 0, .layer_count = 1, }, .image_offset = .{ .x = 0, .y = @intCast(y_start), .z = 0 }, .image_extent = .{ .width = self.atlas_width, .height = y_end - y_start, .depth = 1 }, }; self.vkd.cmdCopyBufferToImage( self.atlas_transfer_cb, self.atlas_staging_buffer, self.atlas_image, .transfer_dst_optimal, 1, @ptrCast(®ion), ); // Barrier: TRANSFER_DST -> SHADER_READ_ONLY const barrier_to_shader = vk.ImageMemoryBarrier{ .src_access_mask = .{ .transfer_write_bit = true }, .dst_access_mask = .{ .shader_read_bit = true }, .old_layout = .transfer_dst_optimal, .new_layout = .shader_read_only_optimal, .src_queue_family_index = vk.QUEUE_FAMILY_IGNORED, .dst_queue_family_index = vk.QUEUE_FAMILY_IGNORED, .image = self.atlas_image, .subresource_range = .{ .aspect_mask = .{ .color_bit = true }, .base_mip_level = 0, .level_count = 1, .base_array_layer = 0, .layer_count = 1, }, }; self.vkd.cmdPipelineBarrier( self.atlas_transfer_cb, .{ .transfer_bit = true }, .{ .fragment_shader_bit = true }, .{}, 0, null, 0, null, 1, @ptrCast(&barrier_to_shader), ); try self.vkd.endCommandBuffer(self.atlas_transfer_cb); // Submit with dedicated fence (no queueWaitIdle) try self.vkd.queueSubmit(self.graphics_queue, 1, @ptrCast(&vk.SubmitInfo{ .command_buffer_count = 1, .p_command_buffers = @ptrCast(&self.atlas_transfer_cb), }), self.atlas_transfer_fence); } ``` - [ ] **Step 2: Verify it compiles** Run: `zig build 2>&1 | tail -5` Expected: BUILD SUCCESS - [ ] **Step 3: Run tests** Run: `zig build test 2>&1 | tail -5` Expected: PASS - [ ] **Step 4: Commit** ```bash git add src/renderer.zig git commit -m "Implement uploadAtlasRegion with incremental uploads" ``` ### Task 4: Add ASCII precompute and wire incremental upload 
into main.zig **Files:** - Modify: `src/main.zig` - [ ] **Step 1: Add ASCII precompute at startup** In `src/main.zig`, replace the block at lines 171-172: ```zig // Upload empty atlas first (so descriptor set is valid) try ctx.uploadAtlas(atlas.pixels); ``` With: ```zig // Precompute printable ASCII glyphs (32-126) into atlas for (32..127) |cp| { _ = atlas.getOrInsert(&face, @intCast(cp)) catch |err| switch (err) { error.AtlasFull => break, else => return err, }; } // Upload warm atlas (full upload — descriptor set needs valid data) try ctx.uploadAtlas(atlas.pixels); atlas.last_uploaded_y = atlas.cursor_y; atlas.needs_full_upload = false; atlas.dirty = false; ``` - [ ] **Step 2: Replace the render-loop atlas upload** In `src/main.zig`, replace the atlas upload block (lines 477-482): ```zig // Re-upload atlas if new glyphs were added if (atlas.dirty) { try ctx.uploadAtlas(atlas.pixels); atlas.dirty = false; render_cache.layout_dirty = true; } ``` With: ```zig // Re-upload atlas if new glyphs were added (incremental) if (atlas.dirty) { const y_start = atlas.last_uploaded_y; const y_end = atlas.cursor_y + atlas.row_height; if (y_start < y_end) { try ctx.uploadAtlasRegion( atlas.pixels, y_start, y_end, atlas.needs_full_upload, ); atlas.last_uploaded_y = atlas.cursor_y; atlas.needs_full_upload = false; render_cache.layout_dirty = true; } atlas.dirty = false; } ``` - [ ] **Step 3: Verify it compiles** Run: `zig build 2>&1 | tail -5` Expected: BUILD SUCCESS - [ ] **Step 4: Run tests** Run: `zig build test 2>&1 | tail -5` Expected: PASS - [ ] **Step 5: Commit** ```bash git add src/main.zig git commit -m "Wire ASCII precompute and incremental atlas upload" ``` ### Task 5: Full verification **Files:** - Test: `src/font.zig`, `src/renderer.zig`, `src/main.zig` - [ ] **Step 1: Run the full test suite** Run: `zig build test` Expected: PASS - [ ] **Step 2: Manual smoke test — normal run** Run: `zig build run` Expected: - Terminal opens and shows text correctly (precomputed ASCII 
atlas).
- Typing normal text works. Cursor renders.
- Exit dumps frame timing stats — atlas_upload should be 0 for most frames.

- [ ] **Step 3: Manual smoke test — Unicode character**

  Run inside terminal: `echo "★ ← → ★"`
  Expected: Characters render correctly (incremental upload fires for the first time these codepoints appear).

- [ ] **Step 4: Manual smoke test — bench comparison**

  Run: `make bench`
  Expected:
  - atlas_upload avg should drop significantly from the baseline ~1700us.
  - Steady-state frames should show atlas_upload near 0.

- [ ] **Step 5: Commit if any fixups were needed**

  ```bash
  git add src/font.zig src/renderer.zig src/main.zig
  git commit -m "Fix verification issues for incremental atlas upload"
  ```

## Self-Review

- **Spec coverage:**
  - `last_uploaded_y` + `needs_full_upload` fields: Task 1
  - `reset()` sets both fields: Task 1
  - Persistent staging buffer: Task 2
  - Transfer fence (starts signaled): Task 2
  - `uploadAtlasRegion` with partial copy: Task 3
  - Layout transition: `UNDEFINED` vs `SHADER_READ_ONLY` based on `full` flag: Task 3
  - Post-copy barrier back to `SHADER_READ_ONLY`: Task 3
  - Fence wait before reusing staging buffer: Task 3
  - No `queueWaitIdle`: Task 3
  - ASCII precompute (32-126): Task 4
  - Render-loop incremental wiring with `y_start < y_end` guard: Task 4
  - `last_uploaded_y = cursor_y` (not `cursor_y + row_height`): Task 4
  - Bench comparison: Task 5
- **Placeholder scan:** No TBD/TODO markers. All code blocks are complete.
- **Type consistency:**
  - `Atlas.last_uploaded_y` and `Atlas.needs_full_upload` defined in Task 1, used in Task 4
  - `Context.atlas_staging_buffer`, `atlas_staging_memory`, `atlas_transfer_cb`, `atlas_transfer_fence` defined in Task 2, used in Task 3
  - `uploadAtlasRegion(pixels, y_start, y_end, full)` defined in Task 3, called in Task 4
  - Existing `uploadAtlas` kept unchanged — used for initial full upload in Task 4

diff --git a/docs/superpowers/plans/2026-04-10-performance-benchmarking-implementation.md b/docs/superpowers/plans/2026-04-10-performance-benchmarking-implementation.md
new file mode 100644
index 0000000..fcc5b99
--- /dev/null
+++ b/docs/superpowers/plans/2026-04-10-performance-benchmarking-implementation.md
@@ -0,0 +1,711 @@
# Performance Benchmarking Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Add per-section frame timing instrumentation, a reproducible bench workload, and a perf/flamegraph target so we can measure responsiveness before and after fixing known bottlenecks.

**Architecture:** A 256-entry ring buffer of `FrameTiming` structs records microsecond timings for five render-loop sections. Stats are dumped to stderr on SIGUSR1 and clean exit. A `WAYSTTY_BENCH=1` env var swaps the user's shell for a fixed workload script. A `Makefile` provides `bench` and `profile` targets.
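The ring buffer's fill-then-overwrite-at-head behavior can be modeled in a few lines. This Python sketch (illustrative — the plan implements the real structure in Zig) mirrors the push/wrap semantics that the Task 1 tests assert:

```python
# Model of the 256-entry FrameTimingRing: the buffer fills slots
# 0..capacity-1 in order, then overwrites at `head` and advances it.
CAPACITY = 256

class TimingRing:
    def __init__(self):
        self.entries = [None] * CAPACITY
        self.head = 0    # oldest entry once the ring has wrapped
        self.count = 0   # number of valid entries, saturates at CAPACITY

    def push(self, timing):
        idx = self.count if self.count < CAPACITY else self.head
        self.entries[idx] = timing
        if self.count < CAPACITY:
            self.count += 1
        else:
            self.head = (self.head + 1) % CAPACITY

    def ordered(self):
        """Valid entries oldest-first, accounting for wrap."""
        if self.count < CAPACITY:
            return self.entries[:self.count]
        return self.entries[self.head:] + self.entries[:self.head]

ring = TimingRing()
for i in range(CAPACITY + 3):  # wrap three slots past capacity
    ring.push(i)
print(ring.ordered()[0], ring.ordered()[-1])  # 3 258
```

After wrapping, the three oldest frames (0, 1, 2) have been overwritten, so the ordered view starts at 3 and ends at the newest frame — exactly what the `orderedSlice` test in Task 1 checks.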
**Tech Stack:** Zig 0.15, `std.time.Timer`, `std.posix.sigaction`, `perf record`, `flamegraph.pl`/`stackcollapse-perf.pl` --- ## File Structure - Modify: `src/main.zig` - `FrameTiming` struct, `FrameTimingRing` ring buffer, `computeStats` helper, `formatStats` printer - SIGUSR1 signal handler that sets an atomic flag - Section timers wrapping each render-loop phase - `WAYSTTY_BENCH` env var check in the shell-selection block - Stats dump on clean exit - Create: `Makefile` - `bench` target: build + run with `WAYSTTY_BENCH=1`, extract stats from stderr - `profile` target: build ReleaseSafe + `perf record` + flamegraph generation ### Task 1: Add FrameTiming struct and ring buffer with tests **Files:** - Modify: `src/main.zig` - Test: `src/main.zig` - [ ] **Step 1: Write the failing tests** Add at the bottom of `src/main.zig`, after the existing test blocks: ```zig test "FrameTiming.total sums all sections" { const ft: FrameTiming = .{ .snapshot_us = 10, .row_rebuild_us = 20, .atlas_upload_us = 30, .instance_upload_us = 40, .gpu_submit_us = 50, }; try std.testing.expectEqual(@as(u32, 150), ft.total()); } test "FrameTimingRing records and wraps correctly" { var ring = FrameTimingRing{}; try std.testing.expectEqual(@as(usize, 0), ring.count); ring.push(.{ .snapshot_us = 1, .row_rebuild_us = 2, .atlas_upload_us = 3, .instance_upload_us = 4, .gpu_submit_us = 5 }); try std.testing.expectEqual(@as(usize, 1), ring.count); try std.testing.expectEqual(@as(u32, 1), ring.entries[0].snapshot_us); // Fill to capacity for (1..FrameTimingRing.capacity) |i| { ring.push(.{ .snapshot_us = @intCast(i + 1), .row_rebuild_us = 0, .atlas_upload_us = 0, .instance_upload_us = 0, .gpu_submit_us = 0 }); } try std.testing.expectEqual(FrameTimingRing.capacity, ring.count); // One more wraps around — overwrites entries[0], head advances to 1 ring.push(.{ .snapshot_us = 999, .row_rebuild_us = 0, .atlas_upload_us = 0, .instance_upload_us = 0, .gpu_submit_us = 0 }); try 
std.testing.expectEqual(FrameTimingRing.capacity, ring.count); // Newest entry is at (head + capacity - 1) % capacity = 0 try std.testing.expectEqual(@as(u32, 999), ring.entries[0].snapshot_us); // head has advanced past the overwritten slot try std.testing.expectEqual(@as(usize, 1), ring.head); } test "FrameTimingRing.orderedSlice returns entries in insertion order after wrap" { var ring = FrameTimingRing{}; // Push capacity + 3 entries so the ring wraps for (0..FrameTimingRing.capacity + 3) |i| { ring.push(.{ .snapshot_us = @intCast(i), .row_rebuild_us = 0, .atlas_upload_us = 0, .instance_upload_us = 0, .gpu_submit_us = 0 }); } var buf: [FrameTimingRing.capacity]FrameTiming = undefined; const ordered = ring.orderedSlice(&buf); try std.testing.expectEqual(FrameTimingRing.capacity, ordered.len); // First entry should be the 4th pushed (index 3), last should be capacity+2 try std.testing.expectEqual(@as(u32, 3), ordered[0].snapshot_us); try std.testing.expectEqual(@as(u32, FrameTimingRing.capacity + 2), ordered[ordered.len - 1].snapshot_us); } ``` - [ ] **Step 2: Run test to verify it fails** Run: `zig build test 2>&1 | head -20` Expected: FAIL with `FrameTiming` not found. 
- [ ] **Step 3: Implement FrameTiming and FrameTimingRing** Add above the test blocks in `src/main.zig`: ```zig const FrameTiming = struct { snapshot_us: u32 = 0, row_rebuild_us: u32 = 0, atlas_upload_us: u32 = 0, instance_upload_us: u32 = 0, gpu_submit_us: u32 = 0, fn total(self: FrameTiming) u32 { return self.snapshot_us + self.row_rebuild_us + self.atlas_upload_us + self.instance_upload_us + self.gpu_submit_us; } }; const FrameTimingRing = struct { const capacity = 256; entries: [capacity]FrameTiming = [_]FrameTiming{.{}} ** capacity, head: usize = 0, count: usize = 0, fn push(self: *FrameTimingRing, timing: FrameTiming) void { const idx = if (self.count < capacity) self.count else self.head; self.entries[idx] = timing; if (self.count < capacity) { self.count += 1; } else { self.head = (self.head + 1) % capacity; } } /// Return a slice of valid entries in insertion order. /// Caller must provide a scratch buffer of `capacity` entries. fn orderedSlice(self: *const FrameTimingRing, buf: *[capacity]FrameTiming) []const FrameTiming { if (self.count < capacity) { return self.entries[0..self.count]; } // Ring has wrapped — copy from head..end then 0..head const tail_len = capacity - self.head; @memcpy(buf[0..tail_len], self.entries[self.head..capacity]); @memcpy(buf[tail_len..capacity], self.entries[0..self.head]); return buf[0..capacity]; } }; ``` - [ ] **Step 4: Run test to verify it passes** Run: `zig build test 2>&1 | tail -5` Expected: PASS - [ ] **Step 5: Commit** ```bash git add src/main.zig git commit -m "Add FrameTiming struct and ring buffer" ``` ### Task 2: Add stats computation and formatting with tests **Files:** - Modify: `src/main.zig` - Test: `src/main.zig` - [ ] **Step 1: Write the failing tests** Add after the Task 1 tests in `src/main.zig`: ```zig test "FrameTimingStats computes min/avg/p99/max correctly" { var ring = FrameTimingRing{}; // Push 100 frames with snapshot_us = 1..100 for (0..100) |i| { ring.push(.{ .snapshot_us = @intCast(i + 1), 
.row_rebuild_us = 0, .atlas_upload_us = 0, .instance_upload_us = 0, .gpu_submit_us = 0, }); } const stats = computeFrameStats(&ring); try std.testing.expectEqual(@as(u32, 1), stats.snapshot.min); try std.testing.expectEqual(@as(u32, 100), stats.snapshot.max); try std.testing.expectEqual(@as(u32, 50), stats.snapshot.avg); // p99 of 1..100 = value at index 98 (0-based) = 99 try std.testing.expectEqual(@as(u32, 99), stats.snapshot.p99); try std.testing.expectEqual(@as(usize, 100), stats.frame_count); } test "FrameTimingStats handles empty ring" { var ring = FrameTimingRing{}; const stats = computeFrameStats(&ring); try std.testing.expectEqual(@as(usize, 0), stats.frame_count); try std.testing.expectEqual(@as(u32, 0), stats.snapshot.min); } ``` - [ ] **Step 2: Run test to verify it fails** Run: `zig build test 2>&1 | head -20` Expected: FAIL with `computeFrameStats` not found. - [ ] **Step 3: Implement stats computation and formatting** Add after the `FrameTimingRing` definition: ```zig const SectionStats = struct { min: u32 = 0, avg: u32 = 0, p99: u32 = 0, max: u32 = 0, }; const FrameTimingStats = struct { snapshot: SectionStats = .{}, row_rebuild: SectionStats = .{}, atlas_upload: SectionStats = .{}, instance_upload: SectionStats = .{}, gpu_submit: SectionStats = .{}, total: SectionStats = .{}, frame_count: usize = 0, }; fn computeSectionStats(values: []u32) SectionStats { if (values.len == 0) return .{}; std.mem.sort(u32, values, {}, std.sort.asc(u32)); var sum: u64 = 0; for (values) |v| sum += v; const p99_idx = if (values.len <= 1) 0 else ((values.len - 1) * 99) / 100; return .{ .min = values[0], .avg = @intCast(sum / values.len), .p99 = values[p99_idx], .max = values[values.len - 1], }; } fn computeFrameStats(ring: *const FrameTimingRing) FrameTimingStats { if (ring.count == 0) return .{}; var ordered_buf: [FrameTimingRing.capacity]FrameTiming = undefined; const entries = ring.orderedSlice(&ordered_buf); const n = entries.len; var snapshot_vals: 
[FrameTimingRing.capacity]u32 = undefined; var row_rebuild_vals: [FrameTimingRing.capacity]u32 = undefined; var atlas_upload_vals: [FrameTimingRing.capacity]u32 = undefined; var instance_upload_vals: [FrameTimingRing.capacity]u32 = undefined; var gpu_submit_vals: [FrameTimingRing.capacity]u32 = undefined; var total_vals: [FrameTimingRing.capacity]u32 = undefined; for (entries, 0..) |e, i| { snapshot_vals[i] = e.snapshot_us; row_rebuild_vals[i] = e.row_rebuild_us; atlas_upload_vals[i] = e.atlas_upload_us; instance_upload_vals[i] = e.instance_upload_us; gpu_submit_vals[i] = e.gpu_submit_us; total_vals[i] = e.total(); } return .{ .snapshot = computeSectionStats(snapshot_vals[0..n]), .row_rebuild = computeSectionStats(row_rebuild_vals[0..n]), .atlas_upload = computeSectionStats(atlas_upload_vals[0..n]), .instance_upload = computeSectionStats(instance_upload_vals[0..n]), .gpu_submit = computeSectionStats(gpu_submit_vals[0..n]), .total = computeSectionStats(total_vals[0..n]), .frame_count = n, }; } fn printFrameStats(stats: FrameTimingStats) void { const stderr = std.io.getStdErr().writer(); stderr.print( \\ \\=== waystty frame timing ({d} frames) === \\{s:<20}{s:>6}{s:>6}{s:>6}{s:>6} (us) \\{s:<20}{d:>6}{d:>6}{d:>6}{d:>6} \\{s:<20}{d:>6}{d:>6}{d:>6}{d:>6} \\{s:<20}{d:>6}{d:>6}{d:>6}{d:>6} \\{s:<20}{d:>6}{d:>6}{d:>6}{d:>6} \\{s:<20}{d:>6}{d:>6}{d:>6}{d:>6} \\---------------------------------------------------- \\{s:<20}{d:>6}{d:>6}{d:>6}{d:>6} \\ , .{ stats.frame_count, "section", "min", "avg", "p99", "max", "snapshot", stats.snapshot.min, stats.snapshot.avg, stats.snapshot.p99, stats.snapshot.max, "row_rebuild", stats.row_rebuild.min, stats.row_rebuild.avg, stats.row_rebuild.p99, stats.row_rebuild.max, "atlas_upload", stats.atlas_upload.min, stats.atlas_upload.avg, stats.atlas_upload.p99, stats.atlas_upload.max, "instance_upload", stats.instance_upload.min, stats.instance_upload.avg, stats.instance_upload.p99, stats.instance_upload.max, "gpu_submit", 
stats.gpu_submit.min, stats.gpu_submit.avg, stats.gpu_submit.p99, stats.gpu_submit.max, "total", stats.total.min, stats.total.avg, stats.total.p99, stats.total.max, }) catch |err| { std.log.debug("failed to print frame stats: {}", .{err}); }; } ``` - [ ] **Step 4: Run test to verify it passes** Run: `zig build test 2>&1 | tail -5` Expected: PASS - [ ] **Step 5: Commit** ```bash git add src/main.zig git commit -m "Add frame timing stats computation and formatting" ``` ### Task 3: Add SIGUSR1 signal handler **Files:** - Modify: `src/main.zig` - [ ] **Step 1: Add the signal flag and handler** Add below the `FrameTimingRing` and stats code in `src/main.zig`: ```zig var sigusr1_received: std.atomic.Value(bool) = std.atomic.Value(bool).init(false); fn sigusr1Handler(_: c_int) callconv(.c) void { sigusr1_received.store(true, .release); } fn installSigusr1Handler() void { const act = std.posix.Sigaction{ .handler = .{ .handler = sigusr1Handler }, .mask = std.posix.sigemptyset(), .flags = .{ .RESTART = true }, }; std.posix.sigaction(std.posix.SIG.USR1, &act, null); } ``` - [ ] **Step 2: Wire into runTerminal** In `runTerminal`, right before the `// === main loop ===` comment (line 205), add: ```zig // === frame timing === var frame_ring = FrameTimingRing{}; installSigusr1Handler(); ``` Inside the main loop, right after `clearConsumedDirtyFlags` (line 534), add: ```zig // Check for SIGUSR1 stats dump request if (sigusr1_received.swap(false, .acq_rel)) { printFrameStats(computeFrameStats(&frame_ring)); } ``` Right after the main loop (after the `while` block ends, before `_ = try ctx.vkd.deviceWaitIdle`), add: ```zig // Dump timing stats on exit printFrameStats(computeFrameStats(&frame_ring)); ``` - [ ] **Step 3: Verify it compiles** Run: `zig build 2>&1 | tail -5` Expected: BUILD SUCCESS (no test run needed — signal handling is not unit-testable) - [ ] **Step 4: Commit** ```bash git add src/main.zig git commit -m "Add SIGUSR1 handler for frame timing stats dump" ``` ### Task 
4: Wire section timers into the render loop **Files:** - Modify: `src/main.zig` This task wraps each render-loop section with `std.time.Timer` and pushes a `FrameTiming` entry after each rendered frame. - [ ] **Step 1: Add timer helper** Add near the other helper functions in `src/main.zig`: ```zig fn usFromTimer(timer: std.time.Timer) u32 { const ns = timer.read(); const us = ns / std.time.ns_per_us; return std.math.cast(u32, us) orelse std.math.maxInt(u32); } ``` - [ ] **Step 2: Instrument the render loop** In `runTerminal`, replace the render section. The existing code between `// === render ===` (line 357) and `clearConsumedDirtyFlags` (line 534) gets wrapped with timers. Add a `var frame_timing: FrameTiming = .{};` before `// === render ===` and instrument each section: **snapshot section** — wrap `try term.snapshot();` (line 359): ```zig var frame_timing: FrameTiming = .{}; // === render === const previous_cursor = term.render_state.cursor; var section_timer = std.time.Timer.start() catch unreachable; try term.snapshot(); frame_timing.snapshot_us = usFromTimer(section_timer); ``` **row_rebuild section** — wrap the dirty-row rebuild loop (the `var rows_rebuilt` through cursor rebuild blocks): ```zig section_timer = std.time.Timer.start() catch unreachable; ``` Right before `// Re-upload atlas if new glyphs were added` (line 452): ```zig frame_timing.row_rebuild_us = usFromTimer(section_timer); ``` **atlas_upload section** — wrap the atlas upload block: ```zig section_timer = std.time.Timer.start() catch unreachable; // Re-upload atlas if new glyphs were added if (atlas.dirty) { try ctx.uploadAtlas(atlas.pixels); atlas.dirty = false; render_cache.layout_dirty = true; } frame_timing.atlas_upload_us = usFromTimer(section_timer); ``` **instance_upload section** — wrap the upload plan + upload blocks: ```zig section_timer = std.time.Timer.start() catch unreachable; ``` Right before `const baseline_coverage = renderer.coverageVariantParams(.baseline);` (line 517): 
```zig frame_timing.instance_upload_us = usFromTimer(section_timer); ``` **gpu_submit section** — wrap `ctx.drawCells(...)`: ```zig section_timer = std.time.Timer.start() catch unreachable; const baseline_coverage = renderer.coverageVariantParams(.baseline); ctx.drawCells( render_cache.total_instance_count, .{ @floatFromInt(cell_w), @floatFromInt(cell_h) }, default_bg, baseline_coverage, ) catch |err| switch (err) { error.OutOfDateKHR => { _ = try ctx.vkd.deviceWaitIdle(ctx.device); const buf_w = window.width * @as(u32, @intCast(geom.buffer_scale)); const buf_h = window.height * @as(u32, @intCast(geom.buffer_scale)); try ctx.recreateSwapchain(buf_w, buf_h); render_pending = true; continue; }, else => return err, }; frame_timing.gpu_submit_us = usFromTimer(section_timer); ``` **Push timing entry** — right after the gpu_submit timer read, before `clearConsumedDirtyFlags`: ```zig frame_ring.push(frame_timing); ``` - [ ] **Step 3: Verify it compiles** Run: `zig build 2>&1 | tail -5` Expected: BUILD SUCCESS - [ ] **Step 4: Run tests to verify nothing broke** Run: `zig build test 2>&1 | tail -5` Expected: PASS - [ ] **Step 5: Commit** ```bash git add src/main.zig git commit -m "Instrument render loop with per-section frame timers" ``` ### Task 5: Add WAYSTTY_BENCH shell override **Files:** - Modify: `src/main.zig` - [ ] **Step 1: Replace the shell selection block** In `runTerminal`, the current shell selection code (lines 190-194) is: ```zig const shell: [:0]const u8 = blk: { const shell_env = std.posix.getenv("SHELL") orelse "/bin/sh"; break :blk try alloc.dupeZ(u8, shell_env); }; defer alloc.free(shell); ``` Replace it with: ```zig const shell: [:0]const u8 = blk: { if (std.posix.getenv("WAYSTTY_BENCH") != null) { break :blk try alloc.dupeZ(u8, "/bin/sh"); } const shell_env = std.posix.getenv("SHELL") orelse "/bin/sh"; break :blk try alloc.dupeZ(u8, shell_env); }; defer alloc.free(shell); const bench_script: ?[:0]const u8 = if (std.posix.getenv("WAYSTTY_BENCH") != 
null) "echo warmup; sleep 0.2; seq 1 50000; find /usr/lib -name '*.so' 2>/dev/null | head -500; yes 'hello world' | head -2000; exit 0" else null; ``` - [ ] **Step 2: Pass bench script as shell arg when set** Replace the `pty.Pty.spawn` call (line 196) with: ```zig var p = try pty.Pty.spawn(.{ .cols = cols, .rows = rows, .shell = shell, .shell_args = if (bench_script) |script| &.{ "-c", script } else null, }); ``` - [ ] **Step 3: Update pty.zig to accept shell_args** In `src/pty.zig`, modify the `SpawnOptions` struct (line 18) to add `shell_args`: ```zig pub const SpawnOptions = struct { cols: u16, rows: u16, shell: [:0]const u8, shell_args: ?[]const [:0]const u8 = null, }; ``` In the `spawn` function, replace the `argv` construction (line 40) with: ```zig if (opts.shell_args) |args| { std.debug.assert(args.len < 15); // argv[0] = shell, must fit in 16-slot buffer var argv_buf: [16:null]?[*:0]const u8 = .{null} ** 16; argv_buf[0] = opts.shell.ptr; for (args, 1..) |arg, i| { argv_buf[i] = arg.ptr; } std.posix.execveZ(opts.shell.ptr, &argv_buf, std.c.environ) catch {}; } else { var argv = [_:null]?[*:0]const u8{ opts.shell.ptr, null }; std.posix.execveZ(opts.shell.ptr, &argv, std.c.environ) catch {}; } ``` - [ ] **Step 4: Verify it compiles** Run: `zig build 2>&1 | tail -5` Expected: BUILD SUCCESS - [ ] **Step 5: Run tests** Run: `zig build test 2>&1 | tail -5` Expected: PASS - [ ] **Step 6: Commit** ```bash git add src/main.zig src/pty.zig git commit -m "Add WAYSTTY_BENCH env var for bench workload" ``` ### Task 6: Create Makefile with bench and profile targets **Files:** - Create: `Makefile` - [ ] **Step 1: Create the Makefile** Create `Makefile` in the project root: ```makefile ZIG ?= zig FLAMEGRAPH ?= flamegraph.pl STACKCOLLAPSE ?= stackcollapse-perf.pl .PHONY: build run test bench profile clean build: $(ZIG) build run: build $(ZIG) build run test: $(ZIG) build test zig-out/bin/waystty: $(wildcard src/*.zig) $(wildcard shaders/*) $(ZIG) build bench: 
zig-out/bin/waystty WAYSTTY_BENCH=1 ./zig-out/bin/waystty 2>bench.log || true @echo "--- frame timing ---" @grep -A 12 "waystty frame timing" bench.log || echo "(no timing data found)" profile: $(ZIG) build -Doptimize=ReleaseSafe perf record -g -F 999 --no-inherit -o perf.data -- \ sh -c 'WAYSTTY_BENCH=1 ./zig-out/bin/waystty 2>bench.log' perf script -i perf.data \ | $(STACKCOLLAPSE) \ | $(FLAMEGRAPH) > flamegraph.svg @echo "--- frame timing ---" @grep -A 12 "waystty frame timing" bench.log || echo "(no timing data found)" xdg-open flamegraph.svg clean: rm -rf zig-out .zig-cache perf.data bench.log flamegraph.svg ``` - [ ] **Step 2: Verify bench target syntax** Run: `make -n bench` Expected: prints the commands that would run (dry run), no syntax errors. - [ ] **Step 3: Verify profile target syntax** Run: `make -n profile` Expected: prints the commands that would run (dry run), no syntax errors. - [ ] **Step 4: Commit** ```bash git add Makefile git commit -m "Add Makefile with bench and profile targets" ``` ### Task 7: Full verification **Files:** - Test: `src/main.zig`, `src/pty.zig` - [ ] **Step 1: Run the full test suite** Run: `zig build test` Expected: PASS - [ ] **Step 2: Manual smoke test — normal run** Run: `zig build run` Expected: - Terminal opens and works normally. - On Ctrl+D / exit, frame timing stats print to stderr. - [ ] **Step 3: Manual smoke test — SIGUSR1** In one terminal: `zig build run` In another terminal: `kill -USR1 $(pgrep waystty)` Expected: frame timing stats print to stderr of the running waystty. - [ ] **Step 4: Manual smoke test — bench** Run: `make bench` Expected: - waystty opens, runs the bench workloads, exits. - `bench.log` contains frame timing stats. - Stats are printed to the console. 
- [ ] **Step 5: Commit if any fixups were needed**

```bash
git add src/main.zig src/pty.zig Makefile
git commit -m "Fix verification issues for performance benchmarking"
```

## Self-Review

- **Spec coverage:**
  - Ring buffer: Task 1
  - Stats computation (min/avg/p99/max): Task 2
  - SIGUSR1 trigger: Task 3
  - Section timers: Task 4
  - WAYSTTY_BENCH shell override: Task 5
  - Makefile bench target: Task 6
  - Makefile profile target: Task 6
  - Clean exit stats dump: Task 3
- **Placeholder scan:** No TBD/TODO markers. All code blocks are complete.
- **Type consistency:**
  - `FrameTiming` defined in Task 1, used in Tasks 2-4
  - `FrameTimingRing` defined in Task 1, used in Tasks 2-4
  - `computeFrameStats` defined in Task 2, called in Task 3
  - `printFrameStats` defined in Task 2, called in Task 3
  - `usFromTimer` defined in Task 4, used in Task 4
  - `SpawnOptions.shell_args` added in Task 5, used in Task 5
  - `sigusr1_received` and `installSigusr1Handler` defined in Task 3, used in Tasks 3-4

diff --git a/docs/superpowers/specs/2026-04-10-incremental-atlas-upload-design.md b/docs/superpowers/specs/2026-04-10-incremental-atlas-upload-design.md
new file mode 100644
index 0000000..99183c6
--- /dev/null
+++ b/docs/superpowers/specs/2026-04-10-incremental-atlas-upload-design.md
@@ -0,0 +1,145 @@
# Incremental Atlas Upload Design

## Goal

Reduce atlas upload cost from full-texture re-upload (~1.7ms avg, 3.6ms peak) to near-zero for steady-state frames by uploading only new glyph rows and precomputing the common ASCII set at startup.

## Current Problem

Every time a new glyph is rasterized into the atlas, `uploadAtlas` re-uploads the entire atlas texture (1024x1024 = 1MB, fixed regardless of DPI) through a freshly allocated staging buffer, transitions the image layout from `UNDEFINED` (discarding GPU cache), and calls `queueWaitIdle` (CPU stall). Bench data shows this is 61% of average frame time.

## Two Complementary Changes

### 1. Atlas precomputation

Rasterize printable ASCII (codepoints 32–126, 95 characters) into the atlas at startup, before the first frame renders. Do a single full upload of the warm atlas. This eliminates the cold-start spike entirely — most terminal content uses only these characters.

### 2. Incremental upload

For glyphs added after startup (Unicode, CJK, symbols), upload only the new rows instead of the entire texture.

## Dirty-Region Tracking

Add two fields to the `Atlas` struct:

- `last_uploaded_y: u32` — initialized to 0. Rows above this y are known-good on the GPU.
- `needs_full_upload: bool` — set to `true` by `init()` and `reset()`; cleared after a full upload completes.

The dirty region is always a horizontal band spanning the full atlas width:

- `y_start` = `last_uploaded_y`
- `y_end` = `cursor_y + row_height`

After a successful upload, set `last_uploaded_y = cursor_y` (NOT `cursor_y + row_height`). This ensures the in-progress row is always re-uploaded on the next frame if new glyphs are added to it at new X positions. The cost of re-uploading one row (~20KB for a 20px row in a 1024-wide atlas) is negligible. Once the packing cursor wraps to a new row, `cursor_y` advances past the previously uploaded row, and those rows are never re-uploaded again.

On `reset()` (DPI/scale change), set `last_uploaded_y = 0` and `needs_full_upload = true`.

If `y_start == y_end`, skip the upload and clear `atlas.dirty` — no pixels actually changed.

## Renderer Changes

Replace `uploadAtlas(pixels)` with `uploadAtlasRegion(pixels, y_start, y_end, full)`:

### Persistent staging buffer

Allocate once at `Context.init`, sized to hold the full atlas (1024x1024 = 1MB, fixed regardless of DPI). Reuse across frames. Free at `Context.deinit`. No per-frame alloc/free.

### Partial staging copy

Only copy the dirty band of pixels into the staging buffer. Byte range: `y_start * atlas_width` to `y_end * atlas_width` (one byte per pixel).
### Layout transition preserves contents

- Incremental upload: `SHADER_READ_ONLY_OPTIMAL → TRANSFER_DST_OPTIMAL` (preserves existing GPU data)
- Full upload (after reset or initial): `UNDEFINED → TRANSFER_DST_OPTIMAL` (discards, no preservation needed)

The `needs_full_upload` flag controls which transition is used.

### Post-copy barrier

After the `BufferImageCopy`, transition back: `TRANSFER_DST_OPTIMAL → SHADER_READ_ONLY_OPTIMAL`. This is required for both full and incremental uploads (same as the existing code).

### Partial image copy

The `BufferImageCopy` region targets only the dirty rows:

- `image_offset = { .x = 0, .y = y_start, .z = 0 }`
- `image_extent = { .width = atlas_width, .height = y_end - y_start, .depth = 1 }`

### Remove queueWaitIdle

Replace with a dedicated transfer fence. At the start of `uploadAtlasRegion`, if a prior transfer fence is unsignaled, wait on it before writing to the staging buffer or re-recording the command buffer. This prevents corruption if two uploads happen in consecutive frames. Submit the transfer command with the fence attached; the GPU signals the fence when the transfer completes. This is still a win over `queueWaitIdle` because the fence only waits on the single transfer command, not the entire graphics queue.

## Caller-Side Wiring (main.zig)

### Startup precompute

After `Atlas.init` and before the main loop, rasterize codepoints 32–126 into the atlas, then do a single full upload via the existing `uploadAtlas` path.
### Render loop Replace: ```zig if (atlas.dirty) { try ctx.uploadAtlas(atlas.pixels); atlas.dirty = false; render_cache.layout_dirty = true; } ``` With: ```zig if (atlas.dirty) { const y_start = atlas.last_uploaded_y; const y_end = atlas.cursor_y + atlas.row_height; if (y_start < y_end) { try ctx.uploadAtlasRegion( atlas.pixels, y_start, y_end, atlas.needs_full_upload, ); atlas.last_uploaded_y = atlas.cursor_y; atlas.needs_full_upload = false; render_cache.layout_dirty = true; } atlas.dirty = false; } ``` ## Files Changed - `src/font.zig` — add `last_uploaded_y` and `needs_full_upload` fields to `Atlas`, reset them in `reset()` - `src/renderer.zig` — add persistent staging buffer, `uploadAtlasRegion` method, dedicated transfer fence and command buffer - `src/main.zig` — startup precompute loop, render-loop wiring change ## Testing ### Unit tests (font.zig) - `last_uploaded_y` starts at 0 and `needs_full_upload` starts `true` after `init()` - After inserting a glyph, dirty region is `0..cursor_y + row_height` - After `reset()`, `last_uploaded_y` resets to 0 and `needs_full_upload` is `true` ### Unit tests (renderer.zig) - `uploadAtlasRegion` byte offset/length calculation: `y_start * width` to `y_end * width` - Full-upload flag selects `UNDEFINED` vs `SHADER_READ_ONLY` as the old layout ### Manual smoke tests - Startup shows text correctly (precomputed atlas works) - Typing a rare Unicode character (`echo "★"`) renders correctly (incremental upload works) - DPI change still works (full re-upload after reset) - `make bench` shows atlas_upload dropping from ~1700us to near-zero steady state ## Future Consideration Precomputing box-drawing (U+2500–U+257F) and block element (U+2580–U+259F) characters would improve first-render for TUI apps like tmux, htop, and tree. Not needed for this phase — the incremental upload handles them correctly on first appearance. 
## Non-Goals - Atlas resizing (atlas is fixed at 1024x1024 regardless of DPI, returns `AtlasFull` error if exhausted) - Double-buffered atlas images (overkill for a terminal) - Async transfer queue (single queue is sufficient) diff --git a/docs/superpowers/specs/2026-04-10-performance-benchmarking-design.md b/docs/superpowers/specs/2026-04-10-performance-benchmarking-design.md new file mode 100644 index 0000000..c86deec --- /dev/null +++ b/docs/superpowers/specs/2026-04-10-performance-benchmarking-design.md @@ -0,0 +1,140 @@ # Performance Benchmarking Design ## Goal Establish a reproducible performance baseline for waystty before tackling known bottlenecks. The primary metric is responsiveness under real workloads — not synthetic throughput scores. ## Non-goals - vtebench integration (rewards batching, doesn't measure latency) - tracy GPU profiling (GPU draw cost is negligible for a terminal; CPU-side bottlenecks dominate) - Input-to-display latency measurement (out of scope for this phase) ## Known bottlenecks (to be measured, then fixed) 1. Atlas full re-upload on any new glyph — entire atlas through staging buffer + `queueWaitIdle` stall 2. Instance buffer map/unmap on every frame — host-visible memory can stay persistently mapped 3. Atlas staging buffer allocated/freed on every upload — should be persistent 4. Atlas image layout transitions from `UNDEFINED` — should go `SHADER_READ_ONLY → TRANSFER_DST → SHADER_READ_ONLY` for incremental updates ## Module 1: Frame timing ring buffer ### Instrumented sections Five sections timed with `std.time.Timer` on every rendered frame: | Section | What it covers | |---|---| | `snapshot` | `term.snapshot()` | | `row_rebuild` | refresh planning + dirty-row rebuild + cursor rebuild | | `atlas_upload` | `ctx.uploadAtlas(...)` — zero when atlas is not dirty | | `instance_upload` | `uploadInstances` / `uploadInstanceRange` | | `gpu_submit` | fence wait + image acquire + command record + submit + present. 
Note: the fence wait blocks on the *previous* frame's GPU work, so this section includes GPU execution time of frame N-1. This is correct for latency measurement (actual wall-clock cost of this phase). |

Idle frames (no render) are not recorded.

### Data structure

256-entry ring buffer of `FrameTiming` structs in `src/main.zig`. All fields are `u32` microseconds, so the buffer is ~5KB total (256 × 20 bytes). Always compiled in — timer reads are negligible overhead.

```zig
const FrameTiming = struct {
    snapshot_us: u32 = 0,
    row_rebuild_us: u32 = 0,
    atlas_upload_us: u32 = 0,
    instance_upload_us: u32 = 0,
    gpu_submit_us: u32 = 0,
};
```

### Stats output

Triggered on SIGUSR1 and on clean exit. Prints to stderr:

```
=== waystty frame timing (243 frames) ===
section          min   avg    p99    max (µs)
snapshot           2     4     15     89
row_rebuild        1    12    124    890
atlas_upload       0   180   5200   8100
instance_upload    1     6     24     71
gpu_submit         3     8     35    210
─────────────────────────────────────────
total              9   210   5400   8800
```

p99 is computed by sorting a copy of the recorded values (up to 256) per section; never-written entries are excluded so an under-full ring does not skew the percentile toward zero.

## Module 2: Bench workload

### Mechanism

When the `WAYSTTY_BENCH=1` env var is set at startup, spawn `sh -c '<bench script>'` instead of `$SHELL`. Stats are dumped to stderr on exit (clean shell exit triggers the normal exit path).

### Workloads

```sh
echo warmup; sleep 0.2; seq 1 50000; find /usr/lib -name '*.so' 2>/dev/null | head -500; yes 'hello world' | head -2000; exit 0
```

- `echo warmup; sleep 0.2` — lets the atlas rasterize common ASCII before timing real workloads
- `seq` — burst of short sequential lines, tests frame batching and row rebuild
- `find` — irregular line lengths, mixed output cadence
- `yes` — high-frequency identical lines, tests the low-change-rate path

### Makefile target

```makefile
.PHONY: bench
bench: zig-out/bin/waystty
	WAYSTTY_BENCH=1 ./zig-out/bin/waystty 2>bench.log || true
	@echo "--- frame timing ---"
	@grep -A 12 "waystty frame timing" bench.log
```

## Module 3: perf + flamegraph

### Build mode

`ReleaseSafe` — keeps debug symbols and frame pointers.
`ReleaseFast` may omit frame pointers, producing useless perf stacks. ### Makefile target ```makefile FLAMEGRAPH ?= flamegraph.pl STACKCOLLAPSE ?= stackcollapse-perf.pl .PHONY: profile profile: zig build -Doptimize=ReleaseSafe perf record -g -F 999 --no-inherit -o perf.data -- \ sh -c 'WAYSTTY_BENCH=1 ./zig-out/bin/waystty 2>bench.log' perf script -i perf.data \ | $(STACKCOLLAPSE) \ | $(FLAMEGRAPH) > flamegraph.svg @echo "--- frame timing ---" @grep -A 12 "waystty frame timing" bench.log xdg-open flamegraph.svg ``` `FLAMEGRAPH` and `STACKCOLLAPSE` default to scripts in `PATH` (available via `flamegraph` package on Arch), overridable: `make profile FLAMEGRAPH=~/FlameGraph/flamegraph.pl`. ### Prerequisites - `flamegraph` package (provides `flamegraph.pl` and `stackcollapse-perf.pl`) - `perf` with `CAP_PERFMON` or `/proc/sys/kernel/perf_event_paranoid <= 1` ## Files changed - `src/main.zig` — ring buffer, section timers, SIGUSR1 handler, `WAYSTTY_BENCH` env check - `Makefile` — `bench` and `profile` targets ## Testing - Run `make bench`, verify stats appear in bench.log - Send SIGUSR1 to a running waystty, verify stats print to stderr - Run `make profile`, verify flamegraph.svg opens and shows waystty frames