Thursday, February 27, 2025

Whorld Link: Synchronization and Synesthesia

Whorld is driven by oscillators. In order to make smooth graphics, the oscillators require a precise timing source. In V1, that source was a multimedia timer, but in V2, the timer is gone. In V2, the oscillators are driven by the display monitor: rendering is synchronized with the monitor’s vertical retrace, via a DXGI swap chain, just as in a game.

One consequence of this display-driven design is that precise synchronization with an external device is only possible in one direction: Whorld can be the master, but never the slave. Like V1, V2 currently only supports MIDI input, but it could also send MIDI output, such as MIDI clocks, which an external device could synchronize itself with. It would also be possible to send continuous controller messages for the phase of each oscillator, though this would need to be managed carefully to avoid overloading MIDI and/or the receiver.

For such a scheme, it will matter which type of MIDI we’re talking about. Old-school hardware MIDI has extremely limited bandwidth: 31.25 kbaud, or roughly 1,000 messages per second at most. MIDI over USB is at least an order of magnitude faster. The fastest is a software connection within the same computer, via Tobias Erichsen's loopMIDI for example: that should be nearly instantaneous.
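The bandwidth ceiling for old-school MIDI is easy to derive. A sketch of the arithmetic (assuming the typical three-byte channel message, such as a CC):

```cpp
// Back-of-envelope ceiling for classic (5-pin DIN) MIDI throughput.
// 31,250 baud with 8N1 framing = 10 bits on the wire per byte.
constexpr int kMidiBaudRate = 31250;
constexpr int kBitsPerByte = 10;      // start bit + 8 data bits + stop bit
constexpr int kBytesPerMessage = 3;   // typical channel message, e.g. a CC

constexpr int MidiBytesPerSecond() { return kMidiBaudRate / kBitsPerByte; }
constexpr int MidiMessagesPerSecond() { return MidiBytesPerSecond() / kBytesPerMessage; }
// MidiBytesPerSecond() == 3125; MidiMessagesPerSecond() == 1041
```

Sending per-frame phase for many oscillators would eat that budget quickly, which is why the type of MIDI connection matters.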

I like the idea of sending controller messages for oscillator phase, because it opens up possibilities for generative art, and specifically for synesthesia. I implemented a crude version of this scheme back in 2006, in a project called Plasmagon: a hacked version of Whorld V1 that sent the phase of its oscillators to Propellerhead Reason via MIDI CC messages, so that a specific Whorld patch controlled a music generator in real time. There’s no proper video because Whorld doesn’t compress well, but the music is lovely and you can hear it HERE.

Wednesday, February 26, 2025

Whorld's new thread architecture

In Whorld V1, configure a MIDI controller to control a setting with an immediate and obvious effect, like Master Speed. Now continuously move the controller with one hand, and with your other hand, right-click in Whorld’s caption bar and keep the right mouse button pressed. The MIDI controller stops affecting the rendering, and the freeze persists until you release the right mouse button.

This happens because you’re blocking the Windows message loop, and in V1, the MIDI events are routed through the Windows message loop on their way to the render thread. If the message loop blocks, MIDI events are blocked too.

This bug is gone in V2.

The MIDI thread has a more complex task now. Instead of merely posting input MIDI messages to the UI thread, the MIDI thread now does the mapping, and queues the resulting commands directly to the render thread. The UI thread has no involvement in that route and cannot impede it. The MIDI thread also posts the corresponding parameter changes to the UI thread, so that it can update the sliders and edit boxes. If the UI thread is busy or blocked, updates pile up, but MIDI control of rendering is unaffected. And of course, the UI thread also sends commands to the render thread, in response to user edits.
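The thread-safe command queue at the heart of this route can be sketched in portable C++ (class and method names are hypothetical, not Whorld's actual ones). Since the render thread is driven by vertical retrace rather than a message loop, it simply drains the queue once per frame, so no blocking wait is needed:

```cpp
#include <deque>
#include <mutex>
#include <utility>

// Minimal sketch of a thread-safe command queue. The MIDI and UI threads
// push commands; the render thread drains the queue once per frame.
template <typename T>
class CommandQueue {
public:
  void Push(T cmd) {  // called by the MIDI or UI thread
    std::lock_guard<std::mutex> lock(m_mutex);
    m_queue.push_back(std::move(cmd));
  }
  bool TryPop(T& cmd) {  // called by the render thread, once per frame
    std::lock_guard<std::mutex> lock(m_mutex);
    if (m_queue.empty())
      return false;
    cmd = std::move(m_queue.front());
    m_queue.pop_front();
    return true;
  }
private:
  std::mutex m_mutex;
  std::deque<T> m_queue;
};
```

Because the lock is held only for a push or pop, neither producer can stall the render thread for longer than a queue operation.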

This is a proper professional design. The only complication is that the user can modify the MIDI mapping. That means the MIDI and UI threads must share the mapping data, and such sharing must be managed carefully to avoid data corruption. Luckily, the MIDI thread reads the mapping data but never modifies it. Only the UI thread modifies the mappings, and this makes synchronization much easier.
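One way to exploit the single-writer property is to have the UI thread publish an immutable copy of the mapping, which the MIDI thread grabs as a snapshot per event. A sketch under that assumption (the struct fields and class names are hypothetical):

```cpp
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

struct MidiMapping { int ctrlNum; int targetId; };  // hypothetical fields

// One writer (UI thread), one reader (MIDI thread): the UI thread swaps in
// a fresh immutable copy; the MIDI thread takes a snapshot per MIDI event.
class MappingStore {
public:
  void Update(std::vector<MidiMapping> newMap) {  // UI thread only
    auto p = std::make_shared<const std::vector<MidiMapping>>(std::move(newMap));
    std::lock_guard<std::mutex> lock(m_mutex);
    m_map = std::move(p);
  }
  std::shared_ptr<const std::vector<MidiMapping>> Get() const {  // MIDI thread
    std::lock_guard<std::mutex> lock(m_mutex);
    return m_map;  // snapshot stays valid even if the UI updates meanwhile
  }
private:
  mutable std::mutex m_mutex;
  std::shared_ptr<const std::vector<MidiMapping>> m_map =
    std::make_shared<const std::vector<MidiMapping>>();
};
```

The lock protects only the pointer swap, so the MIDI thread never waits on a mapping edit in progress; it just keeps using its snapshot until the next event.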

(Diagram: Whorld V2 thread architecture)

Sunday, February 23, 2025

V1 versus V2 side-by-side comparison

Here are side-by-side comparisons for some of my favorite snapshots. The antialiasing is a win.

Snapshot Movie throughput test

At the END of OnDraw:

if (m_hMovieFile != INVALID_HANDLE_VALUE) { // if we're recording a snapshot movie
  CBenchmark b; // starts timing
  CSnapshot* pSnapshot = GetSnapshot(); // get the snapshot (allocates on heap)
  DWORD dwBytesWritten;
  WriteFile(m_hMovieFile, pSnapshot, pSnapshot->GetSize(), &dwBytesWritten, NULL); // write the snapshot
  delete pSnapshot; // delete the snapshot
  stats.Print(b.Elapsed()); // accumulate and print write-time statistics
}

And open the file somewhere:

m_hMovieFile = CreateFile(_T("test.whm"),
  GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);

Don't forget to also CLOSE the file via CloseHandle.

The statistics:

minimum: 13 µs
maximum: 175 µs
average: 45 µs
standard deviation: 35 µs

Sample size: 60 (one second at 60 Hz)

Every now and then I see a maximum two orders of magnitude larger, in the range of 3 ms to 6 ms. This is long enough to worry about, and possibly sufficient justification for asynchronous writes, using either a single buffer, or a fancy ring of buffers as GPT suggests.
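The statistics accumulator behind those numbers could look something like this (a sketch using Welford's online algorithm for numerical stability; the class name is hypothetical):

```cpp
#include <algorithm>
#include <cmath>

// Running min/max/mean/standard deviation over per-frame write times.
class Stats {
public:
  void Add(double x) {
    ++m_n;
    m_min = std::min(m_min, x);
    m_max = std::max(m_max, x);
    double delta = x - m_mean;      // Welford's online update:
    m_mean += delta / m_n;          // avoids summing squares, which can
    m_m2 += delta * (x - m_mean);   // lose precision for large samples
  }
  double Min() const { return m_min; }
  double Max() const { return m_max; }
  double Mean() const { return m_mean; }
  double StdDev() const {  // sample standard deviation
    return m_n > 1 ? std::sqrt(m_m2 / (m_n - 1)) : 0;
  }
private:
  int m_n = 0;
  double m_min = 1e308, m_max = -1e308, m_mean = 0, m_m2 = 0;
};
```

Adding one sample per frame keeps the cost negligible, and the accumulator can simply be reset once per second to get the 60-sample windows shown above.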

Asynchronous I/O is ideal when the application can do other work while the I/O is completing, and that's certainly the case here. Whorld can block in Present, and then completely draw the next frame before it needs the result of the previous write. Using a multi-buffer model, it could go even longer without waiting.

This same justification could apply when playing back the movie! By using asynchronous reads and a ring of buffers, it would be possible to stay several frames ahead, so that while the render thread is drawing or presenting the current frame, the OS is busy reading the next frame(s).

Given how complex all that is, it should definitely be wrapped in an object.

Snapshot movies

CK: I’m making steady progress on the Whorld feature set. Today I got snapshots working. A “snapshot” is a Whorld-specific binary file format that contains all the necessary information to reconstruct a Whorld frame. It’s essentially Whorld’s equivalent of a vector-based format. It’s incredibly compact: a typical snapshot is between 10K and 50K bytes. The size varies with the number of “rings” in the drawing. Internally, a snapshot consists of a fixed-length portion containing various drawing state data, followed by a VLA (Variable Length Array) having one RING element for each of the drawing’s rings.

The advantage of snapshots—compared to the File/Export command, which exports the drawing as a PNG file—is that snapshots are resolution-independent. So a snapshot can subsequently be reloaded and exported at a much higher resolution than was being used when the snapshot was “taken” (captured).

In the original Whorld, it was also possible to record a “movie” consisting of a series of snapshots, one per displayed frame. The snapshot movie is resolution-independent; no pixels are recorded. The snapshot movie could subsequently be reloaded and exported as video at any desired resolution. The snapshot movie could also be viewed without exporting it, since the snapshots are easily redrawn as frames. And, it was possible to export only a subset (range) of the frames.

I hope to implement the above-described movie recording capability in Whorld V2, but with some important differences, as follows:

  1. Export image sequence only: V2 will only export a snapshot movie as a numbered image sequence. The primary advantage of an image sequence is that it keeps video compression completely out of Whorld’s code. Even today, Windows video compression is a bloated hellscape of obscurity, and will never be as flexible or satisfactory as ffmpeg. Not to defend Microsoft, but this is partly due to IP issues that ffmpeg’s developers sidestep.

  2. Reconsider snapshot storage: I would like to revisit how exactly the app stores the snapshots during recording. The render thread is responsible for recording the movie, so storing each frame needs to be very fast, otherwise we’ll fall behind the display frequency and rendering will be disrupted.

In my view, the most fundamental decision is whether to store the snapshots in a file, or in memory. To decide which option makes the most sense, we need some capacity analysis. The original Whorld app recorded the snapshot movie to a file, but it also was running at a slower frame rate, typically 25 Hz.

A typical patch with about 200 rings requires around 40K bytes per frame. At 60 Hz, that’s 2.4 MB per second (call it 2.5), 144 MB per minute, or 8.64 GB per hour. Agreed?
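For the record, the capacity arithmetic works out as follows (assuming 40 KB per frame and decimal MB/GB):

```cpp
// Capacity analysis for snapshot movie recording.
constexpr double kBytesPerFrame = 40e3;  // typical patch, ~200 rings
constexpr double kFramesPerSec = 60;     // display frequency

constexpr double BytesPerSecond() { return kBytesPerFrame * kFramesPerSec; }
constexpr double MBPerMinute() { return BytesPerSecond() * 60 / 1e6; }
constexpr double GBPerHour()  { return BytesPerSecond() * 3600 / 1e9; }
// BytesPerSecond() == 2.4e6 (2.4 MB/s); MBPerMinute() == 144; GBPerHour() == 8.64
```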

Ideally the user should be able to record an entire VJ performance, which could potentially last for hours. That goal seems incompatible with recording to memory. A typical target system might have only 16 GB of memory, so we’d be out of memory in less than two hours of recording.

Let’s examine the file option, which will be much less constrained in terms of maximum recording size. According to Google, a typical hard disk write speed is 30-150 MB/s (in 2024). We need 2.5 MB/s, so we’re well within capacity there.

So based on that simple analysis, I’m leaning towards writing the frames to a file, just as Whorld V1 did. The question then is: what type of file will perform best? There are many options in Windows, but based on my experience with another app of mine, WaveShop, which recorded audio, we’ll get the best performance using CreateFile. Do you concur with that opinion?

CreateFile gives us some powerful and flexible options that other file APIs don’t. In particular, we should in my opinion consider this sentence from MSDN: “When FILE_FLAG_NO_BUFFERING is combined with FILE_FLAG_OVERLAPPED, the flags give maximum asynchronous performance, because the I/O does not rely on the synchronous operations of the memory manager.”

Our top priority is that the write operation should be as fast as possible. Is it reasonable in your view to use asynchronous (overlapped) writes for this application? On the surface, it seems ideal, but I know from experience that it brings many complications. In particular, it’s necessary to keep the data buffer and the OVERLAPPED structure allocated and unmodified until the write completes. So in practice I suspect we wouldn’t be able to do better than overlapping by one frame. But this could still be a big win. Let’s compare the cases:

  1. Synchronous write: The render thread finishes rendering the frame. Before presenting the frame to the swap chain, it stores the snapshot data to a buffer, and then writes that buffer to the snapshot movie file. If the file write takes too long, the render thread calls Present too late and misses the vertical retrace, causing a glitch in the displayed output. Disaster!

  2. Asynchronous write: As before, the render thread finishes rendering the frame, stores the snapshot data to a buffer, and then writes it to the file. But in this case, the write function returns almost instantly. Now the render thread calls Present, and while it’s blocked in Present waiting for vertical retrace, the OS writes to the file asynchronously. When Present unblocks, the render thread begins drawing the next frame, and again, the write continues asynchronously. Only when the next frame is completely drawn do we need to wait for previous frame’s write to complete.
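Case 2, overlapping by one frame, can be modeled in portable C++. This is only a sketch of the bookkeeping: std::async stands in for an overlapped WriteFile, and the class and method names are hypothetical. The key constraint from the text is visible in the code: the buffer must stay alive and untouched until the write completes.

```cpp
#include <future>
#include <utility>
#include <vector>

// Model of "overlap by one frame": kick off the write, go draw the next
// frame, and only wait on the previous write when its buffer is needed again.
class FrameWriter {
public:
  // Returns immediately; caller must have awaited any previous write.
  void WriteAsync(std::vector<char> snapshot) {
    m_buffer = std::move(snapshot);  // buffer stays alive until completion
    m_pending = std::async(std::launch::async, [this] {
      return DoWrite(m_buffer);      // simulated slow file write
    });
  }
  // Call before reusing the buffer (i.e., before the next WriteAsync).
  bool WaitLastWrite() {
    return m_pending.valid() ? m_pending.get() : true;  // no write pending
  }
private:
  bool DoWrite(const std::vector<char>& buf) { return !buf.empty(); }
  std::vector<char> m_buffer;
  std::future<bool> m_pending;
};
```

With a single buffer, the wait lands after the next frame is fully drawn, exactly as described above; a ring of buffers would push the wait even further out.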

This sure sounds like a significant optimization. But, overlapped I/O incurs OS overhead. The question we must answer: is the increased overhead of overlapped I/O justified by the potential performance gain from asynchronous writing? In order to determine that, we need a basis for comparison. So it seems to me that the first step is to set up CreateFile for regular synchronous I/O, and then have Whorld record a snapshot movie, and while it does so, accumulate statistics—minimum, maximum, average, and standard deviation—on how long the file writes are taking. If it turns out that writing 40K is consistently too fast to be worth optimizing, we can just go with that, and avoid all the additional development time and complexity risk of using overlapped I/O.

Your thoughts?

ChatGPT: Weighing write strategies

I’m thinking through comparing synchronous and asynchronous snapshot recording methods. Synchronous write to a file seems most viable, using 2.5 MB/s.

ChatGPT: Short Answer: It’s perfectly reasonable to start with a regular synchronous file write (CreateFile in normal mode) and measure how long it takes to write ~40 KB per frame. If it’s consistently small (well under the ~16 ms budget at 60 Hz), you can skip the complexity of overlapped I/O. If it’s borderline, then you can explore overlapped (asynchronous) writes with a small ring buffer of snapshots. Yes, CreateFile is the standard approach for best performance/flexibility (especially if you consider flags like FILE_FLAG_NO_BUFFERING and FILE_FLAG_OVERLAPPED).

Below are the detailed considerations:


1) Writing to Disk vs Memory: The 2.5 MB/s Rate

A) Disk Should Easily Handle ~2.5 MB/s

  • Modern HDDs do 30–150 MB/s, SSDs are far faster.
  • 2.5 MB/s is typically trivial. Even random writes at 40 KB chunks might be okay, but the best-case scenario is a decently contiguous file on disk.

B) Why Not Store in Memory?

  • As you said, 8.64 GB/hour is feasible on disk, but not always feasible in memory if the user records for hours. You’d risk using all your 16 GB of system RAM. So disk is the correct solution.


2) Synchronous Write vs Overlapped I/O

A) Synchronous Write Approach

  1. Render thread finishes frame, obtains the ~40 KB snapshot data, calls WriteFile(...) (synchronous).
  2. If the OS can write it quickly, your thread returns in time for the next Present call.
  3. If the disk is momentarily slow, the write call blocks, the render thread might miss its vertical retrace scheduling → stutter in visuals.

Pros:

  • Very simple code.
  • No concurrency overhead, no buffer management.

Cons:
  • Any disk stall >16 ms can cause a missed frame.

B) Overlapped (Asynchronous) Approach

  1. Render thread obtains the 40 KB snapshot, calls WriteFile with an OVERLAPPED structure. The OS returns immediately if the call is queued successfully.
  2. Meanwhile, the OS writes data to disk in parallel while your thread continues.
  3. You must keep that 40 KB buffer + OVERLAPPED struct valid until the write completes. Possibly use a ring buffer with a few entries so you can keep writing new frames without stalling.
  4. Before reusing that same buffer for the next snapshot, you wait on the previous write’s completion event or poll GetOverlappedResult with bWait=FALSE.

Pros:

  • If the disk is normally fast enough, the thread never or rarely stalls.
  • Rare slow writes might get hidden behind the next frame’s rendering/present.

Cons:
  • More complex code: ring buffer of snapshot memory, overlapped data structures, etc.
  • If the disk can’t keep up for many consecutive frames, eventually your ring buffer fills and you stall anyway.


3) Do a Measurement First

Your plan to “do synchronous I/O and gather min/avg/max time” is excellent:

  1. Set up a test scenario: 60 Hz rendering, 40 KB per frame => ~2.5 MB/s.
  2. Log each WriteFile call’s duration (endTick - startTick).
  3. If you see typical times well under 1 ms, you’re safe with synchronous.
  4. If you see occasional spikes of 50 ms or 200 ms, that might cause dropped frames. Then overlapped is more appealing.

Typical Realities

  • On an SSD or a lightly loaded system, 40 KB writes are often <1 ms, especially if the OS does caching.
  • On an HDD with other processes or random seeks, you might see occasional large stalls.


4) FILE_FLAG_NO_BUFFERING and FILE_FLAG_OVERLAPPED

A) FILE_FLAG_NO_BUFFERING

  • Bypasses the OS file cache, reading/writing directly to/from disk. That requires sector-aligned buffers and sizes. This can improve or degrade performance, depending on the disk usage.
  • Usually used in high-throughput servers or certain real-time apps. For your case, you might get better performance from letting the OS cache the writes in memory (the default), especially for many small ~40 KB writes.

B) Overlapped I/O

  • If you do overlapped, you can combine it with or without NO_BUFFERING.
  • If you do NO_BUFFERING, each write must be multiple-of-sector-size and aligned, which is more complexity.
  • If you do overlapped with default buffering, you can just do 40 KB, the OS caches it, returns quickly, then flushes to disk asynchronously. That might be enough to avoid stutters.

Conclusion: Overlapped + default buffering is typically simpler. NO_BUFFERING might not help for small writes unless you’re extremely performance-critical and prepared to handle alignment constraints.


5) Implementation Approach

  1. Try synchronous first:
     • CreateFile with normal flags, do WriteFile each frame.
     • Measure durations. If it’s always <2 or 3 ms, likely no problem at 60 Hz.
     • If you see random 100 ms stalls, you either accept the occasional stutter or move to asynchronous.

  2. If stalls occur:
     • Do overlapped I/O. Maintain a ring of, say, 3–5 snapshot buffers. Each time you finish a frame, pick the next buffer, do an overlapped WriteFile, and queue the completion. If you loop back around to a buffer that’s still in progress, you stall until that I/O completes.
     • Typically, you’ll never stall unless the disk truly can’t keep up with average throughput.

  3. (Optional) More advanced:
     • If your throughput is borderline, consider chunking multiple frames into one larger write or compressing the snapshots in memory. But 2.5 MB/s is so small that it’s likely your disk can handle it with minimal overhead.


6) Memory Mapped Files?

You could also consider a memory-mapped file approach for writing. But for streaming large amounts of data, typical direct WriteFile calls are simpler and more predictable. The memory map might do some lazy writes or cause page faults. So not necessarily better for real-time streaming.


Conclusion

Your plan is exactly right:

  1. Start with synchronous WriteFile in normal buffered mode.
  2. Log your write durations.
  3. If they’re consistently short (likely on modern hardware), done—no complexity needed.
  4. If you see stutters or big outliers, consider overlapped writes with a ring buffer.
  5. CreateFile is indeed the standard method. FILE_FLAG_NO_BUFFERING can be beneficial in special contexts, but for a small 40 KB repeated write at 2.5 MB/s, the OS caching usually helps.

Hence you can proceed with confidence, measure, and only adopt asynchronous I/O if the real data demands it.

Friday, February 21, 2025

Multi-threaded image capture in Direct2D

CK: So I have the whole “bitmap capture” and export solution in place. It works fine. But I want to verify my solution with you. I went with the split solution, where the rendering thread does everything up to the copy from (GPU) target bitmap to CPU-readable bitmap. It then posts the readable bitmap to the main thread (via post message) and never touches it again.

When the main thread handles the bitmap capture message, it does the map, and then does all the WIC stuff, writing the image to disk. It then commits, unmaps, releases the bitmap pointer, and that’s all.

This is a handy division of labor, because the map operation is what actually copies the image from GPU to CPU memory, and that part can be quite slow. So it’s very convenient that that’s occurring on the main thread’s time, because that way it doesn’t disrupt rendering.

My understanding is that this scheme does comply with the COM rules. If I use COM without enabling multi-threaded mode, then I am responsible for preventing race conditions. In other words, I am responsible for ensuring that any given COM instance is accessed by one and only one thread at any given time.

My scheme certainly ensures that. As I said, once the render thread posts the bitmap, it never touches it again. The simple fact of queuing via the Windows message loop ensures there is no possibility whatsoever for a race.

Do you agree with the above?

The mere fact that the scheme appears to work proves nothing of course, with multi-threading we must be on sure footing conceptually.

ChatGPT o1: Yes, your handoff scheme should be perfectly valid under Direct2D/COM’s single-thread concurrency rules, as long as:

  1. The render thread never touches the CPU-readable bitmap object again after posting it to the UI thread.
  2. The UI thread fully “owns” that object from that point on (maps it, does WIC, unmaps, releases).

That ensures there is no concurrent access at all—exactly what COM’s single-thread apartment model expects. Below is a bit more detail on why it’s safe:

1) COM Single-Thread Rules

  • In default (STA) COM: Each COM object instance is associated with the thread that creates it, and is not automatically “marshaled” to other threads.
  • However, Direct2D and DXGI objects are somewhat “special.” They aren’t strictly STA for all their internal logic, but the official stance is that any one ID2D1 object is not guaranteed to be thread‐safe. So if you want to pass it across threads, you must ensure there’s no overlapping usage.

Your usage pattern is:

  1. Render thread calls CreateBitmap(..., CPU_READ | CANNOT_DRAW), then does CopyFromBitmap(...).
  2. Render thread posts the resulting ID2D1Bitmap1 * to the UI thread via a Windows message.
  3. Render thread never touches that bitmap pointer again.
  4. UI thread receives the message, does Map(...), WIC encode, Unmap(), Release(), etc.

No concurrency occurs. Each piece of code “owns” the bitmap at different times, with no overlap.

2) Typical Precedent in D2D/DX

  • This pattern is quite similar to a “CPU readback” approach in D3D, where you create a staging resource on one thread, then pass it to another for final saving. The key is that only one thread interacts with that resource at a time.
  • The D2D docs mention that ID2D1Bitmap1 is not thread‐safe, but it’s fine if you exclusively access it from a single thread (the UI thread), after you’re done setting it up on the render thread.

3) The Windows Message “Boundary”

Posting a pointer in a WM_USER + X message is a classic way to pass COM pointers between threads. Since your render thread is done with that pointer, the UI thread can safely do Map(...) and so on. The Win32 message queue ensures a “happens-before” relationship—by the time the UI thread processes that message, the render thread has finished. That eliminates any race condition concerns.

4) Summary

Yes, the fact that you see it “working” is consistent with being on correct conceptual footing: you are ensuring that once the resource is handed off, only the UI thread uses it. That’s exactly the recommended approach for single-thread concurrency in Direct2D/COM. You avoid stalling your render loop (the slow disk I/O or GPU→CPU copy occurs on the UI thread’s time) and you keep the concurrency model simple.

So, your scheme is indeed safe and correct for the reasons you outlined. Nice work!
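The single-owner handoff can be modeled in portable C++. In this sketch, ReadableBitmap stands in for the CPU-readable ID2D1Bitmap1, and moving the unique_ptr plays the role of the posted message: once the render thread moves the pointer out, it has nothing left to touch.

```cpp
#include <memory>

// Ownership-transfer model of the capture handoff. All names are
// hypothetical stand-ins for the real D2D/WIC objects.
struct ReadableBitmap { int width = 0, height = 0; };

std::unique_ptr<ReadableBitmap> CaptureFrame() {  // render thread
  // in the real code: CreateBitmap(CPU_READ | CANNOT_DRAW), CopyFromBitmap
  return std::make_unique<ReadableBitmap>(ReadableBitmap{1920, 1080});
}

void OnBitmapCapture(std::unique_ptr<ReadableBitmap> bmp) {  // UI thread
  // in the real code: Map, WIC encode to disk, Unmap
}  // bitmap destroyed at scope exit, as in the real Release() call
```

The compiler enforces the contract: after `OnBitmapCapture(std::move(bmp))`, the render-thread pointer is null, so accidental reuse is impossible by construction.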

Thursday, February 20, 2025

Whorld V2 Roadmap

The alpha version of V2 is done, and being tested. As with FFRend, rendering runs entirely in its own worker thread. The renderer doesn’t even have a Windows message loop, so it cannot be blocked by the UI. The UI can only communicate with the renderer via a thread-safe command queue. The renderer doesn’t have a timer either, and that’s why there’s no frame rate setting. Instead the renderer runs at the monitor’s display frequency, via a DXGI swap chain, just as game developers do. Whorld has essentially been restructured into a modern Windows game.

Whorld V2 now renders entirely via Direct2D, which is fully hardware-accelerated. Direct2D is a wrapper around Direct3D, so we’re running Direct3D under the hood. Tessellation is on the CPU, possibly on multiple cores, but drawing is on the GPU. But I digress. Here’s a quick roadmap for what lies ahead:

  1. MIDI mapping. This will be a complete do-over. I plan to follow the same paradigm used in my Polymeter app. The advantage of that paradigm is flexibility. A control can map to multiple targets, a target can map to multiple controls, all bases are covered. I expect MIDI mapping to be a long-ish ride because V2 is multithreaded, which adds spice. With the MIDI revamp will come a new Options dialog, based on standard property sheets, again borrowed from the Polymeter project. Various other features are also waiting on an options dialog.

  2. Mirroring. I had a long chat with GPT about this last night. For programming projects, I talk to version o1; slow but comprehensive. It was a bit vague at first, but I busted its imaginary balls a bit, and it coughed up some solid-looking Direct2D code for mirroring. The method is quite complex and may affect performance, but it’s definitely worth a try. The core problem is that there’s no way to copy a rectangle within the back buffer anymore. That went away with DirectDraw and the move to GPUs. In D2D a buffer can be a source or a destination, but never both. So instead, we must use command lists, which are an advanced Direct2D feature. I expect some R&D and a fairly steep learning curve.

  3. Playlists. While we’re redoing the MIDI mapping, we should reconsider the Playlist file format, which is how OG Whorld stored its MIDI settings. That file format is a mess and needs a rethink.

  4. Image export. Exporting a PNG will be fairly easy. The tricky part is learning to use WIC, but GPT will help with that. In fact I already have a basic export working as of today, only the UI remains to be done.

  5. Snapshots. A snapshot is Whorld's vector format. It captures the entire ring stack, along with all related state information, in a highly compact form that’s losslessly scalable because no pixels are involved. In the original version, the snapshot format depends heavily on Whorld’s internal structure, which makes the format somewhat fragile. Perhaps we can do better this time. I have considerable personal motivation to work on snapshots, because I have a large and beloved collection of snapshots, some of which have potential as art, and they will definitely look vastly better with antialiasing.

  6. TRANSPARENCY. It’s got the word “trans” in it, so we gotta go for it. It will hurt performance, but it’ll look so cool it’ll be worth dropping to a lower ring count. Transparency will definitely work in fill mode, and that will be the most awesome use of it, because you will partially see through the “cone,” revealing structure that was previously hidden. I envision two types: one where we change the alpha of the entire ring list, and another where the rings gradually get more transparent as they migrate further from the center. Both have potential.
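The second transparency variant amounts to a per-ring alpha that falls off with distance from the center. A minimal sketch, assuming a linear fade (the function and parameter names are hypothetical):

```cpp
// Hypothetical per-ring fade: rings get more transparent as they migrate
// outward. globalAlpha models the whole-ring-list variant, so the two
// transparency types compose naturally.
float RingAlpha(float radius, float maxRadius, float globalAlpha) {
  float t = radius / maxRadius;  // 0 at the center, 1 at the outer edge
  if (t > 1.0f)
    t = 1.0f;                    // clamp rings beyond the nominal edge
  return globalAlpha * (1.0f - t);  // linear fade; an eased curve also works
}
```

The result would feed the alpha channel of each ring's brush color before drawing.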

After that, I’m not sure. XOR is vaguely possible, but it would be hard. D2D supports other blending modes, just not that one. But there is the option to create a so-called “effect” which is a type of custom blending mode. It means writing shader code. With GPT’s help, I can probably manage it, but not soon.

Tuesday, February 11, 2025

Whorld V2 is coming!

Whorld V2 renders much faster, allowing higher ring counts and frame rates, and also features anti-aliasing, eliminating jaggies. The UI is completely revamped, with many enhancements such as dockable panes, unlimited undo, and much improved handling of full screen mode, in both single and dual monitor configurations.

It's been a long time since there was news on the Whorld front. That's partly because I've been busy with other projects, but also because I needed to learn new skills. Under the hood, all of the graphics are now done in Direct2D 1.1, and the design is fully multi-threaded so that the UI and the rendering are completely disentangled. Rendering is silky smooth and stable at 60 frames per second, even with high ring counts.

Here's the done list:

  • Direct2D 1.1 initialization and rendering structure (mostly borrowed from FauveEdit).
  • Multithreaded design - all Direct2D access is by a worker thread (render thread).
  • Full screen and windowed modes, in both single and dual monitor configurations.
  • Parameter editing - via row dialog, imported from legacy version, with a few tweaks.
  • Data architecture - CWhorldBase, with structures to organize all app data.
  • Display frame rate in status bar - measure elapsed time via performance counter.
  • Routing of document update notifications, adhering to the MFC SDI scheme.
  • Undo system - imported from other projects, integrates with MFC.

Still to do:

  • Draw the Whorld graphics, replacing the placeholder drawing - first the basics, then:
    • Bézier curves.
    • Fill mode (as opposed to line mode).
    • Quad mirroring.
    • XOR mode (would require a custom Direct2D effect).
  • Options dialog, probably based on CMFCPropertyGridCtrl as in the Polymeter app.
  • MIDI support, with "learn" mode - much can be imported from the Polymeter app.
  • Bitmap export - somewhat complicated by multithreaded design.
  • Snapshot read/write/display - a vector-based format that captures the graphics state.