Wednesday, March 05, 2025

PushCommand: Handling render command queue full

CK: Based on all we’ve discussed previously, let’s review my PushCommand method. Thank you for your suggestions, several of which are incorporated. In particular, exponential backoff is implemented. Bear in mind that only three threads are involved: the main thread, the render thread, and the MIDI input callback. Note also that the render thread never calls PushCommand; its role is strictly to drain the command queue, once per frame. The only callers of PushCommand are the main thread (in response to user activity) and the MIDI input callback (in response to MIDI messages from a control surface).

The handling for a failed push—due to the command queue being full—is as follows:

  • For the main thread: Simply display a retry/cancel message box.

  • For the MIDI input thread: Retry the push, with an exponentially increasing timeout (Sleep) between each retry. The first Sleep is zero (relinquishing the remainder of the time slice); after that it is 1, 2, 4, 8, 16, etc. milliseconds. If the total timeout reaches the limit (256 ms), enter an error state.

This is the “all retries failed” error state. In this state, no retries are attempted for the next 2500 ms. This is essential, because otherwise PushCommand could block for every MIDI message. Instead, for the next 2500 ms, the MIDI input callback continues to map messages to commands, but discards the commands, and is thus unimpeded.
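
To make the timing concrete, here is a small standalone sketch (illustrative names, not the production code) that models the two policies described above: the gate that rejects pushes during the 2500 ms break, and the doubling backoff schedule, whose total sleep tops out at 255 ms (0 + 1 + 2 + … + 128), just under the cap.

```cpp
#include <cstdint>

// Illustrative model (not the production code) of the retry policy
// described above. All times are in milliseconds.
const unsigned kMaxTotalTimeout  = 256;   // cap on total time spent sleeping
const std::int64_t kRetryBreakMs = 2500;  // duration of the error-state break

// True if a push should fail immediately because we are still inside
// the 2500 ms break that follows a failed retry loop. nLastErrorTime
// is the tick count when the last retry loop failed; zero means no
// error state is in effect.
bool InRetryBreak(std::int64_t nLastErrorTime, std::int64_t nTimeNow)
{
	return nLastErrorTime + kRetryBreakMs > nTimeNow;
}

// Total milliseconds the backoff loop can sleep before giving up,
// mirroring the doubling sequence 0, 1, 2, 4, ... used by PushCommand.
unsigned MaxBackoffSleep()
{
	unsigned nTotal = 0, nLen = 0;
	while (nTotal + nLen < kMaxTotalTimeout) {
		nTotal += nLen;              // count this sleep toward the total
		nLen = nLen ? nLen << 1 : 1; // exponential backoff
	}
	return nTotal;  // 0 + 1 + 2 + ... + 128 = 255
}
```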

PushCommand posts a notification message to the main thread when it enters the error state. The main thread then displays a dialog informing the user that an error has occurred. It uses reentrance guards to ensure that only one dialog can be displayed at a time.
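
The reentrance guard itself is not shown below; a minimal sketch of the idea (hypothetical names, not the app's actual main-frame code) is a flag that is set while the dialog is up and causes further notifications to be dropped. Since only the main thread touches the flag, a plain bool suffices:

```cpp
// Minimal sketch of a reentrance guard for the error dialog; the real
// handler lives in the main frame and the names here are illustrative.
// Only the main thread reads and writes the flag, so no lock is needed.
class CRenderErrorGuard {
public:
	CRenderErrorGuard() : m_bInError(false) {}
	// Called from the queue-full message handler. Returns true if the
	// caller should display the dialog; returns false if a dialog is
	// already up, in which case the notification is silently dropped.
	bool TryEnter()
	{
		if (m_bInError)
			return false;   // already handling this error; ignore
		m_bInError = true;  // block reentrance until Leave() is called
		return true;
	}
	void Leave() { m_bInError = false; }  // dialog dismissed
	bool InRenderFullError() const { return m_bInError; }
private:
	bool m_bInError;  // true while the error dialog is displayed
};
```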

The user of the MIDI input device will be unable to control the rendering during these 2500 ms of error state. This is by design, and it is no great disaster.

Let’s assume MIDI input continues to arrive and gets mapped to commands. One of two things will happen:

  1. The render thread will catch up and empty the queue. In this case, the next PushCommand will succeed, exiting the error state.

  2. Though the queue is still full, the 2500 ms break will expire. In this case, PushCommand will begin the retry loop again.

Some other things to note about this code:

  • Under no circumstance will PushCommand notify the main thread more often than once every 2500 ms.

  • The most likely execution path (the enqueue succeeds on the first try) is fully optimal; the error handling imposes no additional overhead of any kind in that case.

  • This code is not merely hypothetical; it has been tested extensively using a so-called “torture test” consisting of a separate application that fires MIDI messages at the visualizer app, along with a temporary kludge in the visualizer app that causes it to render unusually slowly. The visualizer was reduced to frame rates below 1 Hz for extended periods, but no harmful effects were noticed: the MIDI input continued to arrive normally, the OS did not complain, and no anomalous thread interactions (such as priority inversion) occurred.

And without any further introduction, here is the code of PushCommand:

bool CWhorldThread::PushCommand(const CRenderCmd& cmd)
{
	while (!CRenderThread::PushCommand(cmd)) {	// try to enqueue command
		// enqueue failed because render command queue was full
		if (CWhorldApp::IsMainThread()) {	// if we're the user-interface thread
			// if not already handling render queue full error
			if (!theApp.GetMainFrame()->InRenderFullError()) {
				// give the user a chance to retry enqueuing the command
				if (AfxMessageBox(IDS_APP_ERR_RENDER_QUEUE_FULL, MB_RETRYCANCEL) != IDRETRY) {
					return false;	// user canceled, so stop retrying
				}
			}
		} else {	// we're a worker thread
			// all times are in milliseconds
			const UINT	nMaxTotalTimeout = 256;	// maximum total duration of retry loop
			const UINT	nRetryBreakTimeout = 2500;	// duration of break from retries
			LONGLONG	nTimeNow = static_cast<LONGLONG>(GetTickCount64());
			// if we're in an error state, taking a break from doing retries
			if (m_nLastPushErrorTime + nRetryBreakTimeout > nTimeNow) {
				return false;	// push command fails immediately, no retries
			}
			// do a limited number of retries, separated by increasing timeouts
			UINT	nTotalTimeout = 0;
			UINT	nTimeoutLen = 0;	// zero means relinquish remainder of time slice
			// while total time spent sleeping remains within limit
			while (nTotalTimeout + nTimeoutLen < nMaxTotalTimeout) {
				Sleep(nTimeoutLen);	// do a timeout of the specified length
				if (CRenderThread::PushCommand(cmd)) {	// retry enqueuing command
					// success: clear error state by zeroing time of last error
					InterlockedExchange64(&m_nLastPushErrorTime, 0);
					return true;	// retry succeeded
				}
				// retry failed: increase timeout and try again if permitted
				nTotalTimeout += nTimeoutLen;	// add timeout to total time slept
				if (nTimeoutLen) {	// if non-zero timeout
					nTimeoutLen <<= 1;	// double timeout (exponential backoff)
				} else {	// zero timeout
					nTimeoutLen = 1;	// start doubling from one
				}
			}
			// all retries have failed, so take a break from doing retries,
			// to avoid blocking the worker thread on every attempted push
			InterlockedExchange64(&m_nLastPushErrorTime, nTimeNow);
			// notify main thread that an unrecoverable error occurred
			PostMsgToMainWnd(UWM_RENDER_QUEUE_FULL);
			return false;	// we are in the retries failed error state
		}
	}
	return true;
}  
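
For reference, the queue behind CRenderThread::PushCommand is not shown above. A minimal bounded ring buffer with the same non-blocking try-push/try-pop contract might look like the sketch below (an assumption for illustration, not the app's actual implementation; since two threads push, this version uses a mutex, whereas the real queue could instead be lock-free):

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Illustrative bounded command queue with the contract PushCommand
// relies on: TryPush fails (returns false) when the queue is full
// rather than blocking. One slot is left empty to distinguish a full
// queue from an empty one.
template <typename T>
class BoundedQueue {
public:
	explicit BoundedQueue(std::size_t nCapacity)
		: m_arr(nCapacity + 1), m_iHead(0), m_iTail(0) {}
	bool TryPush(const T& item)  // called by main thread and MIDI callback
	{
		std::lock_guard<std::mutex> lock(m_mutex);
		std::size_t iNext = (m_iTail + 1) % m_arr.size();
		if (iNext == m_iHead)
			return false;  // queue full; caller handles backoff/error state
		m_arr[m_iTail] = item;
		m_iTail = iNext;
		return true;
	}
	bool TryPop(T& item)  // called by render thread, draining once per frame
	{
		std::lock_guard<std::mutex> lock(m_mutex);
		if (m_iHead == m_iTail)
			return false;  // queue empty
		item = m_arr[m_iHead];
		m_iHead = (m_iHead + 1) % m_arr.size();
		return true;
	}
private:
	std::vector<T>	m_arr;
	std::size_t	m_iHead, m_iTail;
	std::mutex	m_mutex;
};
```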

GPT: Overall, this code cleanly implements your stated goals and should be stable under the conditions you’ve described. Here are a few concise observations and potential refinements:

  1. Exponential Backoff Logic
    • You start with Sleep(0) (yield) and then escalate to 1 ms, 2 ms, 4 ms, etc., up to a total of 256 ms. This is a good balance between not hogging the CPU and quickly recovering when the queue frees up.
    • If you need finer control, you could add a limit to the backoff (e.g., never exceed 16 ms) so you don’t wait too long in one step. But that’s optional.
  2. Avoiding Frequent Error Dialogs
    • You’ve addressed the “perpetual message box” problem by allowing only one dialog at a time, and by blocking further retries for 2500 ms after failure. This ensures you won’t spam the user, and that the MIDI callback is never permanently tied up once it enters the error state.
    • If you need even fewer dialogs (because 2.5 seconds might still produce multiple popups in rare edge cases), you could keep track of whether you’ve already posted an error for the current break period, and skip posting again. You’re effectively doing this by resetting m_nLastPushErrorTime, so it’s likely fine.
  3. Main Thread vs. MIDI Callback Differences
    • The main thread can keep retrying indefinitely (user chooses “Retry”), which is acceptable for non-real-time code.
    • The MIDI callback bails out after roughly 256 ms of total retry time, which avoids excessive blocking. If it fails, you enter an error state for 2.5 seconds and drop incoming commands during that time. This is a very reasonable compromise in a scenario where losing a few MIDI commands isn’t catastrophic.
  4. No Extra Overhead on Success
    • On the most common path (queue not full), CRenderThread::PushCommand(cmd) returns true immediately, and your code imposes practically zero overhead. This is ideal in normal operation.
  5. Edge Cases
    • Wrap-around of GetTickCount64() is effectively a non-issue on modern systems (it’s 64-bit, so it won’t wrap for ages). If you were using 32-bit GetTickCount, you’d need extra care every ~49.7 days.
    • If the queue remains completely full for a very long time (e.g. user sets a bizarrely slow frame rate), you’ll keep going into the error state and discarding MIDI commands. That’s intentional and safe, but the user could remain effectively locked out of MIDI control. This is consistent with your design, just be sure to handle or log that scenario so it’s not mysterious.
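
The per-step cap from suggestion 1 is a one-line change to the backoff update. A sketch (the 16 ms ceiling is illustrative):

```cpp
// Sketch of the backoff update from suggestion 1: exponential doubling,
// but with each individual sleep clamped (here at 16 ms) so no single
// step waits too long. The constant is illustrative.
const unsigned kMaxStepMs = 16;  // ceiling on any single sleep

unsigned NextTimeout(unsigned nTimeoutLen)
{
	if (!nTimeoutLen)
		return 1;  // start doubling from one
	unsigned nNext = nTimeoutLen << 1;               // double (exponential backoff)
	return nNext < kMaxStepMs ? nNext : kMaxStepMs;  // clamp to ceiling
}
```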

In short, your approach should work well under real-world conditions for a visualizer. The code is clear and handles worst-case scenarios gracefully—there’s no risk of indefinite blocking, yet the user is informed if the queue remains blocked. If you need more fine-grained backoff tuning or alternative error-notification logic, that’s straightforward to tweak, but as is, your design is sound.


CK: Thank you for that appraisal. I’m glad to hear that you agree that the design is sound. But we’re still not quite there. I feel that a bit more work is needed on the user interface side. The reentrance guards are not quite enough. In the case where the system is seriously backed up (due to a very low frame rate and a consistently high data rate from the MIDI input device), the main thread could still show dialogs too often. The user presses OK to close the dialog, and 2500 ms later the app shows the same dialog again. This is annoying and unprofessional, and a better design is certainly possible.

The fault lies not with PushCommand, which is only doing its job, but with the main thread. The main thread needs to be smarter about how it handles the notifications. My proposal is this:

After the main thread displays the “Render command queue is full” warning message box, but before it clears the reentrance guard flag, the main thread should check to see whether the command queue is still full. This is simply a read of the number of elements in the ring buffer. The answer doesn’t have to be perfect, so there’s no thread safety issue and hence no synchronization object is required. If the main thread finds that the command queue is still full, it should display a second “yes/no” message box, something like this: “Rendering is taking longer than usual. Would you like to reset to default settings?” And, if the user selects “yes”, the handler should display a modal dialog, and then try to load default settings. The loading may fail—due to the command queue being full—in which case the load should be repeated (after a brief timeout) until either it finally succeeds, or the user cancels the dialog. This will eventually clear the “logjam” and get things flowing again.
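
The proposed recovery loop can be sketched as follows. The names are hypothetical: TryLoad stands in for the attempt to load default settings (which fails while the command queue is still full), IsCanceled reflects the modal dialog's cancel button, and Wait is the brief timeout between attempts (e.g. a short Sleep). In the app, this loop would run from the dialog's timer or a similar mechanism rather than blocking the main thread.

```cpp
#include <functional>

// Sketch of the proposed "clear the logjam" loop (hypothetical names).
// Repeatedly attempts to load default settings until the load succeeds
// or the user cancels the dialog. Returns true if the defaults were
// eventually loaded, false if the user canceled.
bool LoadDefaultsWithRetry(const std::function<bool()>& TryLoad,
			   const std::function<bool()>& IsCanceled,
			   const std::function<void()>& Wait)
{
	while (!IsCanceled()) {
		if (TryLoad())
			return true;  // logjam cleared; settings loaded
		Wait();  // brief timeout before the next attempt
	}
	return false;  // user canceled the dialog
}
```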
