A while ago, I was part of a team developing embedded software. The software was deeply rooted in state machines: dozens of them, spread across multiple functions. While this architecture is common in embedded development, especially for systems without an operating system, I started to question: Is this really the clearest way to express control flow?
The state machines in our code worked fine, but understanding and maintaining them was often a headache. They lacked a linear flow, requiring mental juggling of flags, states, and transitions scattered across polling functions.
I kept thinking: Wouldn't this be easier if we could just write the logic like a sequential program—waiting for events and resuming where we left off?
Of course, the project didn’t allow us to use an RTOS. So, the conventional approach of using threads or blocking system calls to manage concurrency was off the table. Yet, I knew there had to be a middle ground.
Around that time, I had been using coroutines in languages like Python, JavaScript, Dart, and Rust. They allow you to pause and resume execution without relying on threads—offering a kind of cooperative multitasking.
It hit me: this coroutine pattern could be the perfect fit for our problem—providing concurrency without requiring an OS.
Before diving into a coroutine-based solution, let’s take a step back and look at a small toy example that illustrates the problem.
We want to implement an LED blinker with a user-controllable period, denoted as $p$. Initially, the LED blinks with a fixed period of 2 seconds. However, the user should be able to change this period at any time by pressing and holding a button. When the button is released, the LED should restart its blinking cycle with a new period equal to twice the duration the button was held; in other words, the hold time becomes the new half-period $p/2$.
To model this behavior, we can use two simple state machines, as shown in the following figure:
The first state machine, led_blinker, consists of two states: LED_ON and LED_OFF. The system transitions from LED_ON to LED_OFF after a delay of $p/2$, and similarly from LED_OFF back to LED_ON after another $p/2$. Additionally, if a resetLed event is received from the second state machine, the led_blinker transitions immediately to the LED_OFF state, regardless of its current state. This state machine starts in the LED_OFF state.

The second state machine, button_record, also has two states: WAIT_BUTTON_PRESSED and WAIT_BUTTON_UNPRESSED. In the initial state, it waits for the user to press the button. Once the button is pressed, it records the current time $t_s$ and transitions to the WAIT_BUTTON_UNPRESSED state. In this state, it waits for the user to release the button. When the button is released, it captures the current time $t_e$, calculates the new half-period as $p/2 = t_e - t_s$, emits a resetLed event, and returns to the WAIT_BUTTON_PRESSED state.
Implementing this with polling-based state machines on Arduino would look something like this:
#define BUTTON_PIN 2
enum led_blink_state {
STATE_LED_OFF = 0,
STATE_LED_ON
};
enum led_blink_state led_blink_state = STATE_LED_OFF;
uint64_t led_blink_duration_ms = 1000;
uint64_t led_blink_toggle_time = 0;
uint8_t reset_led_requested = 0;
enum button_record_state {
STATE_WAIT_BUTTON_PRESSED = 0,
STATE_WAIT_BUTTON_UNPRESSED
};
enum button_record_state button_record_state = STATE_WAIT_BUTTON_PRESSED;
void setup() {
led_blink_state = STATE_LED_OFF;
button_record_state = STATE_WAIT_BUTTON_PRESSED;
led_blink_toggle_time = millis() + led_blink_duration_ms;
pinMode(LED_BUILTIN, OUTPUT);
pinMode(BUTTON_PIN, INPUT_PULLUP);
}
void poll_led_blink() {
if (led_blink_state == STATE_LED_OFF) {
digitalWrite(LED_BUILTIN, LOW);
} else if (led_blink_state == STATE_LED_ON) {
digitalWrite(LED_BUILTIN, HIGH);
}
if (reset_led_requested) {
reset_led_requested = 0;
led_blink_state = STATE_LED_OFF;
led_blink_toggle_time = millis() + led_blink_duration_ms;
} else if (millis() >= led_blink_toggle_time) {
if (led_blink_state == STATE_LED_OFF) {
led_blink_state = STATE_LED_ON;
} else if (led_blink_state == STATE_LED_ON) {
led_blink_state = STATE_LED_OFF;
}
led_blink_toggle_time = millis() + led_blink_duration_ms;
}
}
void poll_button_record() {
static uint64_t button_pressed_start_time = 0;
if (button_record_state == STATE_WAIT_BUTTON_PRESSED) {
if (digitalRead(BUTTON_PIN) == LOW) {
button_record_state = STATE_WAIT_BUTTON_UNPRESSED;
button_pressed_start_time = millis();
}
} else if (button_record_state == STATE_WAIT_BUTTON_UNPRESSED) {
if (digitalRead(BUTTON_PIN) == HIGH) {
button_record_state = STATE_WAIT_BUTTON_PRESSED;
uint64_t button_pressed_end_time = millis();
led_blink_duration_ms = button_pressed_end_time - button_pressed_start_time;
reset_led_requested = 1;
}
}
}
void loop() {
poll_led_blink();
poll_button_record();
}
This implementation is almost a one-to-one translation of the state machines into C code. Mapping the diagram to code is relatively straightforward. However, once you look solely at the code, it becomes difficult to follow the actual behavior. That’s because there's no linear control flow in either the poll_led_blink
or poll_button_record
functions. Instead, they’re repeatedly called in a loop, checking the current state and reacting accordingly—which fragments the logic and makes it harder to reason about.
Wouldn’t it be simpler if each state machine function could just pause—waiting for something to happen, like a button press or release, a timer to expire, or a resetLed
event—and then resume execution from that point onward? This kind of structure would allow us to write code that follows a clear, sequential flow. Implementing such behavior becomes quite straightforward when using FreeRTOS, by mapping each state machine to a separate task that can block while waiting for events.
#include <Arduino_FreeRTOS.h>
#define BUTTON_PIN 2
TickType_t led_blink_duration_ticks = pdMS_TO_TICKS(1000);
TaskHandle_t led_blink_task_handle;
#define NOTIFYBIT_RESET_LED 0x80
void led_blink(void *pvParameters) {
while (true) {
digitalWrite(LED_BUILTIN, LOW);
if (xTaskNotifyWait(0, NOTIFYBIT_RESET_LED, NULL, led_blink_duration_ticks) == pdTRUE) {
continue;
}
digitalWrite(LED_BUILTIN, HIGH);
xTaskNotifyWait(0, NOTIFYBIT_RESET_LED, NULL, led_blink_duration_ticks);
}
}
// Busy-wait as long as the pin reads `level`; returns once it changes.
void wait_pin(int pin, int level) {
while (digitalRead(pin) == level)
;
}
void button_record(void *pvParameters) {
while (true) {
wait_pin(BUTTON_PIN, HIGH);
TickType_t start_time_ticks = xTaskGetTickCount();
wait_pin(BUTTON_PIN, LOW);
TickType_t end_time_ticks = xTaskGetTickCount();
led_blink_duration_ticks = end_time_ticks - start_time_ticks;
xTaskNotify(led_blink_task_handle, NOTIFYBIT_RESET_LED, eSetBits);
}
}
void setup() {
pinMode(LED_BUILTIN, OUTPUT);
pinMode(BUTTON_PIN, INPUT_PULLUP);
xTaskCreate(led_blink, "led_blink", 512, NULL, 1, &led_blink_task_handle);
xTaskCreate(button_record, "button_record", 512, NULL, 1, NULL);
}
void loop() {}
In my view, this approach is much easier to read and understand. It also eliminates the need to define a discrete state machine design upfront—you can simply express the logic directly as sequential code.
However, there's an important trade-off: this solution requires an operating system. Specifically, it relies on (preemptive) scheduling to switch between tasks, which means your project must include an OS like FreeRTOS.
So finally, on to the implementation with my hacky macro-based coroutines. The rough structure of the tasks is very similar to the FreeRTOS-based approach. Let's start with the implementation of the button_recorder first:
CORO(button_recorder_fn,
CORO_NO_ARGS,
CORO_LOCALS(uint64_t button_pressed_start_time;),
CORO_CALLS(
CORO_CALL(wait_pin_low, wait_pin),
CORO_CALL(wait_pin_high, wait_pin)), {
coro_res_t res;
while (true) {
CALL(res, wait_pin_low, wait_pin, BUTTON_PIN, LOW);
LOCAL(button_pressed_start_time) = millis();
CALL(res, wait_pin_high, wait_pin, BUTTON_PIN, HIGH);
uint64_t button_pressed_end_time = millis();
led_blink_duration_ms = button_pressed_end_time - LOCAL(button_pressed_start_time);
coro_cond_var_notify(&reset_led_signal);
}
return CORO_RES_DONE;
})
At first glance, this might look like just a wall of macros. But what's really happening here is that the macros are effectively transpiling the coroutine into an explicit state machine at compile time. To understand this, we need to take a brief detour into how normal function calls work and what we're doing differently here.
In a typical C function, the call stack is used to manage control flow and store local variables. Every function call pushes a stack frame onto the call stack, which contains the return address, parameters, and local variables. Once the function returns, the frame is popped off, and execution continues where it left off.
However, in our coroutine system, we can't rely on this built-in stack mechanism because we're pausing execution at arbitrary points and resuming later—potentially from a different part of the program loop. Instead, we manage our own manual control flow. All local variables that need to persist across await-style calls must be stored outside the standard stack—in a structure tied to the coroutine's context. That's what the CORO_LOCALS macro sets up.
Similarly, since there's no stack to implicitly remember where we left off (like a return address), we use an explicit state variable to represent the coroutine's current execution point. Each possible CALL to a sub-coroutine defines a state, declared via CORO_CALLS, and each of those sub-coroutines has its own context as well. This is very much like manually implementing what the C compiler normally handles for you when it compiles regular function calls.
This pattern is reminiscent of a classic C trick known as Duff's Device—a technique that uses a switch statement combined with loop unrolling to implement a form of coroutine or co-operative multitasking. We're using a similar mechanism here: a switch-based jump table that resumes execution at the exact point it was paused last time.
To make this clearer, here’s what the button_recorder
coroutine looks like after macro expansion:
enum button_recorder_fn_fct_state {
button_recorder_fn_state_initial = 0,
coro_state_wait_pin_low,
coro_state_wait_pin_high
};
struct button_recorder_fn_fct_ctx {
enum button_recorder_fn_fct_state state;
uint64_t button_pressed_start_time;
union {
uint8_t _placeholder_;
struct wait_pin_fct_ctx wait_pin_low;
struct wait_pin_fct_ctx wait_pin_high;
} calls;
};
coro_res_t button_recorder_fn_fct(struct button_recorder_fn_fct_ctx *ctx) {
switch (ctx->state) {
case button_recorder_fn_state_initial: {
coro_res_t res;
while (1) {
ctx->state = coro_state_wait_pin_low;
ctx->calls.wait_pin_low.state = (enum wait_pin_fct_state)0;
case coro_state_wait_pin_low:
res = wait_pin_fct(&ctx->calls.wait_pin_low, 2, LOW);
if (res & CORO_RES_PENDING)
return res;
(ctx->button_pressed_start_time) = millis();
ctx->state = coro_state_wait_pin_high;
ctx->calls.wait_pin_high.state = (enum wait_pin_fct_state)0;
case coro_state_wait_pin_high:
res = wait_pin_fct(&ctx->calls.wait_pin_high, 2, HIGH);
if (res & CORO_RES_PENDING)
return res;
uint64_t button_pressed_end_time = millis();
led_blink_duration_ms =
button_pressed_end_time - (ctx->button_pressed_start_time);
coro_cond_var_notify(&reset_led_signal);
}
return CORO_RES_DONE;
}
}
}
In this form, it's clear how the macro system rewrites the coroutine into a state machine. Each CALL effectively becomes a case label in a switch, and ctx->state tracks which part of the function should run next time it's resumed. All local state and subroutine call contexts are persistently stored in a heap-allocated (or static) struct rather than on the call stack.
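The post doesn't show the macro definitions themselves, but from the expansion above we can reconstruct roughly what CALL has to do. The following is only a sketch derived by pattern-matching that output; the real macro in the repository may differ in naming and details:

#define CALL(res_var, label, fn, ...)                              \
  /* remember where to resume and reset the callee's context */    \
  ctx->state = coro_state_##label;                                 \
  ctx->calls.label.state = (enum fn##_fct_state)0;                 \
  case coro_state_##label:                                         \
    (res_var) = fn##_fct(&ctx->calls.label, ##__VA_ARGS__);        \
    if ((res_var) & CORO_RES_PENDING)                              \
      return (res_var);

Note that this sketch relies on the GNU ,##__VA_ARGS__ extension so that calls without extra arguments (like coro_yield) expand without a trailing comma, and it is not wrapped in do { } while (0), which is acceptable here because CALL is only ever used as a standalone statement.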
A similar idea is presented in the following blog article by Simon Tatham, which describes the reception this kind of trick tends to get quite aptly:
Of course, this trick violates every coding standard in the book. Try doing this in your company's code and you will probably be subject to a stern telling off if not disciplinary action! You have embedded unmatched braces in macros, used
case
within sub-blocks, [...] It's a wonder you haven't been fired on the spot for such irresponsible coding practice. You should be ashamed of yourself.
Let's now examine the coroutine wait_ms, which delays execution for a specified number of milliseconds:
CORO(wait_ms,
CORO_ARGS(uint64_t delay),
CORO_LOCALS(uint64_t end_time;),
CORO_CALLS(CORO_CALL(wait_ms_yield, coro_yield)), {
coro_res_t res;
LOCAL(end_time) = millis() + delay;
while (LOCAL(end_time) >= millis()) {
CALL(res, wait_ms_yield, coro_yield);
}
return CORO_RES_DONE;
})
This coroutine waits until the current time surpasses a computed end time. During that time, it repeatedly yields via another coroutine called coro_yield, which typically represents a single iteration of cooperative scheduling—allowing other coroutines or main loop logic to run.
Let’s see what this expands to after the macros have been processed:
enum wait_ms_fct_state { wait_ms_state_initial = 0, coro_state_wait_ms_yield };
struct wait_ms_fct_ctx {
enum wait_ms_fct_state state;
uint64_t end_time;
union {
uint8_t _placeholder_;
struct coro_yield_fct_ctx wait_ms_yield;
} calls;
};
coro_res_t wait_ms_fct(struct wait_ms_fct_ctx *ctx, uint64_t delay) {
switch (ctx->state) {
case wait_ms_state_initial: {
coro_res_t res;
(ctx->end_time) = millis() + delay;
while ((ctx->end_time) >= millis()) {
ctx->state = coro_state_wait_ms_yield;
ctx->calls.wait_ms_yield.state = (enum coro_yield_fct_state)0;
case coro_state_wait_ms_yield:
res = coro_yield_fct(&ctx->calls.wait_ms_yield);
if (res & CORO_RES_PENDING)
return res;
}
return CORO_RES_DONE;
}
}
}
What's happening here is nearly identical in structure to the previous coroutine. The key idea is that the coroutine "remembers" where it was last suspended using the state enum, and all variables that need to persist across multiple invocations (such as end_time) live in the coroutine context.

The while loop is re-entered through a case label placed inside its body, and the call to coro_yield is transformed into its own resumable block. This allows the wait_ms coroutine to sleep without blocking other coroutines—achieving a non-blocking delay in a completely single-threaded environment.
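As a side note, the wait_pin coroutine used by button_recorder_fn isn't shown in this post (it lives in the full source). Assuming it follows the same pattern as wait_ms, its expanded form would look roughly like this sketch: poll the pin and yield until it reads the requested level.

enum wait_pin_fct_state { wait_pin_state_initial = 0, coro_state_wait_pin_yield };
struct wait_pin_fct_ctx {
  enum wait_pin_fct_state state;
  union {
    uint8_t _placeholder_;
    struct coro_yield_fct_ctx wait_pin_yield;
  } calls;
};
coro_res_t wait_pin_fct(struct wait_pin_fct_ctx *ctx, int pin, int level) {
  switch (ctx->state) {
  case wait_pin_state_initial: {
    coro_res_t res;
    /* Yield until the pin reads the requested level. */
    while (digitalRead(pin) != level) {
      ctx->state = coro_state_wait_pin_yield;
      ctx->calls.wait_pin_yield.state = (enum coro_yield_fct_state)0;
  case coro_state_wait_pin_yield:
      res = coro_yield_fct(&ctx->calls.wait_pin_yield);
      if (res & CORO_RES_PENDING)
        return res;
    }
    return CORO_RES_DONE;
  }
  }
}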
This is where the technique becomes really powerful. Even though we’re not using an operating system or real threads, we can simulate cooperative multitasking simply through clever code structure and macro expansion. By chaining coroutines together—each with their own saved state and local context—we can build complex behavior while keeping the control flow readable and sequential.
In essence, we’re trading stack frames and preemptive scheduling for persistent state and manual switching. The result is a system that feels intuitive to write, even though its implementation is deeply unconventional and, some might say, sacrilegious by conventional C standards.
We’ve seen how individual coroutine functions like wait_ms
or button_recorder_fn
are compiled down into state machines using macros. Now let’s look at how these coroutines actually run. That is, how they are scheduled, paused, resumed, and eventually completed. Here's the code that defines the core coroutine runtime system:
enum coro_task_state {
coro_task_state_not_started = 0,
coro_task_state_running = 1,
coro_task_state_waiting_for_execution = 2,
coro_task_state_parked = 3,
coro_task_state_finished = 4,
coro_task_state_failed = 5,
};
struct coro_task;
struct coro_task *current_task = NULL;
struct coro_executor {
struct coro_task *task_queue_head;
struct coro_task *task_queue_tail;
};
typedef coro_res_t (*coro_task_root_fct)(void *ctx);
struct coro_task {
enum coro_task_state state;
struct coro_executor *executor;
struct coro_task *next;
coro_task_root_fct root_fct;
uint8_t canceled;
void *context;
};
void coro_executor_enqueue_task(struct coro_executor *executor,
struct coro_task *task) {
assert(task->state == coro_task_state_waiting_for_execution);
assert(task->executor == executor);
struct coro_task **task_queue_tail = &executor->task_queue_tail;
if (*task_queue_tail)
(*task_queue_tail)->next = task;
else
executor->task_queue_head = task;
*task_queue_tail = task;
}
void coro_executor_start_task(struct coro_executor *executor,
struct coro_task *task) {
assert(task->state == coro_task_state_not_started);
assert(task->executor == NULL);
task->state = coro_task_state_waiting_for_execution;
task->executor = executor;
coro_executor_enqueue_task(executor, task);
}
void coro_executor_process(struct coro_executor *executor) {
struct coro_task **task_queue_head = &executor->task_queue_head;
while (*task_queue_head != NULL) {
struct coro_task *task = *task_queue_head;
if (task->state == coro_task_state_waiting_for_execution) {
task->state = coro_task_state_running;
current_task = task;
coro_res_t res = task->root_fct(task->context);
if (res == CORO_RES_DONE) {
task->state = coro_task_state_finished;
} else if (res == CORO_RES_CANCELED) {
task->state = coro_task_state_failed;
} else if (res == CORO_RES_PENDING) {
task->state = coro_task_state_parked;
} else if (res == CORO_RES_PENDING_NON_PARKING) {
task->state = coro_task_state_waiting_for_execution;
coro_executor_enqueue_task(executor, task);
} else {
assert(0);
}
}
*task_queue_head = task->next;
task->next = NULL;
}
executor->task_queue_tail = NULL;
current_task = NULL;
}
enum coro_yield_fct_state {
yield_state_init = 0,
yield_state_after_yield = 1,
};
struct coro_yield_fct_ctx {
enum coro_yield_fct_state state;
};
coro_res_t coro_yield_fct(struct coro_yield_fct_ctx *ctx) {
if (current_task->canceled)
return CORO_RES_CANCELED;
if (ctx->state == yield_state_init) {
ctx->state = yield_state_after_yield;
return CORO_RES_PENDING_NON_PARKING;
} else {
return CORO_RES_DONE;
}
}
The coro_executor is a minimal scheduler. It maintains a queue of coroutine tasks ready to run.
When coro_executor_process
is called, it:
- Picks the first queued task.
- Calls its coroutine function.
- Checks the return value to determine what to do next:
  - CORO_RES_DONE: The coroutine is finished.
  - CORO_RES_CANCELED: The coroutine was canceled mid-way - we take a look at that behavior later.
  - CORO_RES_PENDING: The coroutine is parked and will not run again until it is un-parked - more about parking later.
  - CORO_RES_PENDING_NON_PARKING: The coroutine yielded voluntarily and should be resumed as soon as possible (e.g., in the next loop).
- If a task yields with CORO_RES_PENDING_NON_PARKING, it's immediately re-enqueued.
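One detail the snippets leave open is how the coro_res_t values are encoded. The code relies on two properties: coroutines test results with res & CORO_RES_PENDING, while the executor compares them with ==, and ANY_CALL (shown later) ORs two results together. An encoding like the following would satisfy all of that; this is purely an assumption on my part, the actual values are defined in the full source:

/* Hypothetical encoding (assumption, not taken from the post): distinct bits,
   with CORO_RES_PENDING_NON_PARKING including the PENDING bit so that
   `res & CORO_RES_PENDING` catches both pending variants and ORing a
   non-parking result into a parked one stays non-parking. */
typedef enum {
  CORO_RES_DONE                = 1 << 0,
  CORO_RES_CANCELED            = 1 << 1,
  CORO_RES_PENDING             = 1 << 2,
  CORO_RES_PENDING_NON_PARKING = (1 << 2) | (1 << 3),
} coro_res_t;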
Let's now turn to the led_task, which is slightly more involved than the previous coroutines. This task's goal is to blink an LED on and off with a duration controlled by the user—but with a twist: it should react immediately to a resetLed event, which may be sent at any time via a condition variable. That means during each LED on/off phase, the task must wait for two things in parallel:

- A timeout (wait_ms) to elapse.
- A signal (coro_cond_var_wait) indicating that the blinking period has changed.
Crucially, we don’t care which of these completes first—but once one completes, we want to cancel the other. For example, if the resetLed
event is received, we no longer care about the timer finishing, and vice versa. That’s where the ANY_CALL
macro and coroutine cancelation come into play.
Here’s what the coroutine looks like:
CORO(led_task_fn,
CORO_NO_ARGS,
CORO_NO_LOCALS,
CORO_CALLS(
CORO_ANY_CALL(wait_a, wait_ms, coro_cond_var_wait),
CORO_ANY_CALL(wait_b, wait_ms, coro_cond_var_wait)), {
coro_res_t res_wait_ms;
coro_res_t res_reset_led;
while (true) {
digitalWrite(LED_BUILTIN, LOW);
ANY_CALL(res_wait_ms, res_reset_led, wait_a,
wait_ms, (led_blink_duration_ms),
coro_cond_var_wait, (&reset_led_signal)
);
if (res_reset_led == CORO_RES_DONE) continue;
digitalWrite(LED_BUILTIN, HIGH);
ANY_CALL(res_wait_ms, res_reset_led, wait_b,
wait_ms, (led_blink_duration_ms),
coro_cond_var_wait, (&reset_led_signal)
);
if (res_reset_led == CORO_RES_DONE) continue;
}
return CORO_RES_DONE;
})
Here is the same coroutine after macro expansion:
enum led_task_fn_fct_state {
led_task_fn_state_initial = 0,
coro_state_wait_a,
coro_state_wait_b
};
struct led_task_fn_fct_ctx {
enum led_task_fn_fct_state state;
union {
uint8_t _placeholder_;
struct {
struct wait_ms_fct_ctx a;
struct coro_cond_var_wait_fct_ctx b;
} wait_a;
struct {
struct wait_ms_fct_ctx a;
struct coro_cond_var_wait_fct_ctx b;
} wait_b;
} calls;
};
coro_res_t led_task_fn_fct(struct led_task_fn_fct_ctx *ctx) {
switch (ctx->state) {
case led_task_fn_state_initial: {
coro_res_t res_wait_ms;
coro_res_t res_reset_led;
while (1) {
digitalWrite(LED_BUILTIN, LOW);
ctx->state = coro_state_wait_a;
ctx->calls.wait_a.a.state = (enum wait_ms_fct_state)0;
ctx->calls.wait_a.b.state = (enum coro_cond_var_wait_fct_state)0;
case coro_state_wait_a:
res_wait_ms = wait_ms_fct(&ctx->calls.wait_a.a, led_blink_duration_ms);
res_reset_led = coro_cond_var_wait_fct(&ctx->calls.wait_a.b, &reset_led_signal);
if (res_wait_ms & CORO_RES_DONE && !(res_reset_led & CORO_RES_DONE)) {
current_task->canceled++;
res_reset_led =
coro_cond_var_wait_fct(&ctx->calls.wait_a.b, &reset_led_signal);
current_task->canceled--;
} else if (res_reset_led & CORO_RES_DONE && !(res_wait_ms & CORO_RES_DONE)) {
current_task->canceled++;
res_wait_ms = wait_ms_fct(&ctx->calls.wait_a.a, led_blink_duration_ms);
current_task->canceled--;
}
if ((res_wait_ms | res_reset_led) & CORO_RES_PENDING)
return (res_wait_ms | res_reset_led);
if (res_reset_led == CORO_RES_DONE)
continue;
digitalWrite(LED_BUILTIN, HIGH);
ctx->state = coro_state_wait_b;
ctx->calls.wait_b.a.state = (enum wait_ms_fct_state)0;
ctx->calls.wait_b.b.state = (enum coro_cond_var_wait_fct_state)0;
case coro_state_wait_b:
res_wait_ms = wait_ms_fct(&ctx->calls.wait_b.a, led_blink_duration_ms);
res_reset_led = coro_cond_var_wait_fct(&ctx->calls.wait_b.b, &reset_led_signal);
if (res_wait_ms & CORO_RES_DONE && !(res_reset_led & CORO_RES_DONE)) {
current_task->canceled++;
res_reset_led =
coro_cond_var_wait_fct(&ctx->calls.wait_b.b, &reset_led_signal);
current_task->canceled--;
} else if (res_reset_led & CORO_RES_DONE && !(res_wait_ms & CORO_RES_DONE)) {
current_task->canceled++;
res_wait_ms = wait_ms_fct(&ctx->calls.wait_b.a, led_blink_duration_ms);
current_task->canceled--;
}
if ((res_wait_ms | res_reset_led) & CORO_RES_PENDING)
return (res_wait_ms | res_reset_led);
if (res_reset_led == CORO_RES_DONE)
continue;
}
return CORO_RES_DONE;
}
}
}
The ANY_CALL
macro launches two sub-coroutines in parallel and waits for either to complete. When one finishes, the macro manually increments the canceled
flag of the current_task
before running the still-pending coroutine again. This triggers coroutine-level cancelation.
Cancelation in this system is cooperative and opt-in. Each coroutine can check whether it's been canceled by inspecting the current_task->canceled flag. If set, the coroutine is expected to exit early (typically returning CORO_RES_CANCELED). This design allows you to safely "abort" a pending coroutine without requiring real preemption or risking inconsistent state.
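The coro_cond_var_wait coroutine shown below builds on a coro_park coroutine that simply parks the current task until someone un-parks it; coro_park itself isn't shown in the post. By analogy with coro_yield_fct above, I'd expect it to look roughly like this sketch (the details are assumptions):

/* Hypothetical sketch of coro_park (not shown in the post): park the task
   with a plain CORO_RES_PENDING and honor cooperative cancelation. */
enum coro_park_fct_state {
  park_state_init = 0,
  park_state_after_park = 1,
};
struct coro_park_fct_ctx {
  enum coro_park_fct_state state;
};
coro_res_t coro_park_fct(struct coro_park_fct_ctx *ctx) {
  if (current_task->canceled)
    return CORO_RES_CANCELED;
  if (ctx->state == park_state_init) {
    ctx->state = park_state_after_park;
    return CORO_RES_PENDING; /* the executor will mark the task as parked */
  }
  return CORO_RES_DONE; /* we were resumed via coro_unpark_task */
}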
One practical and critical use of this coroutine cancelation system appears in the implementation of condition variables, specifically within the coro_cond_var_wait
coroutine.
A coro_cond_var
maintains a linked list of waiters, where each waiter is a coro_cond_var_waiter
struct embedded directly within the coroutine’s local context. This is an efficient way to avoid heap allocations, but it introduces a major caveat: if the coroutine is canceled or finishes before it’s signaled, and it does not clean up after itself, the list would contain a dangling pointer — a reference to a context that no longer exists. That’s undefined behavior waiting to happen.
To prevent this, coro_cond_var_wait
takes special care to remove its waiter from the list if the coroutine is canceled before being signaled.
Here's how this works:
void coro_unpark_task(struct coro_task *task) {
assert(task->state != coro_task_state_not_started);
assert(task->executor != NULL);
if (task->state == coro_task_state_parked) {
task->state = coro_task_state_waiting_for_execution;
coro_executor_enqueue_task(task->executor, task);
}
}
enum coro_cond_var_waiter_state {
coro_cond_var_waiter_idle = 0,
coro_cond_var_waiter_waiting = 1,
coro_cond_var_waiter_signaled = 2,
};
struct coro_cond_var_waiter {
enum coro_cond_var_waiter_state state;
struct coro_cond_var_waiter *next;
struct coro_task *parked_task;
};
struct coro_cond_var {
struct coro_cond_var_waiter *waiter_head;
};
void coro_cond_var_notify(struct coro_cond_var *cond_var) {
struct coro_cond_var_waiter **waiter_head = &cond_var->waiter_head;
while (*waiter_head != NULL) {
if ((*waiter_head)->state == coro_cond_var_waiter_waiting) {
(*waiter_head)->state = coro_cond_var_waiter_signaled;
coro_unpark_task((*waiter_head)->parked_task);
}
*waiter_head = (*waiter_head)->next;
}
}
static void coro_cond_var_add_waiter(struct coro_cond_var *cond_var,
struct coro_cond_var_waiter *waiter) {
struct coro_cond_var_waiter **waiter_head = &cond_var->waiter_head;
waiter->next = *waiter_head;
*waiter_head = waiter;
}
static void coro_cond_var_remove_waiter(struct coro_cond_var *cond_var,
struct coro_cond_var_waiter *waiter) {
for (struct coro_cond_var_waiter **waiter_head = &cond_var->waiter_head;
*waiter_head != NULL; waiter_head = &(*waiter_head)->next) {
if (*waiter_head == waiter) {
*waiter_head = (*waiter_head)->next;
break;
}
}
}
CORO(coro_cond_var_wait, CORO_ARGS(struct coro_cond_var *cond_var),
CORO_LOCALS(struct coro_cond_var_waiter waiter;),
CORO_CALLS(CORO_CALL(coro_cond_var_wait_park, coro_park)), {
coro_res_t res;
coro_cond_var_add_waiter(cond_var, &LOCAL(waiter));
do {
LOCAL(waiter).state = coro_cond_var_waiter_waiting;
LOCAL(waiter).parked_task = current_task;
CALL(res, coro_cond_var_wait_park, coro_park);
if (res == CORO_RES_CANCELED && LOCAL(waiter).state != coro_cond_var_waiter_signaled) {
coro_cond_var_remove_waiter(cond_var, &LOCAL(waiter));
return CORO_RES_CANCELED;
}
} while (LOCAL(waiter).state != coro_cond_var_waiter_signaled);
return CORO_RES_DONE;
})
This mechanism highlights a broader theme in this coroutine system: every coroutine is responsible for cleaning up its own mess—especially in the face of cancelation. Since we're embedding waiter objects directly into coroutine stack frames (which may be on the heap or static memory), it’s essential to de-register those objects before the stack goes away.
Now we can stitch everything together. The snippet below shows how the system is started and run in an Arduino-like environment:
DECLARE_TASK(led_task, led_task_fn);
DECLARE_TASK(button_task, button_recorder_fn);
DECLARE_EXECUTOR(exe);
void setup() {
pinMode(LED_BUILTIN, OUTPUT);
pinMode(BUTTON_PIN, INPUT_PULLUP);
coro_executor_start_task(&exe, &led_task);
coro_executor_start_task(&exe, &button_task);
}
void loop() {
coro_executor_process(&exe);
}
Each coroutine task is declared and initialized ahead of time. When setup runs, the tasks are registered with the executor and marked as ready. In the loop function—which is repeatedly called by the Arduino runtime—the executor resumes each coroutine that is ready to make progress.
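The DECLARE_TASK and DECLARE_EXECUTOR macros aren't expanded in the post either. Given the coro_task and coro_executor structs from earlier, I'd expect them to boil down to statically allocated objects along these lines (a sketch under assumptions; the shared globals reset_led_signal and led_blink_duration_ms are added here only to complete the picture):

/* Hypothetical expansion of DECLARE_EXECUTOR(exe) and
   DECLARE_TASK(led_task, led_task_fn); names and details are guesses. */
struct coro_cond_var reset_led_signal = { NULL }; /* signal shared by both tasks */
uint64_t led_blink_duration_ms = 1000;            /* half-period, updated by the button task */

static struct coro_executor exe = { NULL, NULL };

static struct led_task_fn_fct_ctx led_task_ctx;   /* zero-initialized: starts in the initial state */
static coro_res_t led_task_root(void *ctx) {
  return led_task_fn_fct((struct led_task_fn_fct_ctx *)ctx);
}
static struct coro_task led_task = {
  .state = coro_task_state_not_started,
  .executor = NULL,
  .next = NULL,
  .root_fct = led_task_root,
  .canceled = 0,
  .context = &led_task_ctx,
};
/* button_task is declared the same way around button_recorder_fn_fct. */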
If you are interested in the missing parts of the implementation, you'll find the full source code here.
Final words
This whole setup is an unholy alliance of C macros, state machines, and sheer willpower. It's clever, it's educational, and yes—it's kind of fun. But let’s be honest: this is not how sane people should write software in 2025.
If you're thinking, “Wow, that’s a lot of boilerplate just to blink an LED asynchronously,” you're absolutely right. What you've built here is essentially a poor man's async runtime. A very educational, very brave, very macro-ridden poor man's async runtime.
So to be a little bit salty:
Rust gives you everything we've scratched and clawed together here, but natively. Async/await is stackless and zero-cost, with real compiler guarantees, memory safety, cancelation semantics, and futures that compose cleanly—no unhygienic macro hell, no manual cancelation logic duct-taped together.
What we’ve done in C is a fascinating look under the hood. But if you're doing serious work, or you want to ship something that won’t wake you up at 3AM with a mysterious hard crash—save yourself the pain.
Update
After I had already completed the post, I came across an alternative approach by chance: Adam Dunkels beat me to it with his brilliant Protothreads, and his approach is arguably more elegant.
Instead of generating explicit state enums like we did, Protothreads cleverly use the __LINE__
macro to represent state—yes, the actual line number in the source file becomes the state machine’s program counter. It's an audacious hack that makes our elaborate macro gymnastics look almost… wholesome.
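To give a flavor of the idea (in spirit only; this is not the actual Protothreads API, which adds local continuations, a yield flag, and more), the core of the __LINE__ trick can be boiled down to a few macros:

/* Simplified sketch of the __LINE__-as-state trick (not real Protothreads).
   The context must be zero-initialized before the first call, and two
   CO_YIELD invocations must not share a source line. */
struct lc { unsigned resume_line; };
#define CO_BEGIN(c) switch ((c)->resume_line) { case 0:
#define CO_YIELD(c) do { (c)->resume_line = __LINE__; return 0;  \
                         case __LINE__:; } while (0)
#define CO_END(c)   } return 1

/* Example: a coroutine that yields twice, then reports completion. */
int blink_twice(struct lc *c) {
  CO_BEGIN(c);
  CO_YIELD(c);
  CO_YIELD(c);
  CO_END(c);
}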
You can find a full macro expansion example here.