In this article, we will begin the process of writing a microkernel OS for the Cortex-M4 by building some simple task switching logic. The initial goal for this OS is to have support for unprivileged tasks running in thread mode. We will use cooperative multitasking early on so that we can implement some I/O bound tasks, and we will extend this to a pre-emptive multitasking model later on using the system tick timer. In this article, we’ll write a very simple cooperative multitasking implementation, which we will extend to support a barebones implementation of the Actor Model in a subsequent article.

We need a few things to get started. First, we need to implement heap allocation by plugging in some facilities expected by Newlib. We’ll allocate control structures for tasks and task stack on the heap. Next, we need to implement a queue data structure which we’ll use in this article to manage tasks, and in the next article to manage messaging. Then, we will need to make some modifications to the linker control file and the interrupt vector table we implemented in the last article in order to support our expanded heap and the service call mechanism we will use to perform context switching. While we are at it, we will implement some dummy hooks for important exceptions so if something goes wrong, we can do some postmortem analysis with the debugger. Next, we’ll write code to set up the tasks and write a skeleton service call handler. We will run what we have so far on the board to make sure that we can enter and exit the service call handler successfully. Finally, we will implement the task switching logic for the service call handler and verify that we can successfully switch between two independent tasks. If all goes well, we will have the beginning of a simple microkernel that we can extend to do some real world work. So, let’s begin.

malloc and sbrk

Every C developer understands malloc. The malloc implementation for the build of Newlib that we configured in the last article is significantly scaled down. It is good enough for our purposes, but it’s of the ordinary variety. Memory regions allocated include an 8 byte overhead that maintains malloc’s internal bookkeeping. Realistically, one should avoid dynamic memory allocation in an embedded system. We are using malloc because it sidesteps the need for us to write our own slab allocator. We could save some overhead at the cost of flexibility if, for instance, we wrote a simple bump allocator. But, this would come at the cost of easy freeing of memory. In point of fact, we have to write a bump allocator anyway in order to use malloc at all, but this allocator will just do some very simple memory management of our heap and defer the bigger decisions to malloc itself.

Because Newlib is platform agnostic, it provides hooks that allows us to connect the C Runtime to the operating system, or in our case, bare metal. This is why, for instance, we had to write the _exit method in the last article. This method typically deals with the operating-system specific details about how to terminate a process. The malloc function requires a similar hook method, called _sbrk. The _sbrk method is a hook between Newlib and the OS for growing the size of the application’s heap by the given number of bytes. It returns the pointer to a memory region that is at least that number of bytes, or -1 if memory allocation fails.

Our version of _sbrk uses the symbols, __HeapBase and __HeapLimit, defined in our linker control file to grow the heap based on requests from malloc. When the heap size has been exhausted, _sbrk returns -1. This implementation is a very simple bump allocator that bumps along the heap_end pointer until the heap is exhausted.

char* heap_end = 0;
void* _sbrk(int incr)
{
    extern char __HeapBase;
    extern char __HeapLimit;
    char* prev_heap_end;

    if (heap_end == 0)
    {
        heap_end = &__HeapBase;
    }
    prev_heap_end = heap_end;

    if (heap_end + incr > &__HeapLimit)
    {
        return (void*)-1;
    }

    heap_end += incr;
    return prev_heap_end;
}

When malloc gets a -1 from _sbrk, it returns NULL to the caller. Alternatively, we could add a debug-time assertion to the body of the second if statement to catch an out-of-memory error at runtime. There is little headroom in any embedded application, so such an assertion can come in handy.

Queueing

We will use queues quite a bit in this OS. They give us a light-weight way to manage linear data. We can enqueue and dequeue items in $O(1)$ time, and we can even implement some more interesting views on top of this data structure for insertion, such as a priority queue, that allows us to enqueue an item with priority in $O(1)$ time. More on that in a later article.

The data structure itself is made up of a queue that holds nodes. The queue has a head and a tail. The nodes are double-linked list nodes, which have a prev and a next pointer. We could implement a simple queue using only a next pointer, but this would make it more difficult to implement some of the more interesting algorithms we will need later on. So, we’ll deal with the added overhead of maintaining a doubly-linked data structure.

/**
 * \brief queue node.
 *
 * Any data structure can extend a queue by adding this header to the beginning
 * of the structure.  The macros below automatically cast arguments to qnode_t,
 * which makes it easy to enqueue and dequeue items.
 */
typedef struct qnode
{
    struct qnode* prev;
    struct qnode* next;
} qnode_t;

/**
 * \brief queue.
 *
 * The queue has a head and a tail.  Items are normally added to the tail and
 * removed from the head.  There are two edge cases we must handle: if the queue
 * is empty, then both head and tail will be null.  In this case, the head needs
 * to be set to the tail.  If the last element has been removed from the queue,
 * then the head will be null.  In this case, the tail needs to be set to the
 * head.
 */
typedef struct qqueue
{
    qnode_t* head;
    qnode_t* tail;
} qqueue_t;

For now, we only need three operations. QINIT initializes a new queue. QENQUEUE adds an item to the end of the queue. QDEQUEUE removes an item from the front of the queue and assigns the variable provided as the second argument to this item. These operations are implemented as macros to force inlining which is important when operating under a critical section. We want to begin a critical section, do a minimal amount of work, and release this critical section as quickly as possible. Commonly, we will be doing one of three things under a critical section: dequeuing, enqueuing, or dequeuing and immediately enqueuing. This latter operation might be to shift one item from one queue to another, or it might be to shift one item from a global to a queue and another from the queue back to the global. For instance, the current_task pointer will be such a global. During a task switch, the task pointed to by the current_task pointer will be added to a process queue – either the run queue or a block queue. Another task will be removed from a queue to become the current task, and the current_task pointer will be updated to reference this task. Later on, we will use a similar mechanism of shifting items between queues and pointers, as well as copying data from one task space to another, to facilitate message passing.

/**
 * Initialize a queue.
 */
#define QINIT(queue) \
    (queue)->head = (queue)->tail = NULL

/**
 * Enqueue an element; setting the head to the tail if the queue is empty.
 */
#define QENQUEUE(queue, elem) \
    if ((queue)->tail) \
        (queue)->tail->next = (elem); \
    else \
        (queue)->head = (elem); \
    (elem)->prev = (queue)->tail; \
    (elem)->next = NULL; \
    (queue)->tail = (elem)

/**
 * Dequeue an element; setting the tail to NULL if this is the last element.
 */
#define QDEQUEUE(queue, elem) \
    if ((queue)->head) \
    { \
        (elem) = (queue)->head; \
        (queue)->head = (queue)->head->next; \
        (elem)->next = (elem)->prev = NULL; \
        if (!(queue)->head) \
        { \
            (queue)->tail = NULL; \
        } \
        else \
        { \
            (queue)->head->prev = NULL; \
        } \
    } \
    else \
        (elem) = NULL

These queue operations should be familiar to anyone who has worked with the data structure before. Most of the complexity comes from handling edge cases. There are two of note. The first is adding an element to an empty queue. The second is removing the last element from a queue. In both cases, the head and tail pointers must be cleaned up to maintain the two possible states of the data structure. We’ll add operations to splice queues, and add / remove elements from the middle of a queue in a later article. For now, this is sufficient for our needs.

Linker Control Changes

In the previous article, we implemented a linker control file for the K22 part on the K22F board. The heap size in that file was large enough for testing but not large enough for this example. To add some headroom, we’ll increase this to 32k.

SZ_HEAP                 = DEFINED(__sz_heap__)         ? __sz_heap__  : 0x8000;

Interrupt Vector Changes

We need to modify the interrupt vector table to implement the Supervisor Call mechanism and to handle a few potential exceptions. We will add stubs for these exceptions which will loop forever if the exceptions are hit. If the CPU locks up, we will see the exception we hit when we break in.

The Hard Fault exception occurs when the handling of one fault causes another fault, if there is a bus error when attempting to handle another type of exception, or if an attempt is made to make a Supervisor Call by code operating at the priority of this call or higher. The Bus Fault exception occurs when an attempt is made to access an invalid memory region, when a transfer size is invalid, or when the processor is in the wrong privilege level to write to a particular hardware register. The Usage Fault exception occurs when an undefined instruction is encountered, when an invalid interrupt return address is used or when the saved program counter for the interrupt return is invalid, or when an unaligned memory address is used for load / store multiple instructions. This fault can also be configured to trigger during a divide-by-zero or an unaligned memory access. In each of these cases, there is little we can do at this point other than to halt the processor and break in to see why the exception occurred.

The Supervisor Call exception occurs when the SVC instruction is executed. This mechanism allows unprivileged code to call into the kernel. There are several different ways this instruction can be used. An optional byte argument to the instruction provides a mechanism to identify different supervisor calls. However, we will be treating this call as if it were an EABI method call. We will reserve the first argument, in R0, for the service method and submethod. Three additional arguments, passed in R1, R2, and R3, will provide arguments to the service method. Since this follows the EABI for ARM, this corresponds to the first four arguments to a C function, as long as the arguments are word size or smaller. As such, with this bit of hackery, we can treat our svc_handler method as a native C method. We’ll wrap this method with a simple thunk that will preserve the current task’s context on entry and then restore the (potentially different) current task’s context on exit, but other than this low-level hackery, the calling convention will follow EABI.

Finally, we will stub the PendSV handler for now. In a future article, when we start enabling interrupts that may preempt each other, we will move the context switch to our pendsv_handler, which will ensure that high-priority interrupts are serviced by the user-mode drivers before lower-priority interrupts.

Here are the changes to the first 16 interrupt vectors required for our example. Note that the address we want for our assembler handlers is actually the address incremented by 1, which sets the low bit of the address. This tells the CPU that these are Thumb mode functions.

#define THUMB_METHOD(x)         (x)+0x01

    .section .isr_vector, "a"
    .align 2
    .global __isr_vector
__isr_vector:
    .long   __StackTop                      /* Top of Stack */
    .long   _start                          /* Reset handler */
    .long   0x0                             /* NMI Handler -- disabled */
    .long   THUMB_METHOD(hard_fault)        /* Hard Fault Handler -- stubbed */
    .long   0x0                             /* MPU Fault Handler -- stubbed */
    .long   THUMB_METHOD(bus_fault)         /* Bus Fault Handler -- stubbed */
    .long   THUMB_METHOD(usage_fault)       /* Usage Fault Handler -- stubbed */
    .long   0x0                             /* Reserved */
    .long   0x0                             /* Reserved */
    .long   0x0                             /* Reserved */
    .long   0x0                             /* Reserved */
    .long   THUMB_METHOD(svc_handler_thunk) /* SVCall Handler */
    .long   0x0                             /* Debug Monitor Handler */
    .long   0x0                             /* Reserved */
    .long   THUMB_METHOD(pendsv_handler)    /* PendSV Handler -- stubbed */
    .long   0x0                             /* SysTick Handler -- disabled */

The stub methods just enter an infinite loop. We implement basically the same code four times so that we can take advantage of the symbol printed by GDB when we break into one of these faults to give us the first hint regarding how to resolve the fault.

    .section "text", "a"
    .global hard_fault
hard_fault:
    b hard_fault

    .global bus_fault
bus_fault:
    b bus_fault

    .global usage_fault
usage_fault:
    b usage_fault

    .global pendsv_handler
pendsv_handler:
    b pendsv_handler

We will examine the svc_handler_thunk function in another section below.

The Simplified Task

For this example, we will build a simple task data structure that provides just enough information to implement cooperative multitasking. By giving the task_t structure a qnode_t header, we can cast it down to a queue node, which will allow us to enqueue and dequeue tasks on the task queue. We will add a flags attribute, a pointer to the stack for this task, the entry for this task, an opaque argument that can be passed to the task as a context, and a queue of tasks blocked by this task. The latter queue will be used later on when we implement messaging to track tasks trying to send messages to this task if the message queue for this task is full.

The task_entry type is used to define the entry point for a task. It takes one argument, which is an opaque pointer to a user-defined context structure.

typedef void (*task_entry)(void*);

typedef struct task
{
    qnode_t hdr;
    uint32_t flags;
    uint32_t psp;
    task_entry start;
    void* context;
    qqueue_t blockers;
} task_t;

Task Switching Structures

To support task switching, we instantiate a queue for managing tasks and a global pointer to the current task.

qqueue_t task_queue;
task_t* current_task = NULL;

The kernel will keep current_task pointed to the currently executing task and will use task_queue for tracking tasks ready to execute.

The ARM Embedded Application Binary Interface

Before we dig into task initialization, we need to consider the ARM EABI. This summary is by no means a complete description, but it is enough for our purposes. The R0, R1, R2, and R3 registers are used for passing parameters and as scratch registers in a function. When a function is called, these registers must be preserved if they are to be used after the function returns. These four registers are also used for return values. R0 holds a 32-bit return value. R0 and R1 are used to hold a 64-bit return value. The other registers, with the exception of R12, should be preserved by the called function if they are changed and restored before the called function is returned. The R12 register acts as an intra-procedural scratch register. Generally, this register isn’t used for much in normal application code, but it is preserved for an interrupt / supervisor call. The other registers are callee-save registers. They can be used by the callee, but the callee must preserve them before returning control to the caller.

Interrupts and exceptions in the Cortex-M architecture automatically preserve the R0 through R3 registers on the currently active stack, as well as the xPSR register, the PC register, the LR register, and the R12 register. If any other registers will be used by the interrupt handler, these must be preserved in code. By design, this automated preservation method matches the EABI specification, which means that we can use C functions compiled for the ARM EABI specification directly as interrupt handlers. Unlike other platforms, where some thunk code is needed to enter a C function from an interrupt, the ARM hardware takes care of all of the heavy lifting for us.

Task Context Save and Restore

That being said, additional work is necessary to preserve the current task context prior to initiating a task switch. Specifically, the R4 through R11 registers must be preserved. The other registers are preserved automatically by the hardware. Code executing in an interrupt runs in privileged mode, which means that stack changes are reflected on the MSR stack. To preserve these registers, we must store them on the shadowed PSP stack, which is used by the task. Hence, we make use of a simple thunk method called svc_handler_thunk to wrap the svc_handler method. This thunk manages the task context save and restore, and then the svc_handler method handles the mechanism of the task switch as appropriate.

In the svc_handler_thunk, we start by preserving the LR register on the MSP stack. The value provided by the LR register on entry is a special value used by the CPU to re-enter the unprivileged task. When this value is encountered as a jump or branch link value in handler mode, the CPU automatically returns from an interrupt or calls the next linked interrupt of lower priority. We then save the remaining context of the unprivileged task that was not preserved by the interrupt by storing these registers to the PSP stack. We then call the C svc_handler method, which performs a context switch by changing the current task and updating the PSP register. Finally, on return from the svc_handler, we pop the original LR register, restore the context of the current task – which might have changed due to the svc_handler – and then return to this task by branching to the LR register.

    .global svc_handler_thunk
    .thumb
svc_handler_thunk:
    push {lr}
    /* save context */
    mrs r12, psp
    stmdb r12!, {r4-r11}
    msr psp, r12
    bl svc_handler
    pop {lr}
    /* restore context */
    /* TODO - move this to pendsv */
    mrs r12, psp
    ldmfd r12!, {r4-r11}
    msr psp, r12
    bx lr

We’ll tackle the svc_handler once we’ve done some task setup work.

Task Setup

To set up a task, we need to replicate the stack context that would exist after saving the context within the kernel. Entry into an interrupt or exception preserves, in stack order, the xPSR, PC, LR, R12, R3, R2, R1, and R0 registers. Our context saving routines preserve, in stack order, the R11, R10, R9, R8, R7, R6, R5, and R4 registers. Because our entry method is an EABI C function, we pass the context argument for this task entry method in the R0 register. The complete task_init method, which initializes a task which we then place on the task queue, is presented below. For illustration, the register values are placed onto the stack as they would be pushed on the stack by the CPU. Typically, an RTOS would create a set of structures to deal with the interrupt and context registers for the convenience of struct dereferencing. But, this code and the struct code both compile down to the same machine code once the GCC optimizer has a chance to examine it.

void task_init(task_t* task, size_t stack_size, task_entry entry, void* context)
{
    memset(task, 0, sizeof(task_t));
    task->psp = (uint32_t)malloc(stack_size) + stack_size;
    task->start = entry;
    task->context = context;
    QINIT(&task->blockers);

    /* set up task entry. */
    uint32_t* stack = (uint32_t*)task->psp;
    --stack; *stack = (uint32_t)0x21000000;                 /* xPSR */
    --stack; *stack = (uint32_t)task->start & 0xFFFFFFFE;   /* PC */
    --stack; *stack = (uint32_t)&task_end;                  /* LR */
    --stack; *stack = 0;                                    /* R12 */
    --stack; *stack = 0;                                    /* R3 */
    --stack; *stack = 0;                                    /* R2 */
    --stack; *stack = 0;                                    /* R1 */
    --stack; *stack = (uint32_t)task->context;              /* R0 */
    --stack; *stack = (uint32_t)0x00000000;                 /* R11 */
    --stack; *stack = (uint32_t)0x00000000;                 /* R10 */
    --stack; *stack = (uint32_t)0x00000000;                 /* R9 */
    --stack; *stack = (uint32_t)0x00000000;                 /* R8 */
    --stack; *stack = (uint32_t)0x00000000;                 /* R7 */
    --stack; *stack = (uint32_t)0x00000000;                 /* R6 */
    --stack; *stack = (uint32_t)0x00000000;                 /* R5 */
    --stack; *stack = (uint32_t)0x00000000;                 /* R4 */
    task->psp = (uint32_t)stack;
}

This method references a task_end method, which is entered if the task exits. This method for now just enters an infinite loop, which we can use in this example to detect a task exit. This could easily be changed to a method that makes a supervisor call letting the kernel know that this task is exiting.

void task_end()
{
    /* TODO - turn into a SVC that removes the task. */
    while (1);
}

Critical Sections

When modifying the global task structure, we want to enter a critical section before making changes, do these changes as quickly as possible, then exit the critical section and re-enable interrupts before performing a task switch.

Two inline functions, sys_enter_critical, and sys_exit_critical, perform these operations.

inline void sys_enter_critical()
{
    asm volatile ("cpsid i\n\t");
}

inline void sys_exit_critical()
{
    asm volatile ("cpsie i\n\t"
                  "isb\n\t");
}

Supervisor Calls

We define two supervisor calls in this example. The first, sys_start, starts up the system and switches to the first task. The second, sys_yield, yields the current task and switches to the next task.

#define SVC_YIELD               0x00000000
#define SVC_START               0x00010000

The equivalent system calls are defined in the following inline functions.

inline void sys_start()
{
    asm volatile ("mov r0, 0x00010000\n\t"
                  "svc 0\n\t" ::: "r0", "lr");
}   

inline void sys_yield()
{   
    asm volatile ("mov r0, #00000000\n\t"
                  "svc 0\n\t" ::: "r0", "lr");
}

When the svc instruction is encountered, the Supervisor Call exception is triggered by the CPU, which transfers control via the svc_handler_thunk to the svc_handler.

Supervisor Call Handler

The svc_handler method manages two calls in this example. The SVC_START call begins task switching and enters the first task. The SVC_YIELD call switches from the current task to the next task in the queue, and places the old task back onto the queue.

void svc_handler(uint32_t r0, uint32_t r1, uint32_t r3, uint32_t r4)
{
    qnode_t* new_task;

    switch (r0)
    {
        case SVC_YIELD:
            sys_enter_critical();
            current_task->psp = sys_get_psp();
            QENQUEUE(&task_queue, (qnode_t*)current_task);
            QDEQUEUE(&task_queue, new_task);
            current_task = (task_t*)new_task;
            sys_set_psp(current_task->psp);
            sys_exit_critical();
            break;
        case SVC_START:
            sys_enter_critical();
            QDEQUEUE(&task_queue, new_task);
            current_task = (task_t*)new_task;
            sys_set_psp(current_task->psp);
            sys_exit_critical();
            break;
        default:
            break;
    }
}

Two helper methods, sys_get_psp and sys_set_psp, are needed for setting the PSP register. These are included below. We also include the MSP flavors of these methods, which we will need when setting up our system to run in unprivileged thread mode.

inline uint32_t sys_get_psp()
{       
    uint32_t result = 0;
            
    asm volatile ("mrs %0, psp\n\t" : "=r" (result));
            
    return result;
}           
        
inline void sys_set_psp(uint32_t val)
{   
    asm volatile ("msr psp, %0\n\t" : : "r" (val));
}

inline uint32_t sys_get_msp()
{   
    uint32_t result = 0;
    
    asm volatile ("mrs %0, msp\n\t" : "=r" (result));
    
    return result;
}   

inline void sys_set_msp(uint32_t val)
{   
    asm volatile ("msr msp, %0\n\t" : : "r" (val));
}

Thread Mode

To place the system into thread mode, we need to modify the CONTROL register to instruct the CPU to use the PSP register for threads. We also need to switch to unprivileged mode before initiating the first task switch. The following bits are needed for the CONTROL register.

#define USE_PSP_IN_THREAD_MODE      (1 <<  1)
#define THREAD_MODE_IS_UNPRIVILEGED (1 <<  0)

These operations require some instruction barrier magic. For that, we define a sys_isb method that executes a barrier instruction. The sys_set_control method updates the CONTROL register.

inline void sys_isb() 
{   
    asm volatile ("isb\n\t");
}   

inline void sys_set_control(uint32_t val)
{
    asm volatile ("msr control, %0\n\t" : : "r" (val));
}

In the main method in this example, we create two tasks, named atask and btask, which resolve to the entry methods, afunc and bfunc respectively. We also need an initial PSP stack, which we use ever-so-briefly before switching to our first task. We’ll set the size of the stack to 512 bytes, but this is much more than we’ll need.

#define PSP_STACK_SIZE              0x200

In the main method, we will set up the task structure, initialize our tasks, then enter Thread Mode, before issuing the sys_start Supervisor Call to switch to the first task.

int main()
{
    uint32_t pspStart;
    task_t* atask;
    task_t* btask;

    QINIT(&task_queue);
    
    atask = (task_t*)malloc(sizeof(task_t));
    task_init(atask, 0x200, &afunc, 0);

    btask = (task_t*)malloc(sizeof(task_t));
    task_init(btask, 0x200, &bfunc, 0);
                  
    QENQUEUE(&task_queue, (qnode_t*)atask);
    QENQUEUE(&task_queue, (qnode_t*)btask);

    pspStart = (uint32_t)malloc(PSP_STACK_SIZE) + PSP_STACK_SIZE;
    sys_set_control(USE_PSP_IN_THREAD_MODE);

    sys_isb();
    sys_set_psp(pspStart);
    sys_isb();

    sys_set_control(USE_PSP_IN_THREAD_MODE | THREAD_MODE_IS_UNPRIVILEGED);
    sys_isb();

    /* start the first task */
    sys_start();

    /* never reached */
    return 0;
}

Example Tasks

For this example, we’ll create an afunc task and a bfunc task that each increment a global variable and then yield. This will allow us to set breakpoints in each function, and then confirm that task switching is working from GDB just by tracking the global variables during each execution step.

The two global variables we track are simply named a and b.

volatile int a = 0;
volatile int b = 0;

The afunc and bfunc tasks are below.

void afunc(void* ctx)
{   
    while (1)
    {
        ++a;
        sys_yield();
    }
}   
    
void bfunc(void* ctx)
{   
    while (1)
    {
        ++b;
        sys_yield();
    }
}

Testing Task Switching

In order to test this task switching example, we will set a breakpoint on the line incrementing the a variable in the afunc task. When this breakpoint is hit, we will use the following commands in gdb to print the current values of the a and b variables. This snippet assumes that the breakpoint we created is breakpoint 1.

command 1
print a
print b
end

When running the debugger, we should now see the following output:

(gdb) cont
Continuing.

Breakpoint 1, afunc (ctx=<optimized out>) at main.c:250
250        ++a;
$1 = 0
$2 = 0

At this point, the afunc task is about to increment a. The bfunc task has not yet run. Let’s continue.

(gdb) cont
Continuing.

Breakpoint 1, afunc (ctx=<optimized out>) at main.c:250
250        ++a;
$3 = 1
$4 = 1

Now, the bfunc task has run and yielded, incrementing the b variable by 1 and the afunc task has also incremented the a variable. From here, the task switching will continue indefinitely with the current run queue:

(gdb) cont
Continuing.

Breakpoint 1, afunc (ctx=<optimized out>) at main.c:250
250        ++a;
$5 = 2
$6 = 2
(gdb) cont
Continuing.

Breakpoint 1, afunc (ctx=<optimized out>) at main.c:250
250        ++a;
$7 = 3
$8 = 3

This demonstrates that our cooperative multitasking example is working. In the next article, we will implement notifications and message passing, which will provide all of the OS-level facilities that we need to implement the Pomodoro clock.

unconceived.net

Task Switching on the Cortex M4