I finally got my hands on a Pico, the new microcontroller board from Raspberry Pi. They were sold out for quite a while pretty much everywhere I looked, but it's hard to complain when the board only cost $4.
Unlike the other Pi boards, this one is based on an SoC with a Cortex-M core, so it's more geared to running single small programs than big multi-programming operating systems.
The Raspberry Pi team offers a prebuilt binary of MicroPython, with built-in control of GPIOs. After soldering on some headers, you're only minutes away from something like:
    >>> while True:
    ...     machine.Pin(0, machine.Pin.OUT).toggle()
    ...     time.sleep(1)
which I have to admit is pretty cool.
But obviously my real desire was to try out the C SDK. The SDK is a beast: around 2000 source files, plus required dependencies on CMake and the GNU ARM Embedded Toolchain. It's also pretty heavily opinionated: you're all but required to use CMake for your own code, and even the simple dozen-or-so-line "blink an LED" example builds some 47 source files—including a bootloader!
There are certainly good reasons for that, and the SDK does appear to make some really cool stuff available with little effort. But part of the joy of tinkering with these microcontroller boards is the ability to mostly avoid other people's code. So I took a different approach. Luckily, the Pico is admirably well documented, between Raspberry Pi's datasheets on the board and the SoC, and ARM's Cortex-M0+ documentation.
For example, the SoC datasheet tells us that, on power up, the core starts executing in the RP2040 boot ROM, which reads the first 256 bytes from the (off-chip, serial-attached) flash into SRAM. The last four bytes of that chunk are reserved for a CRC32, which must validate before the ROM will jump to it.
In a normal Pico C program, these 256 bytes are provided by the aforementioned bootloader. Its job is to make the rest of the program ready to run. Whereas a typical bootloader might simply read more data from disk, the Pico's bootloader can take advantage of the RP2040's execute-in-place functionality. In short, the external flash is fully memory-mapped, and it's fast enough simply to place the program counter in the flash memory range, and let it rip. Basically all the bootloader has to do here is to configure the hardware to communicate with the flash (over serial) and jump to the program.
But for this first program, I decided to try doing without the bootloader. For one thing, using the bootloader was a soft violation of the "only my code" goal. Moreover, using the bootloader would complicate my program: it would require prepending it at build time, and it also meant the main program would need an interrupt vector table (since one thing the bootloader does is to replace the ROM's vector table with the program's). And anyway, I was setting out to write a simple program that could definitely fit into the 256 bytes I could use before a bootloader was needed.
I proceeded to look into how to control the GPIO pins. On other microcontroller boards, this is usually just a matter of setting a few (usually memory-mapped) registers. Looking through the RP2040 datasheet for the relevant info, I was initially only able to find mention of the programmable I/O (PIO) blocks. These are basically versatile I/O coprocessors that can very precisely manipulate the GPIOs. These coprocessors have their own (very minimal) instruction set, and the Pico SDK even builds an assembler that spits out PIO machine code as a C header file.
Very cool—but, again, it's way beyond what I wanted here. It took me a while of flipping between datasheets, but I eventually figured out that there's a much simpler path to the GPIOs, through the "Single-cycle I/O" (SIO) interfaces on the Cortex-M0+s. As far as I can tell, this is basically an alternate system bus for simpler peripherals. Basically, transactions to certain addresses get routed to this auxiliary bus, which in our case contains our GPIO data registers.
Since either the PIO or the SIO system can control the GPIO pins, there's some sort of mux on the chip to decide which one is in control at any time. In fact, §1.4.3 of the RP2040 datasheet shows that there can be up to nine different subsystems in control, and each pin can be controlled by a different subsystem at any given time. So I needed to figure out how to configure one of those muxes to let me control a pin using SIO.
The control registers for the muxes themselves are on the main system bus. There are also other "pad" registers that control things like whether a particular GPIO pin is connected to a pullup/pulldown resistor, whether the pin is used for input or output, and various other features.
    .set APB_BASE, 0x40000000
    .set IO_BANK0_BASE, (APB_BASE + 0x14000)
    .set GPIO0_CTRL, (IO_BANK0_BASE + 0x004)
    .set FUNCTION_SIO, 5
    .set PADS_BANK0_BASE, (APB_BASE + 0x1c000)
    .set PADS_GPIO0, (PADS_BANK0_BASE + 0x4)
    .set SIO_BASE, 0xd0000000
    .set GPIO_OE_SET, (SIO_BASE + 0x24)

    .cpu cortex-m0plus
    .thumb

    main:
        // configure our GPIO pin to be driven by SIO
        ldr r0, =FUNCTION_SIO
        ldr r1, =GPIO0_CTRL
        str r0, [r1]

        // configure pad options, too
        ldr r0, =0x0
        ldr r1, =PADS_GPIO0
        str r0, [r1]

        // enable SIO output
        ldr r0, =0xffffffff
        ldr r1, =GPIO_OE_SET
        str r0, [r1]
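The listing hard-codes the registers for GPIO 0. As a rough host-side sketch (not something that runs on the Pico), the same addresses can be computed for any pin. The base addresses and the pin-0 offsets come from the listing; the per-pin strides (8 bytes per pin in the I/O bank, 4 bytes per pin in the pad bank) are my reading of the datasheet's register tables, and the helper names are made up for illustration.

```python
# Base addresses copied from the assembly listing.
APB_BASE = 0x40000000
IO_BANK0_BASE = APB_BASE + 0x14000
PADS_BANK0_BASE = APB_BASE + 0x1C000


def gpio_ctrl_addr(pin):
    """Address of the mux control register for a pin.

    Assumption: each pin owns a STATUS/CTRL register pair (8 bytes),
    with CTRL as the second word, matching GPIO0_CTRL = base + 0x004.
    """
    return IO_BANK0_BASE + 8 * pin + 0x4


def pads_gpio_addr(pin):
    """Address of the pad control register for a pin.

    Assumption: one 4-byte register per pin, starting at offset 0x4,
    matching PADS_GPIO0 = base + 0x4.
    """
    return PADS_BANK0_BASE + 0x4 + 4 * pin


def sio_pin_mask(pin):
    """Bit mask selecting a single pin in the SIO GPIO registers."""
    return 1 << pin


if __name__ == "__main__":
    for pin in (0, 25):
        print(pin, hex(gpio_ctrl_addr(pin)), hex(pads_gpio_addr(pin)), hex(sio_pin_mask(pin)))
```

Writing a single-pin mask to GPIO_OE_SET, rather than 0xffffffff, would enable output on just that pin.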
From there, I could go ahead and write to the GPIO data register on the SIO bus, using a simple spin loop as a delay.
    .set GPIO_OUT, (SIO_BASE + 0x10)

        // write to the pin via SIO
        ldr r3, =0xffffffff // all pins high
    loop:
        ldr r1, =GPIO_OUT
        str r3, [r1]
        mvn r3, r3

    sleep:
        ldr r0, =0
        ldr r1, =0x100000
    wait:
        add r0, r0, #1
        cmp r0, r1
        blo wait

        b loop
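For a feel of how long that spin loop takes, here is a back-of-the-envelope estimate. Only the iteration count (0x100000) comes from the listing. The cycle count per iteration (1 for add, 1 for cmp, 2 for the taken branch) and the clock frequency (a ring oscillator somewhere around 6.5 MHz when nothing has configured the clocks) are assumptions on my part, so treat the result as an order of magnitude.

```python
def delay_seconds(iterations, cycles_per_iteration, clock_hz):
    """Estimated duration of a counted spin loop."""
    return iterations * cycles_per_iteration / clock_hz


ITERATIONS = 0x100000          # from the listing
CYCLES_PER_ITERATION = 4       # assumption: add + cmp + taken blo
ASSUMED_CLOCK_HZ = 6_500_000   # assumption: unconfigured ring oscillator

if __name__ == "__main__":
    seconds = delay_seconds(ITERATIONS, CYCLES_PER_ITERATION, ASSUMED_CLOCK_HZ)
    print(f"roughly {seconds:.2f} s between toggles")
```

Under those assumptions the pin flips a bit less than twice a second, which is comfortably visible on an LED.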
I was pretty excited by this point and (a few GNU assembler directive syntax refreshers later) ready to start building. I still needed my "boot sector" to have a valid CRC32 like the real bootloader, so I decided to use the SDK's pad_checksum Python script to generate it. The script takes a file containing the raw data, pads it out to 256 bytes, computes and appends the CRC, and then writes out the result as a second assembler source file, one that just uses data directives, like this:
    // Padded and checksummed version of: main.bin

    .cpu cortex-m0plus
    .thumb

    .section .boot2, "ax"

    .byte 0x0b, 0x49, 0x08, 0x68, 0x0b, 0x4a, 0x53, 0x42, 0x18, 0x40, 0x08, 0x60, 0x0a, 0x49, 0x08, 0x68
    .byte 0x18, 0x42, 0xfb, 0xd0, 0x09, 0x48, 0x0a, 0x49, 0x08, 0x60, 0x0a, 0x48, 0x0a, 0x49, 0x08, 0x60
    ...
The generated source can then be assembled and linked to produce the actual binary.
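The pad-and-checksum step itself is small enough to sketch. This is not the SDK's script, just a minimal stand-in. It assumes the boot ROM's CRC32 is the MSB-first variant with polynomial 0x04c11db7, an initial value of 0xffffffff, and no final XOR (often called CRC-32/MPEG-2), and that the checksum is stored little-endian in the last four bytes. Both assumptions are worth checking against the boot ROM sources before relying on this.

```python
import struct


def crc32_mpeg2(data, seed=0xFFFFFFFF):
    """Bitwise MSB-first CRC32 (poly 0x04C11DB7, no reflection, no final XOR)."""
    crc = seed
    for byte in data:
        crc ^= byte << 24
        for _ in range(8):
            if crc & 0x80000000:
                crc = ((crc << 1) ^ 0x04C11DB7) & 0xFFFFFFFF
            else:
                crc = (crc << 1) & 0xFFFFFFFF
    return crc


def pad_and_checksum(code, total_size=256):
    """Zero-pad code to total_size - 4 bytes, then append its CRC32."""
    payload_size = total_size - 4
    if len(code) > payload_size:
        raise ValueError(f"code is {len(code)} bytes; at most {payload_size} fit")
    padded = code + bytes(payload_size - len(code))
    # Assumption: the checksum is stored as a little-endian 32-bit word.
    return padded + struct.pack("<I", crc32_mpeg2(padded))


if __name__ == "__main__":
    image = pad_and_checksum(bytes([0x0B, 0x49, 0x08, 0x68]))
    print(len(image), image[-4:].hex())
```

The resulting 256 bytes could be written out as .byte directives, as in the generated file, or linked in directly as a binary blob.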
The boot ROM exposes two ways of loading a program onto the flash over USB. The more popular route is through UF2: the Pico advertises a USB Mass Storage Class endpoint containing a FAT16 volume. Writing a specially formatted file to this volume causes bytes to be written to the flash. This is pretty clever but too fiddly for my liking. The ROM also exposes a second simpler USB endpoint (under the "vendor-defined" interface class), allowing more basic operations like reading/writing memories and rebooting. The sources call this "PICOBOOT", and there's a nice picotool program that allows interacting with it.
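For reference, the "specially formatted file" on the UF2 route is a sequence of 512-byte blocks, each carrying a small header, a chunk of payload, and a trailing magic number. The sketch below packs one such block. The magic numbers and field order follow my reading of the UF2 specification; the RP2040 family ID and the fixed 256-byte payload size are assumptions to verify against the boot ROM sources, and the function name is made up.

```python
import struct

UF2_MAGIC_START0 = 0x0A324655  # "UF2\n" when stored little-endian
UF2_MAGIC_START1 = 0x9E5D5157
UF2_MAGIC_END = 0x0AB16F30
UF2_FLAG_FAMILY_ID_PRESENT = 0x00002000
RP2040_FAMILY_ID = 0xE48BFF56  # assumption: family ID used for the RP2040
PAYLOAD_SIZE = 256             # assumption: payload bytes per block


def uf2_block(payload, target_addr, block_no, num_blocks):
    """Pack one 512-byte UF2 block carrying up to PAYLOAD_SIZE bytes."""
    if len(payload) > PAYLOAD_SIZE:
        raise ValueError("payload too large for one block")
    header = struct.pack(
        "<8I",
        UF2_MAGIC_START0,
        UF2_MAGIC_START1,
        UF2_FLAG_FAMILY_ID_PRESENT,
        target_addr,
        PAYLOAD_SIZE,
        block_no,
        num_blocks,
        RP2040_FAMILY_ID,
    )
    # The data area is always 476 bytes; unused bytes are zero.
    data = payload.ljust(476, b"\x00")
    return header + data + struct.pack("<I", UF2_MAGIC_END)


if __name__ == "__main__":
    block = uf2_block(bytes(256), 0x10000000, 0, 1)
    print(len(block), block[:4])
```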
These USB endpoints are only exposed if the BOOTSEL button is held while the device is powered up. (Otherwise, the ROM immediately proceeds with the normal boot process.) I got fed up with fiddling with the micro-B USB plug over and over, and I eventually thought to wire up a button to short the RUN pin to ground. Holding BOOTSEL while pressing my reset button made the board ready to receive an update.
picotool is able to write to both flash and SRAM, and it looks at the load addresses stored in the ELF metadata to decide which to write to. It's kinda cool to be able to write a non-persistent program directly into the SRAM. That also avoids the 252-byte limit imposed by the normal boot process. On the other hand, the program would disappear when the board is rebooted, and re-running it would require entering BOOTSEL again to upload it with picotool. So I wrote my program into flash.
That meant my ELF needed to list the base address of the flash as the start of my code:
$ ld -o blink.elf --section-start .boot2=0x10000000 pad_checksum.o
pad_checksum.o is the assembler output from the padded-and-CRCed "assembly" from pad_checksum. ".boot2" is the name of the ELF section it placed that source into. 0x10000000 is the base address of the flash; changing this to 0x20000000, for example, would load the program into SRAM instead.
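The idea of picking a destination from the load address can be sketched like this. It is an illustration of the concept, not picotool's actual logic. The two base addresses are the ones above; the window sizes (16 MB of addressable flash, 264 kB of SRAM) are assumptions from my reading of the datasheet.

```python
FLASH_BASE = 0x10000000            # from the text
SRAM_BASE = 0x20000000             # from the text
FLASH_WINDOW = 16 * 1024 * 1024    # assumption: size of the flash address window
SRAM_SIZE = 264 * 1024             # assumption: on-chip SRAM size


def load_target(address):
    """Classify a load address as 'flash', 'sram', or 'unknown'."""
    if FLASH_BASE <= address < FLASH_BASE + FLASH_WINDOW:
        return "flash"
    if SRAM_BASE <= address < SRAM_BASE + SRAM_SIZE:
        return "sram"
    return "unknown"


if __name__ == "__main__":
    for addr in (0x10000000, 0x20000000, 0x40014004):
        print(hex(addr), load_target(addr))
```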
It would be nice to claim that my program worked perfectly the first time, but... it didn't. And my spartan programming environment left me with very little in the way of direct debugging options. I don't own a SWD interface or anything fancy like that, and I didn't have any sort of serial output set up.
What I did have, though, was a reference working example in the form of the blink program in the "pico-examples" repository. Since all of my code was basically "write a value to an address", I could pretty easily replicate it in C, replacing something like
    ldr r0, =FUNCTION_SIO
    ldr r1, =GPIO0_CTRL
    str r0, [r1]
with
    *(volatile uint32_t *)GPIO0_CTRL = FUNCTION_SIO;
I built this quote-unquote C program using the standard CMake setup, and was a bit relieved to see that it worked. That meant I hadn't gotten anything wrong, per se, but rather I hadn't written enough code.
It was around this point I tried experimenting with various schemes to build my assembly file as part of the C program. I first tried calling into my code from main(), which worked fine. I then renamed my assembly symbol to "main" (and removed the C main). This didn't work, which was pretty perplexing, since it should have only been a single call instruction different from the first example.
In this case, it worked to my advantage that the bootloader and crt0 were all built from source and easily editable. After staring at assembly for what felt like hours, I eventually noticed that the C code that called my assembly used a plain bl to my symbol, i.e., a normal subroutine call, whereas the code that enters "main" from crt0 is
    ldr r1, =main
    blx r1
Aha: notice the x on the latter branch instruction. This is a "branch and exchange instruction set" instruction. Basically, if the address being jumped to has its least significant bit set (i.e., is odd), then when the branch is executed:
- the 1 is masked off, and
- the processor is switched into Thumb mode.
If the bit is instead unset, then the branch asks the processor to switch into regular ARM instruction mode. Note that in both cases, the actual location of the instructions is on the even address; the LSB is just a tag.
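The tagging scheme is easy to model. The helper below is a toy illustration of the rule just described, not of any particular core's implementation; the function name is made up.

```python
def decode_branch_target(value):
    """Split a bx/blx register value into (fetch_address, wants_thumb).

    The least significant bit is only a tag: it selects the instruction
    set and is masked off before instructions are fetched.
    """
    fetch_address = value & ~1
    wants_thumb = bool(value & 1)
    return fetch_address, wants_thumb


if __name__ == "__main__":
    for value in (0x10000001, 0x10000000):
        address, thumb = decode_branch_target(value)
        print(hex(value), "->", hex(address), "Thumb" if thumb else "ARM")
```

Both values fetch from the same even address; only the requested instruction set differs.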
I'd written my code as Thumb, matching how the SDK was configured to compile C code (and how its own assembly was written). This was fine for the bl generated by C, since the processor was already in Thumb mode.
Reading the GNU assembler documentation for ARM, I eventually found the .thumb_func directive, which sets the LSB of the next symbol. I hadn't been using .thumb_func. That meant the blx in the crt0 incorrectly tried to switch the processor to ARM mode before jumping to my Thumb instructions. As far as I can tell, the Cortex-M0+ only implements Thumb, so the likely result was a fault rather than my Thumb bytes actually being decoded as regular ARM. Either way, who knows what code I was running?
After adding .thumb_func, I was able to get my assembly program to work and blink the LED. But this still meant I was using the bootloader and runtime libraries I was trying to avoid.
From here, it was a trivial (though time-consuming) matter of iteratively disabling sections of the bootloader and runtime, rebuilding, and testing, until I could find what magic they contained that I could replicate.
Ultimately, the magic was this call in the SDK sources:
    // Remove reset from peripherals which are clocked only by clk_sys and
    // clk_ref. Other peripherals stay in reset until we've configured clocks.
    unreset_block_wait(RESETS_RESET_BITS & ~(
            RESETS_RESET_ADC_BITS |
            RESETS_RESET_RTC_BITS |
            RESETS_RESET_SPI0_BITS |
            RESETS_RESET_SPI1_BITS |
            RESETS_RESET_UART0_BITS |
            RESETS_RESET_UART1_BITS |
            RESETS_RESET_USBCTRL_BITS
    ));
This was enough of a hint to go back over to the RP2040 datasheet, which spelled things out quite clearly:
Every peripheral reset by the reset controller is held in reset at power-up. It is up to software to deassert the reset of peripherals it intends to use. Note that if you are using the SDK some peripherals may already be out of reset.
In my case, the peripherals that I needed to "un-reset" (i.e., turn on) were the two blocks holding the registers that control the GPIO pins: the I/O bank and the pad bank. Like the rest of the program, this mostly just boiled down to writing a few well-chosen values to well-chosen addresses. In this case, in addition to writing a value to pull those blocks out of reset, I then needed to read a register to see if the unreset had completed. The datasheet says:
This allows software to wait for this status bit in case the peripheral has some initialisation to do before it can be used.
I have no idea if that guidance practically applies to this block, but it seemed like a good idea anyway.
    .set RESETS_BASE, (APB_BASE + 0xc000)
    .set RESETS_CTRL, (RESETS_BASE + 0x0)
    .set RESETS_PADS_BANK0, (1 << 8)
    .set RESETS_IO_BANK0, (1 << 5)
    .set RESETS_DONE, (RESETS_BASE + 0x8)

    unreset:
        // take RESETS_PADS_BANK0 and RESETS_IO_BANK0 out of reset
        ldr r1, =RESETS_CTRL
        ldr r0, [r1]
        ldr r2, =(RESETS_PADS_BANK0 | RESETS_IO_BANK0)
        bic r0, r0, r2
        str r0, [r1]

    unreset_check_loop:
        ldr r1, =RESETS_DONE
        ldr r0, [r1]
        tst r0, r2
        beq unreset_check_loop
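The bit arithmetic in that listing can be checked on the host. The bit positions come from the listing; the example power-up value of the reset register (all 25 peripheral bits set) is an assumption for illustration. One detail worth noting: the tst/beq pair in the listing leaves the loop as soon as either status bit reads back as set, whereas the helper below checks for both, which is a slightly stricter variant.

```python
RESETS_PADS_BANK0 = 1 << 8   # from the listing
RESETS_IO_BANK0 = 1 << 5     # from the listing
UNRESET_MASK = RESETS_PADS_BANK0 | RESETS_IO_BANK0


def clear_reset_bits(reset_value, mask):
    """Mirror of the bic instruction: clear the masked bits, keep the rest."""
    return reset_value & ~mask & 0xFFFFFFFF


def all_done(reset_done_value, mask):
    """True once every masked peripheral reports that its reset is done."""
    return (reset_done_value & mask) == mask


if __name__ == "__main__":
    assumed_power_up_value = 0x01FFFFFF  # assumption: every block held in reset
    print(hex(UNRESET_MASK))
    print(hex(clear_reset_bits(assumed_power_up_value, UNRESET_MASK)))
    print(all_done(0x020, UNRESET_MASK), all_done(0x120, UNRESET_MASK))
```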
With this addition, I could rebuild my program in my minimal, no-bootloader, no-runtime environment, and test it. And, finally, it worked!
Sure—it wasn't exactly the most efficient way to get to a blinking LED. But that wasn't the point anyway. The process taught me a bunch about the Pico hardware, and in the end I was pleased to know that the only code the chip was running was instructions I'd hand-coded.
I put all the code on GitHub.
This is of course a lot like the IBM PC disk boot process (still in use today on some Intel machines), where the ROM will read the first 512 bytes from disk, which must end with the signature 55 aa.
Both the bootloader and ROM are open-source, and remarkably legible as open-source code goes, which was incredibly helpful for figuring this all out!
Well, 252, since you lose four bytes to the CRC32.
It's a little unfortunate that the bootloader is necessary at all on these boards, given that the flash can be directly executed from. You could imagine a different setup where the ROM simply initializes the flash, points the VTOR register to the new vector table, and jumps into the flash. I think the reason it's not done this way is to allow for different kinds of flash. The bootloader code has several different versions of its flash configuration code, for different kinds of flash chips.
The flash chip on my board (in my case, a tinier-than-tiny Winbond W25Q16JV) is not going to change, so a nicer design would have encoded that onto the ROM. That would have made things more complicated for third-party boards though.
The "boot2" section name is a reference to the normal usage of this script, which is to build the standard Pico bootloader image. I suppose the first stage is the ROM itself.
Without actually moving the instructions. The LSB of the program counter is always ignored by the CPU anyway. (Or, equivalently, it doesn't even exist, and the value is shifted left once during fetch. Not sure which.)
Except the ROM, and the microcode. Oh well...