Embedded Python: Cranking Performance Knob up to Eleven! 🐍🥇

How to speed up critical code in your CircuitPython and MicroPython projects and fix performance bottlenecks

10 min readOct 6, 2018

In this post, you are going to learn how to speed up your embedded python code through a real-life example. The post starts with the background story, telling what I wanted to optimize and why. I think it is interesting and even adds some value to the technical part, but if you want to skip it and dive directly into the performance optimizations I did, simply skip to the “Implementing SPI in Python” section.

The Background Story

Electronic DIY conference badges are apparently a thing now. When I started creating PCB art, I knew that one day I will use this technique to create a conference badge. But I couldn’t imagine how quickly this scene was going to explode. Some guy even used my guide for creating this beautiful thing:

Then, a few weeks ago, I was approached by 2 friends who wanted my help in creating a badge for a conferences we will be attending. After some brainstorming, we decided try to use a mesh-capable hardware and implement some mesh network between all the attendees in the conferences, powered by their badges.

When we shared this idea with the conference organizers, this sparked an neat idea in their brain: Let’s have a silent (headphone) party with all the badges! 🔊 🎧🎉

Basically, one server would stream audio data over the air through the mesh network, and all the attendees will tune-in by plugging headphones to their badges. We loved this idea and immediately jumped to researching its feasibility.

CircuitPython 🐍⚡

We decided to go with the nRF52840 chip, as it is energy efficient, and support both Bluetooth Mesh and OpenThread mesh. In addition, it runs CircuitPython, which means the attendees will be able to customize their badges by writing Python scripts and saving them to the badge through USB.

CircuitPython is a MicroPython clone, developed by adafruit industries. Many parts of the code base are very similar, but the Python APIs are somewhat different. It runs on a variety of boards, including the nRF52840 USB Dongle, which sells for $10 a piece. We will need a fair amount of devices for testing, so this price point is just perfect for us.

Let There be Sound! 🔊

After considering several alternatives, we decided to go with a VS1053 chip, as it can decode both MP3 and OGG formats, and stream them directly to headphones. I ordered a breakout board for this chip, and connected it to my nRF52840 Dongle and a small speaker:

My setup: The nRF52840 Dongle, VS1053 breakout and a small speaker

I sat down and wrote a small Python driver for the VS1053, so I could send files from my code. A few hours later, I could play MP3 files from the Flash memory! What a joy! 😃

Next step — streaming the MP3 data over Bluetooth. This is where the problems started. The VS1053 communicates using the SPI protocol (more on that soon), and apparently CircuitPython has a bug where using hardware SPI while Bluetooth is on crashes the CPU. Nasty. I tried using their software SPI module, but it also didn’t work for some reason.

In hope to complete this small proof-of-concept quickly, I decided to hack together my own software SPI implementation in Python.

Implementing SPI in Python

SPI is a pretty simple protocol — there is a master (in this case, the chip my code was running on), and a slave (the MP3 decoder circuit). When sending data to a slave (which was what I was interested in), there are two relevant data pins: SCLK (the clock), and MOSI (Master-Output-Slave-Input).

Sending a byte to the slave simply involved shifting the bits out the MOSI pin one by one, while toggling SCLK low before shifting the next bit, and then back high as soon as the bit is a available. This bits are usually sent in MSB first order (highest bit first). This is what it looks like in pseudo code:

For each bit B in value_to_send:
    SCLK ← 0
    MOSI ← B
    SCLK ← 1

When converted to CircuitPython dialect, the code looked like:

Lines 3–6 implement the pseudo code above

I saved my software SPI implementation to the USB drive, and quickly ran the code that plays the MP3 file from the Flash memory, but the result was not pleasing — the sound was playing very very slowly and choppy. I tested it with a 8.4 seconds audio file, and it took the code 128 seconds to send it over SPI. That was not even close to workable. Disappointed from the results, I decided to call it a day.

The Hunt for Performance

The next day, I told my friends about my failed experiment with the Software SPI, when one of them pointed me at an amazing talk by Damien George, the creator of: Writing fast and efficient MicroPython. I watched it today, and decided to apply the tricks presented there, hoping to achieve an acceptable performance for my Python SPI implementation.

I started by reusing the same memory buffer when reading the audio file to memory, instead of reallocating a new buffer on each read — but it didn’t improve the performance significantly. The bottle neck was in the sendByte() method above — the code in the loop runs several Python instructions for each and every bit in the input data. In other words, sending the file through SPI causes each line there to execute over one million times!

Loop Unrolling ➰

Loop unrolling is a well known trick when trying to squeeze more performance out of your code: simply remove loops and repeat the instructions by hand:

Then code became longer and somewhat harder to follow, but the execution time went down to just 101 seconds! Hooray 🎺

The Native Magic ✨

Another quick win mentioned by Damien was simply adding a @micropython.native decorator for the sendByte() method. Unfortunately, this didn’t seem to work on CircuitPython, as it shouted at me invalid micropython decorator. A quick search through the CircuitPython source code led to finding the culprit: the Native functionality was not enabled for the nRF port. I went to the configuration file for this port, changing it to read:

#define MICROPY_EMIT_THUMB (1)

I recompiled the firmware, uploaded it, finger crossed 🤞 !

No crash 💥, code is running… it suddenly recognizes the decorator, and the execution time goes down to 49.5 seconds. Sometimes a single line of code can do wonders!

Avoiding Expensive Lookups

Damien also explained that any code that references self, a global variable, or a value inside an object is expensive, as the python run time has to go and look for this variable in a dictionary. He advised to save any repeating references in a local variable to speed up things. Thus, I tried to cache self.clk and self.MOSI at the beginning of the function:

This time, the code that plays the file ran in just 34 seconds. Still far from smooth playback (mind that the audio file is 8.4 seconds long, so we need to send it faster than that), but the trend looks promising!

Note that this Native Decorator comes with some limitations — for instance, you can’t use with statements inside the decorated function.

The “Killer” Feature — Viper! ☠🐍

So far, all of the changes we did were in python land. If you know Python, the syntax should look familiar. Viper is another decorator that you can use to speed up your code — but this time, you have to rewrite it in a hybrid dialect of Python that has types(!) and pointers. It imposes even more limitations: the optimized functions can’t receive more than 4 arguments, and these arguments can’t have any default values.

Luckily, in our case, sendByte() has only one argument with no default value, so this is not a real problem. But simply changing the decorator on top to read @micropython.viper doesn’t do the trick. The code compiles, but we get an error on Runtime:

File "vs1053.py", line 86, in writeRegister
  File "main.py", line 75, in write
AttributeError: 'DigitalInOut' object has no attribute 'value'

So controlling the output of these pins using the value attribute doesn’t work in Viper mode. CircuitPython has another way of changing the output value of pins, the switch_to_output() method, but using this method actually made things worse — the execution time went back up to 45 seconds. Not good :/

Going Low-Level 👩‍💻

Luckily, Viper mode holds another trick under its sleeve — it lets us talk directly to the hardware by using pointers and memory addresses, thus bypassing all the abstraction layers included in the platform.

However, in order to control these GPIO pins directly, we need to find the memory addresses that control them. For that, I consulted the datasheet for the nRF52840 CPU, where I found the GPIO registers in section 6.9, titled “GPIO — General purpose input/output” (page 141).

Section 6.9.2 (page 143) lists the memory addresses for the GPIO registers. the nRF82540 has two sets of GPIO pins, called GPIO0 and GPIO1 respectively. In my circuit, I have the following connections:

SCLK connected to P0_22 (pin number 22 on GPIO0)
MOSI connected to P1_00 (pin number 0 on GPIO1)

Thus, I needed to control both GPIO ports. Every port has a base address listed in the datasheet: GPIO0 is at 0x50000000, and GPIO1 is at 0x50000300.

When you want to ask the hardware to perform an action on a GPIO pin (or query it), you do it by writing to or reading from a register. In our case, there are two interesting registers:

OUTSET, at offset 0x508, which turns an output pin high (digital value 1)
OUTCLR, at offset 0x50c, which turns an output pin low (digital value 0)

How do you write to this registers? First, you add the base address of the GPIO port to the offset of the register. Assume that we want to write the value 1 to a pin on GPIO1. We can do it by adding the base address of GPIO1, 0x50000300, with the offset of the OUTSET register, giving us the value
0x50000808. This is the memory address we need to write to in order to set pins on GPIO1 to high.

To choose which specific pin(s) you want to turn on, you write a value with the respective bit sit. So if you want to set the first pin, you will write a value with the leftmost bit left, or 0x01. The the 4th pin, you want to 4th leftmost bit, so that’d be 0x08. To get this value in python, you’d right: 1 << pin, where pin contains the index of the target pin, starting from 0.

To conclude, in order to set MOSI (P1_00) high, you write the value 1 << 0 at the memory address 0x50000808. To set it low, you write the value 1 << 0 at the memory address 0x5000080c (this is the OUTCLR register — note the c at the end).

Similarly, setting SCLK (P0_22) high would involve writing 1 << 22 to 0x50000508 (that’s OUTSET of GPIO0), and low by writing the same value to 0x5000050c (that’s OUTCLR of GPIO0).

In Python code that’d be:

Writing directly to hardware registers in CircuitPython and MicroPython

You can read more about ptr32 in Viper’s documentation. Basically, the [0] subscript tells python that you want to write to / read from the memory address pointed by this object.

I changed my sendByte() method to take advantage of the above commands:

Note that at this point, the function has become virtually unreadable, a maintainance nightmare, and it is also now tied to the specific pins I used. But does it at least run faster?

This time, the audio file was sent through SPI in just 11.7 seconds. Nearly there, but still not enough for a smooth playback. Luckily, though, we can repeat the trick with caching references. Instead of repeating all these ptr32(...) expressions, we can save them to a local variable, thus saving precious time:

This code even looks somewhat more readable. I saved it to the CircuitPython USB drive, the code ran — and the sound played smoothly! 🎉🎊

Hooray, this time sending the audio file over SPI took mere 6.52 seconds, which means I have had some spare CPU cycles for the Bluetooth radio to stream the MP3 data into my code.

At this point, I decided to bring back the loop, in favor of code clarity:

It brought the time up to 7.08 seconds — slower, but still fast enough for my use-case, but the code is much easier to follow.

It worked! the MP3 file played flawlessly using the optimize software SPI code

Great Success!

I started the day with a software SPI implementation that consumed 128 seconds of CPU time to send a small audio file to the player, and went through several steps of optimization, until eventually I optimized the bottleneck option enough to stream the same file in just 7 seconds and get smooth playback.

I could probably spend more time optimizing the other functions in my code, perhaps also using the Inline Assembler, but at this point I focused back on my original goal: seeing if I could stream the audio from my computer to the mp3 player circuit over BLE (Bluetooth Low Energy) using CircuitPython. The new software SPI implementation allowed my code to speak to the player even while Bluetooth was active.

I encountered another roadblock as the Bluetooth bandwidth was not sufficient to stream the data fast enough (at around 2.6kb/sec), which I worked around by fiddling with the MTU. But that’s a story for another time.

Long story short: It worked! 😊

When I played the audio file from the Flash memory, I could only hold a short clip, as there was no much space on Flash. However, now, when I had the ability to stream, I was finally able to stream the entire file to the device, over the air, from my computer:

Ariella Eliassaf tells my computer to stream an MP3 file to the CircuitPython code, and it plays! 🎶

I posted the final implementation of the software SPI, as well as a GitHub repo containing the complete code I used (yes, including the BLE streaming part).

p.s. Wondering where this audio file comes from? It is a short promo I recorded for one of our Salsa parties, back when we used to have a Salsa dancer club. 🕺💃

This is my 6th post in my Postober challenge — writing something new every single day throughout October.

I will tweet whenever I publish a new post, promise! ✍