The Microscope for Embedded Code: How Tracealyzer Revealed Our Bug

Sep 23, 2023

Tracealyzer. You can’t spend long in the wonderful world of debugging and profiling code without hearing the name. Percepio’s website compares it to an oscilloscope for embedded code: use it to peek deep inside your code and see what it does. Of course, the code receives an interrupt and checks a CRC before sending the data over SPI, but how does it do it? And how long does it take?

Step one: Add Percepio’s code to your project. You will have to modify your code, but it is simple enough to handle with a few #ifdefs, and it can easily be kept separate from production code. The overhead on the target itself seems minimal, with the host computer doing the heavy lifting.
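If you’re wondering what that looks like in practice, here is a minimal sketch, assuming a FreeRTOS project using Percepio’s TraceRecorder library. The board bring-up and task-creation functions are placeholders, and the exact recorder init call can differ between library versions.

```c
/* main.c - a sketch of gating the trace recorder behind a build flag.
 * Assumes FreeRTOS with Percepio's TraceRecorder library; hardware_init()
 * and create_application_tasks() are hypothetical placeholders, and the
 * init call (vTraceEnable here) may differ between recorder versions. */
#include "FreeRTOS.h"
#include "task.h"

#ifdef USE_TRACEALYZER
#include "trcRecorder.h"
#endif

extern void hardware_init(void);            /* hypothetical board bring-up */
extern void create_application_tasks(void); /* hypothetical task creation */

int main(void)
{
    hardware_init();

#ifdef USE_TRACEALYZER
    vTraceEnable(TRC_START);    /* start recording before the scheduler runs */
#endif

    create_application_tasks();
    vTaskStartScheduler();      /* never returns if the tasks were created */

    for (;;) { }
}
```

Build the tracing variant with the flag defined (for example `-DUSE_TRACEALYZER`) and leave it out for production, and the recorder never touches your release binary.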

Step two: Run it. You’ll need a large screen for this if you want to see everything; a lot of data is available.

We ran it on a project. It confirmed what we already suspected: processor load was close to 80%, but we didn’t have time to delve into the code to see what was hogging those precious cycles. And we didn’t need to. Without even trying, we noticed that one thread was taking far too much time. Within a few minutes, we saw why: the thread in question wasn’t correctly set up to tell the scheduler it had finished. It’s a quick fix, but only if you know where to look.
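We won’t reproduce the offending code here, but one common shape of this bug, sketched below on FreeRTOS with made-up names, is a task that does its work and then never blocks, suspends, or deletes itself, so the scheduler keeps handing it the CPU.

```c
/* A sketch of the general bug pattern, not the project's actual code.
 * A one-shot task that finishes its work but never blocks, suspends, or
 * deletes itself keeps getting scheduled and quietly burns CPU time. */
#include "FreeRTOS.h"
#include "task.h"

extern void do_initialisation_work(void);   /* hypothetical work function */

void one_shot_task(void *params)            /* hypothetical task */
{
    (void)params;

    do_initialisation_work();

    /* Before the fix, the task fell through into a loop that did nothing
     * useful but still consumed its time slices. Telling the scheduler
     * the task is finished hands the CPU back to the other threads; a
     * periodic task would block with vTaskDelayUntil() instead. */
    vTaskDelete(NULL);
}
```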

Dropping client connections

Then came the big problem. From time to time, something went wrong: clients were disconnected, connections were refused, and no debug output was available. We immediately suspected that something was eating up CPU cycles, but again, we didn’t know where to look. We spent days with a debugger, stepping through functions and checking return codes. Going through the code step by step, we knew the logic was sound, but we needed to understand the timing involved. And so we ran it through Tracealyzer.

The board booted up and came to life. The LEDs blinked, the serial port printed out some standard boot-up information, and everything worked well. And by its side, Tracealyzer did what we wanted it to. Bar graphs showed the boot-up cycle. We could see that a few functions were taking too long to start up, but that would be a job for another day. It wasn’t the bug we were tracking down right now.

The system stabilized. Bar graphs showed the context switches between tasks, and the idle task kept being called. Everything was stable, happily running as intended.

And then it happened. Connections were dropped, and things started to go downhill. A few seconds later, we could connect again, and the system came back up, running as if nothing had happened. This was all we needed. The fault had been seen, and now we needed to know why.

Two and a half seconds of silence

We stopped the system and looked closer at the traces. It’s like playing with a microscope, looking at your fingertip or a leaf’s pores: the more you look, the more you understand. The scheduler had asked to start a task, but another, higher-priority task was busy, so it waited patiently. We saw the tasks messaging each other and sending data to and from the datastore. The certificate was received and sent to another task. The scheduler wanted the task to run, but a grayed-out zone on the screen showed that the task had to wait for a few milliseconds; the scheduler wouldn’t let it run. Not yet. Then it started. In the background, the communications task also wanted to start, but it didn’t. The debug task didn’t start either. In all, four or five tasks were waiting to start but couldn’t. And then we lost communication with the cryptography task. It went silent for two and a half seconds before running again and finally leaving a bit of room for its fellow threads. The board coughed and wheezed but caught up and ran perfectly again. Interesting.

Let’s put this into context. The application was running on a Cortex-M33 with TrustZone, and we were debugging the Non-Secure zone. When we lost communication with the task, the system clock continued to increment, but nothing else happened. That immediately told us that whatever was happening was happening in the Secure zone. We could see the Non-Secure side calling something inside the Secure zone, and then the Secure side handing control back. Whatever was going on, it was happening inside the Secure zone.
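For readers unfamiliar with TrustZone, here is roughly what such a boundary crossing can look like, sketched with ARM CMSE and invented function names; we obviously can’t show the real Secure API.

```c
/* A sketch of a Non-Secure call crossing into the Secure zone on a
 * Cortex-M33 with TrustZone, using ARM CMSE. The function names and
 * signature length are made up for illustration only. */
#include <arm_cmse.h>
#include <stdbool.h>
#include <stdint.h>

extern bool do_verification(const uint8_t *sig, uint32_t len); /* hypothetical */

/* Secure-side code, compiled with -mcmse. The attribute places a veneer
 * in the Non-Secure Callable (NSC) region so the Non-Secure world can
 * call it. While this function runs, a Non-Secure trace only sees the
 * system clock ticking and nothing else - exactly what we observed. */
__attribute__((cmse_nonsecure_entry))
bool secure_verify_signature(const uint8_t *sig, uint32_t len)
{
    /* The keys never leave the Secure zone; only the verdict comes back. */
    return do_verification(sig, len);
}
```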

Millions of cycles wasted

We knew the function name, and we knew how to reproduce the problem. Occasionally, we would receive an erroneous signature, which would be compared against keys hidden deep inside the Secure zone, safely stowed away. Sending this particular faulty signature forced the Secure zone into an impossible mathematical calculation, burning hundreds of millions of cycles until a timeout kicked in, gently reminding everyone that the operation would never complete. When it timed out, the return value simply said the signature was refused, and execution continued. We implemented a check on the Non-Secure side: if the signature was 0x00, don’t bother sending it to the Secure side; just fail. On the Secure side, now that we knew which function was being called, we found a default timeout of 2500 milliseconds, probably discussed during a coffee break months earlier and long since forgotten among the thousands of lines of new code the team had produced since.
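The Non-Secure check is nothing clever. Here is a sketch of the idea, assuming “the signature was 0x00” means an all-zero buffer, with illustrative names, a 64-byte signature length, and the hypothetical Secure entry point from the sketch above.

```c
/* A sketch of the Non-Secure side early-reject; names, the 64-byte
 * length and secure_verify_signature() are illustrative, not the
 * project's real API. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SIG_LEN 64u                     /* assumed signature length */

extern bool secure_verify_signature(const uint8_t *sig, uint32_t len);

static bool signature_is_all_zero(const uint8_t *sig, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (sig[i] != 0u) {
            return false;
        }
    }
    return true;
}

bool verify_signature(const uint8_t *sig)
{
    /* Reject the degenerate all-zero signature up front instead of
     * handing it to the Secure zone, where it used to trigger a
     * calculation that only ended when the 2500 ms timeout fired. */
    if (signature_is_all_zero(sig, SIG_LEN)) {
        return false;
    }

    return secure_verify_signature(sig, SIG_LEN);
}
```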

From a pure debugging and logic point of view, everything worked as planned. Certificates were tested and either given the green light or refused. Nothing told us that an impossible calculation would freeze the system for over two seconds, and at the time, nobody had thought of that as a possibility.

You need to understand what your code is doing, but sometimes you also need to see it run, see how it runs, and see if it stumbles. Tracealyzer is just the tool for that.

James Langbridge, CEO, TinyROM