Reducing STM32 SPI HAL Latency

As part of my latest EOGee design, I have four devices on a single SPI bus: two 12-bit ADCs and two 12-bit DACs. Each of them operates at 30kHz, which means that every 33us I have to send 16 bits of data to, or receive 16 bits from, each of them.

STMicro provide a HAL library which makes sending SPI data really simple using their commands HAL_SPI_Transmit_IT and HAL_SPI_Receive_IT.

So in theory it should be as simple as calling these functions four times sequentially, every 33us. With a clock speed of 10MHz, each 16-bit transaction should take about 1.6us, so four of them would easily fit inside the 33us. However, it turns out that the HAL library is very inefficient when it comes to small transactions.

Screen capture showing the timing of the SPI transaction, using the HAL library

The above screen capture shows two traces. The yellow trace shows the SPI clock, SCLK. The green trace shows a GPIO output that goes high when my code dispatches the next SPI packet, and goes low when it receives the transaction complete interrupt. (Due to lazy probing there is a lot of noise between the channels, which can be ignored). What we can see is that the transaction does indeed take about 1.6us, however there is a delay of 5us before the data is sent, and another delay of 6us after the data is sent but before I receive the interrupt. This means each packet takes about 13us which is too long if I am to send four of them every 33us.
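The arithmetic behind that conclusion can be sketched in a few lines of C. This is only an illustration: the helper names (frame_time_ns, sample_period_ns, fits_in_period) are made up for this example, and it assumes the four transactions run strictly one after another with no overlap.

```c
#include <assert.h>
#include <stdint.h>

/* Time one frame spends on the wire, in nanoseconds. */
static uint32_t frame_time_ns(uint32_t bits, uint32_t sclk_hz)
{
    assert(sclk_hz != 0);
    return (uint32_t)(((uint64_t)bits * 1000000000ULL) / sclk_hz);
}

/* Length of one sample period, in nanoseconds (rounded down). */
static uint32_t sample_period_ns(uint32_t sample_rate_hz)
{
    assert(sample_rate_hz != 0);
    return 1000000000UL / sample_rate_hz;
}

/* 1 if `count` back-to-back transactions of `each_ns` fit in one period. */
static int fits_in_period(uint32_t count, uint32_t each_ns,
                          uint32_t sample_rate_hz)
{
    return (uint64_t)count * each_ns <= sample_period_ns(sample_rate_hz);
}
```

With the numbers from this post, frame_time_ns(16, 10000000) gives 1600ns and sample_period_ns(30000) gives 33333ns, so each of the four transactions has roughly 8.3us available. At the measured 5 + 1.6 + 6 = 12.6us per packet, four packets need about 50us, and fits_in_period(4, 12600, 30000) returns 0.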

So why is there so much delay? I haven’t looked deeply into this, but firstly it is clear that the HAL drivers do a lot more than simply send the data to the SPI peripheral. They also keep track of the state of the bus, handle multiple different modes, can perform CRC checks on the data, etc. This creates an overhead which isn’t always necessary and is very wasteful for small transactions. The data also isn’t sent right away; the HAL driver actually enables the SPI bus and then waits for an interrupt requesting more data, rather than sending the data upfront. All of these things add up to create a large delay which in my case is unacceptable.

It was clear that I needed to deviate from the HAL library to get extra speed, but the fact is that the HAL library actually does a lot of useful work as well. So not only does my solution fix this issue, it also makes use of as much of the HAL library as possible.

My code uses the same initialisation functions as the HAL SPI library, which means that you can use STM32CubeMX to configure the peripheral and generate the initialisation code. The only part I have replaced is the transmit/receive function and the interrupt functions. I have also added an extra initialisation step.

void fast_spi_initialise() {
	// Enable only the RX interrupt as this will be thrown after 16 bits have been transmitted
	__HAL_SPI_ENABLE_IT(&hspi1, SPI_IT_RXNE);
	// Enable the SPI. Nothing should happen until we call the fast_spi_transmit function
	__HAL_SPI_ENABLE(&hspi1);
}

The fast_spi_initialise function should be called after the HAL SPI initialise function and only does two things. Firstly, it enables the SPI peripheral, which means that the peripheral is constantly waiting for data in its TXFIFO, ready to send it. Secondly, it enables only the RXNE (Receive Not Empty) interrupt, which triggers only when data is received. This is important because this interrupt will be triggered when 16 bits of data (our packet size) have been received. Because we are operating a full-duplex bus, whenever we are receiving we are also transmitting, so this interrupt can be used as a general “packet complete” interrupt.

void fast_spi_transmit(uint16_t data) {
	// coming into this function we assume the SPI is not busy and also that the SPI is enabled
	hspi1.Instance->DR = data;
}

The transmit function is only one line which writes the data into the SPI peripheral data register. Because we enabled the peripheral in the initialisation function, the SPI peripheral is always ready to send data as soon as it is available.

We can also use this transmit function as a receive function. As I mentioned, because we are operating a full-duplex bus, we cannot transmit without also receiving, so we can just send some garbage data to trigger a receive.

Next is to handle the interrupt. This requires modifying some existing code.

void SPI1_IRQHandler(void) {
	uint16_t data = hspi1.Instance->DR; // Read the latest 16 bits from the RXFIFO
	fast_spi_rxcallback(data); // send it back to main.c
	return; // We don't want to call the HAL SPI library callbacks so we return before them
	HAL_SPI_IRQHandler(&hspi1); // existing generated call, now never reached
}


Here we are inserting some code into the SPI1_IRQHandler function which is called whenever a SPI interrupt is received. Because we only enabled the RXNE interrupt, whenever this triggers we know it must be because we have received some data. At this point we read the contents of the RXFIFO to reset the interrupt and then call our own interrupt function, passing in the data. Finally, we make sure to return from the function before the HAL_SPI_IRQHandler is called so that we bypass the HAL SPI library.

void fast_spi_rxcallback(uint16_t data) {
	// Do whatever you want
}

The final piece is the user defined callback that is called from the SPI1_IRQHandler function. This is where the user should define what they want to do with the data that was received.
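As one possible way to use the callback, the sketch below chains the four 16-bit transfers of a sample period from inside fast_spi_rxcallback, sending dummy words to the ADCs and real values to the DACs. The post does not say how the transfers are sequenced or how chip selects are driven, so the device order, chip_select, chip_deselect, sample_tick and everything prefixed mock_ are assumptions made for illustration. To keep the example runnable on a PC, fast_spi_transmit is replaced here by a host-side stand-in; on the target it would be the one-line register write shown earlier.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_DEVICES 4
#define DUMMY_WORD  0x0000u   /* clocked out only to trigger a receive */

/* Hypothetical device order on the bus: two ADCs, then two DACs. */
enum { DEV_ADC0, DEV_ADC1, DEV_DAC0, DEV_DAC1 };

/* ---- Host-side stand-ins for the hardware (assumptions) ---- */
static uint16_t mock_slave_reply[NUM_DEVICES] = { 0x0123, 0x0456, 0, 0 };
static uint16_t mock_last_tx[NUM_DEVICES];
static int      mock_selected = -1;   /* -1 means no chip select asserted */
static int      mock_irq_pending;
static uint16_t mock_rx;

static void chip_select(int dev) { mock_selected = dev; } /* GPIO low on target */
static void chip_deselect(void)  { mock_selected = -1; }  /* GPIO high on target */

/* On the target this body is just: hspi1.Instance->DR = data; */
static void fast_spi_transmit(uint16_t data)
{
    assert(mock_selected >= 0);            /* a device must be selected */
    mock_last_tx[mock_selected] = data;
    mock_rx = mock_slave_reply[mock_selected];
    mock_irq_pending = 1;                  /* RXNE would fire after 16 clocks */
}

/* ---- Sequencer ---- */
static volatile int      seq_current = NUM_DEVICES; /* NUM_DEVICES means idle */
static uint16_t          dac_out[2];                /* words to send to the DACs */
static volatile uint16_t adc_in[2];                 /* raw frames from the ADCs */

static void start_device(int dev)
{
    uint16_t word = DUMMY_WORD;            /* ADCs only need clocks */
    if (dev == DEV_DAC0) word = dac_out[0];
    if (dev == DEV_DAC1) word = dac_out[1];
    seq_current = dev;
    chip_select(dev);
    fast_spi_transmit(word);
}

/* Call once per sample period (every 33us), e.g. from a timer interrupt. */
static void sample_tick(void)
{
    if (seq_current == NUM_DEVICES)        /* previous cycle has finished */
        start_device(DEV_ADC0);
}

/* Packet complete: store the result, then start the next device. */
static void fast_spi_rxcallback(uint16_t data)
{
    chip_deselect();
    if (seq_current == DEV_ADC0) adc_in[0] = data; /* decode per ADC datasheet */
    if (seq_current == DEV_ADC1) adc_in[1] = data;
    if (seq_current + 1 < NUM_DEVICES)
        start_device(seq_current + 1);
    else
        seq_current = NUM_DEVICES;         /* back to idle */
}

/* Host only: deliver the simulated RXNE interrupts until the bus is idle. */
static void mock_run(void)
{
    while (mock_irq_pending) {
        mock_irq_pending = 0;
        fast_spi_rxcallback(mock_rx);
    }
}
```

On the host, setting dac_out, calling sample_tick() and then mock_run() walks through all four devices in order. On the target, mock_run disappears because the real RXNE interrupt calls fast_spi_rxcallback, so each transfer starts as soon as the previous one completes without any polling in the main loop.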

Screen capture showing the timing of the SPI transaction, using the optimised functions

After doing all of this, we can see that the overhead is significantly reduced. The overall time for one packet to be transmitted is reduced to about 3us, compared to the 13us initially. This is more than sufficient for my needs, so I have not looked closer to see if it could be reduced further.

The changes I have described here only apply to the case where the STM32 is operating as a full-duplex, master SPI device. However, similar optimisations could probably be made for most bus architectures by reading the reference manual and understanding how the SPI peripheral works. It is important not to mix these fast functions with the original HAL library transmit/receive functions, because each set assumes that it is the only one using the bus.
