FastSPI_LED2 is a rebuild from the ground up using multiple layers of components that, it turns out, may have usages outside of just the LED library! I’ve already described the FastPin library that drives FastSPI_LED2 at its lowest levels. Now it is time to talk about the next layer of components in the library, FastSPI, the piece that gives the library the SPI part of its name and functionality.

As with the Pin class, this portion of the library is, at the moment, optimized/tuned for writing data. A near-future revision will include support for reading as well. For now, though, if you want to pump data out over SPI without worrying about the specifics of hardware vs. software bitbanging, or pin twiddling, and you want it to be as performant as possible, this library may be useful to you. As an end user of the library, using the code is fairly simple. Here’s an example:

#include "FastSPI_LED2.h"

// data pin 11, clock pin 13, latch pin 10, speed id 0
SPIOutput<11, 13, 10, 0> SPI;

void setup() { SPI.init(); }

void loop() {
  // set up some data in a buffer: pData points at nBufferLen bytes
  SPI.writeBytes(pData, nBufferLen);
}

Pretty straightforward, no?  The template parameters to the SPIOutput definition are the data pin, clock pin, latch pin, and a speed identifier, which is a value from 0-255.  If you happen to select a set of pins that maps to the hardware SPI on the platform you are building for, the library will transparently use that; otherwise, the library will bitbang using the highest-performing options available to it from the Pin library discussed in the FastPin post.  (There is a third mode, which has been coded but not yet tested/debugged and is currently disabled: abusing the USART that’s available on some platforms in SPI mode to give you a different set of pin options for hardware SPI output.)  Of course, what it does under the hood involves a little bit more code, but code that attempts to pull the most out of the platform you’re building for, again saving you the trouble of having to deal with most of it.

If you instantiate SPIOutput with a set of pins that lines up with a known set of hardware SPI pins, then what actually gets instantiated is a class called AVRSPIHardwareOutput.  This is a class whose init function does the AVR SPI initialization with the SPCR/SPSR registers.  The writeBytes function is then basically a tight loop over your bytes of data, pushing out a byte at a time using the SPDR register.  The internal implementation also plays some games with delays to actually speed things up.  How does that work?  Simple: delaying a couple of cycles before checking the status register increases the odds that the check will succeed on the first pass through the loop, rather than the flag becoming ready partway through an iteration, costing one full trip around the loop where it isn’t needed.  Some of this tuning allows getting over 6.5Mbps of real-world data output.  I suspect it would be possible to pull even more performance out by playing some games with loop unrolling of various types; that will be investigated in future library revisions.
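To make the shape of that write loop concrete, here is a host-side sketch of it. This is not the library’s actual code: `MockSPI` stands in for the AVR’s memory-mapped registers (SPDR is the data register, and the SPIF bit in SPSR is the transfer-complete flag), and the mock is always “ready” so the model never blocks.

```cpp
#include <cstdint>
#include <vector>

// Host-side mock of the hardware SPI peripheral (illustrative only).
struct MockSPI {
    std::vector<uint8_t> sent;     // bytes the "hardware" has shifted out
    bool transferComplete = true;  // stands in for the SPIF flag; always ready here
    void write(uint8_t b) { sent.push_back(b); }  // stands in for SPDR = b
};

// Tight per-byte loop mirroring the structure described above: wait for the
// previous transfer to finish, then load the next byte into the data register.
// On real AVR hardware a short delay before polling helps the check succeed
// on its first pass.
void writeBytes(MockSPI& spi, const uint8_t* pData, int nBufferLen) {
    for (int i = 0; i < nBufferLen; ++i) {
        while (!spi.transferComplete) { /* spin; on AVR: while(!(SPSR & (1<<SPIF))); */ }
        spi.write(pData[i]);
    }
}
```

On real hardware the spin loop and the register write are the entire per-byte cost, which is why the byte rate gets so close to the SPI clock’s theoretical maximum.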

Where things get interesting in the code is with the software bitbang output.  The current version of the code can get over 2.5 Mbps data pushed out using software.  Here’s a rough description of the games played in the library.  If the code knows for sure that no other output is going to be going on while writing out SPI data (which is the default assumption, for the moment), then there’s a handful of specific optimizations that we can make, that are slightly different based on whether the clock/data pins are on the same port or separate ports.

First, we’ll look at the case where they’re on separate ports.  Before we start writing data out, there are six pieces of information that we cache:

  • the port address for the data pin (dataPort)
  • the port address for the clock pin (clockPort)
  • the value we want to write to the data port for a hi data value (dataHi)
  • the value we want to write to the data port for a lo data value (dataLo)
  • the value we want to write to the clock port for a hi clock value (clockHi)
  • the value we want to write to the clock port for a lo clock value (clockLo)

Once we have those, we iterate over all the bytes, and for each bit in each byte, there’s code that looks like the following:
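The effect of the caching can be modeled on a host machine. In this sketch the two “ports” are plain bytes, `DATA_MASK`/`CLOCK_MASK` are hypothetical pin masks, and the cached values play the roles of dataHi/dataLo/clockHi/clockLo from the list above; per bit, the inner loop does nothing but plain stores.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical pin masks for the model (not the library's values).
constexpr uint8_t DATA_MASK  = 1 << 3;
constexpr uint8_t CLOCK_MASK = 1 << 5;

// Shift one byte out MSB-first, returning the data-line level sampled at
// each rising clock edge so we can check the waveform.
std::vector<bool> shiftOut(uint8_t b) {
    uint8_t dataPort = 0, clockPort = 0;

    // The six cached pieces of information (ports are the variables themselves
    // here; on AVR they would be register addresses).
    const uint8_t dataHi  = uint8_t(dataPort |  DATA_MASK);
    const uint8_t dataLo  = uint8_t(dataPort & ~DATA_MASK);
    const uint8_t clockHi = uint8_t(clockPort |  CLOCK_MASK);
    const uint8_t clockLo = uint8_t(clockPort & ~CLOCK_MASK);

    std::vector<bool> dataLineOnClockHi;
    for (int bit = 7; bit >= 0; --bit) {
        // One plain store per pin state: no per-bit load/or/store needed.
        dataPort  = (b & (1 << bit)) ? dataHi : dataLo;
        clockPort = clockHi;                           // rising edge
        dataLineOnClockHi.push_back((dataPort & DATA_MASK) != 0);
        clockPort = clockLo;                           // falling edge
    }
    return dataLineOnClockHi;
}
```

Running `shiftOut(0xA5)` yields the bit pattern 1,0,1,0,0,1,0,1 on the data line, one level per clock pulse, which is exactly what an SPI receiver sampling on the rising edge would see.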

		if(b & (1 << BIT)) {
			Pin<DATA_PIN>::fastset(dataPort, dataHi);
		} else { 
			Pin<DATA_PIN>::fastset(dataPort, dataLo);
		}
		Pin<CLOCK_PIN>::fastset(clockPort, clockHi);
		Pin<CLOCK_PIN>::fastset(clockPort, clockLo);

So, for each bit, we do a check to see whether the bit is hi or lo. Then there is a write to the data pin with the appropriate precached hi/lo value, followed by a pair of writes to the clock port for the hi/lo clock values. Now, recall that the Pin library goes for direct port access when it can, allowing us to use the 1-cycle out opcode instead of the 2-cycle store opcode. If the Pin library can, then the value for the dataPort/clockPort fields is silently ignored (and the compiler will take care of removing all, now unnecessary, references to it). So, best case, for each bit we have the cost of the “is the bit set or not” check, and then 3 write operations, either at 3 cycles total if we can use out, or 6 cycles if we’re using st.

Why the dataHi/dataLo values? Well, the normal way to do something like this would be to say something like “dataPort |= Pin<DATA_PIN>::mask()” – basically or’ing hi the specific bit that maps to our pin. However, that operation requires either a 2-cycle load, the or, and a 2-cycle store, or, if we have direct out registers, a 1-cycle load, the or, and a 1-cycle store. For each bit. Ew. However, remember above when I made the assumption that no other pins were going to be written? If that is the case, and data and clock are on different ports, then I can pre-cache the values for |= (to set hi) and &=~ (to set lo). That gets rid of the need to do the load and the or for each bit. This dance is a little bit easier on the ARM platform where, in addition to the equivalent of AVR’s PORT register, there are also specific SET/CLEAR ports: you write a mask value to those ports, and only the 1s in the mask have their state set/cleared, the other pins get ignored, saving the need to do the load and the and/or.
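A minimal model of those ARM-style SET/CLEAR ports, with illustrative names (no particular vendor’s register map): writing a mask to SET raises only the 1-bits, writing it to CLEAR drops only the 1-bits, and all other pins are untouched, so the per-bit load and and/or disappear.

```cpp
#include <cstdint>

// Illustrative model of a GPIO port with separate SET/CLEAR registers.
struct GpioPort {
    uint32_t out = 0;  // current pin states

    // On real hardware these are single stores to dedicated registers;
    // the read-modify-write below happens inside the peripheral, atomically.
    void set(uint32_t mask)   { out |=  mask; }
    void clear(uint32_t mask) { out &= ~mask; }
};
```

With `port.out` starting at 0b1001, `port.set(0b0100)` gives 0b1101 and a following `port.clear(0b1000)` gives 0b0101: each call touches only the masked pins.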

So, we’re down to, best case, 2 clocks for the check/jump, and then 3 clocks for actually setting the pin values. However, this library isn’t called FastSPI for nothing. There’s a way we can still go faster, peeling off another cycle or two. How? Well, if both the clock and data pins are on the same port, then we only need to precompute and stash five values in registers:

  • the port register for the data/clock pins (dataPort)
  • the value for the register with the data pin hi and the clock pin hi (dataHiClockHi)
  • the value for the register with the data pin hi and the clock pin lo (dataHiClockLo)
  • the value for the register with the data pin lo and the clock pin hi (dataLoClockHi)
  • the value for the register with the data pin lo and the clock pin lo (dataLoClockLo)

(Beginning to see where this is going to go?)  Now, with these five values stashed away in registers, we have code that looks like this:

		if(b & (1 << BIT)) {
			Pin<DATA_PIN>::fastset(dataPort, dataHiClockHi);
			Pin<DATA_PIN>::fastset(dataPort, dataHiClockLo);
		} else { 
			Pin<DATA_PIN>::fastset(dataPort, dataLoClockHi);
			Pin<DATA_PIN>::fastset(dataPort, dataLoClockLo);
		}

Now, for each bit we have our couple of cycles for checking whether the bit is set or not, then only -two- writes total: one that sets the data pin hi/lo appropriately and raises the clock, and one that keeps the data pin set appropriately and drops the clock lo.  Combine some creative (ab)use of templates and loop unrolling to minimize as much overhead as we can, and we get quite a bit of performance squeezed out here.  There is still some room for me to squeeze out even more.  However, 2.7Mbps is a nice level to have hit for the moment, and there were other pieces of the library that needed love.
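The same-port trick can be modeled on a host machine too. As before, this is a sketch with hypothetical pin masks, not the library’s code; the four precomputed “register images” cover every (data, clock) combination, so each bit is one branch plus exactly two stores to the shared port.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical masks for two pins sharing one port (illustrative only).
constexpr uint8_t DATA_MASK  = 1 << 2;
constexpr uint8_t CLOCK_MASK = 1 << 4;

// Shift one byte out MSB-first, recording every value stored to the port.
std::vector<uint8_t> shiftOutSamePort(uint8_t b) {
    // The four precomputed port images (dataPort itself is the fifth value).
    const uint8_t dataHiClockHi = DATA_MASK | CLOCK_MASK;
    const uint8_t dataHiClockLo = DATA_MASK;
    const uint8_t dataLoClockHi = CLOCK_MASK;
    const uint8_t dataLoClockLo = 0;

    std::vector<uint8_t> writes;
    for (int bit = 7; bit >= 0; --bit) {
        if (b & (1 << bit)) {
            writes.push_back(dataHiClockHi);  // data hi, clock rises
            writes.push_back(dataHiClockLo);  // data held hi, clock falls
        } else {
            writes.push_back(dataLoClockHi);  // data lo, clock rises
            writes.push_back(dataLoClockLo);  // data held lo, clock falls
        }
    }
    return writes;  // 16 port writes per byte: two per bit
}
```

For a byte like 0x80, the first pair of writes carries the hi data bit through a full clock pulse and the second pair carries a lo bit, and so on down the byte.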

Not to worry, though, I will come back to this and make this process even more efficient!  (I already have a plan for how to do this; however, executing it efficiently will require working around the fact that gcc is flat out stupid when it comes to generating code for switch statements, so there’s some more asm-level work I need to shake out to make this fully happy.)

Enjoy!