I know there has been quite a gap of time since the last preview release for FastSPI_LED2, almost two months!  On the one hand, I would like to apologize for the gap in time, especially since there were a handful of bugs floating around there (looking at you WS2801 and Arduino analog pins!).  On the other hand, there have been a lot of changes and additions in the time since then, so grab your preferred beverage, do the clicky on the link to go start the download of the release candidate, and settle in to read all about what’s new.

A what?

First, though, a quick definition.  What is meant by release candidate?  This is basically the code in a form fairly close to what we’ll publish/release, barring any last minute bug fixes.  Any features (new chipsets, new platforms, new techniques, new [REDACTED]) will just have to wait until the next release.  So what will happen between now and the release?  Mostly we’ll keep an eye out for (and focus on fixing) any catastrophic bugs that people run into, and we’ll work on documentation and examples to play with, and I will take some time to shift my primary focus over to a large led art installation going up in early July.

Fine, now spill!

As hinted above, there is a lot in this release candidate.  A whole lot new.  FastSPI_LED2 had a number of goals laid out for it.  The first was to continue to be a straightforward to use high performance multi-led chipset interface library, a no brainer, we even put it in the name.  How to expand on that?  Let’s add multi-platform support (starting with the arm-based teensy 3.0), done in a way that will make porting to even more platforms (due, chipkit, beagle, CoAction, etc…) easier, much much easier.  Also, let’s apply “fast”, not just to the library itself, but to how quickly someone can get up and running using it.  People following the google+ community have seen some hints of what we’re talking about here.  Also, we’re going to add “fast” to the speed with which we can add support for new chipsets.  Also, we’re going to add “fast” to more of your led related programming, and maybe even beyond that (i.e. not just fast, but more), but we’ll get to that eventually.

The new library has nearly 10 times the amount of code.  However, don’t worry!  It compiles down far smaller than the old library.  Sample applications have gone from nearly 13kb of precious program flash down to 2600 bytes.  How do we do this?  Well, those details we’ll spill more on in the future.  We’re planning a series of posts digging, in further detail, into the various ways we’ve made our library do the things it does.  As with fast, however, not only is the compiled code going to be smaller, but the amount of work that you would need to do is also going to be smaller.  How much smaller?  Check this out:

#include <FastSPI_LED2.h>
#define NUM_LEDS 150

// the led data that the controller will read from
CRGB leds[NUM_LEDS];

void setup() { LEDS.addLeds<WS2801>(leds, NUM_LEDS); }
void loop() { LEDS.showColor(CRGB::Blue); }

How about that?  No more multiple sets of initialization methods.  Also, no longer limited to one item.  You could have multiple sets of leds, even with different controllers!  And your options for configuring and tweaking those controllers have also been greatly expanded, but again, more on that later.  You will also notice that you no longer need to define your own CRGB structure.  This isn’t just cosmetic.  You will note that not only does the CRGB structure provide a way to define your set of leds, but it also provides named colors to work with!  We have all the HTML colors included in there.

You keep saying we, what’s up with that?

Ah, attention has been paid.  Yes, there are now two of us working on FastSPI_LED2.  Mark Kriegsman has joined in with a wide swath of new toys to add to the LED programming world.  As much of an optimization fanatic as I am, if not more so, he also brings a passion for the art, and for the color.  Since we’re talking about him here and now, we’ll dig into some of the large contributions that he’s making to FastSPI_LED2.

In the 2.5 years that the library has been out there people have done some amazing, incredible, beautiful things, using the library as a base, including the two of us.  It is clear to us that, for LED programming, there is more to life than simply pushing rgb values out to the strips.

For one there is navigating the color space.  A lot of work is doable just playing with RGB, both simple and advanced, and the addition of the named colors shown above makes it even easier for people to work and play with.  However, we found ourselves playing in the HSV color space a lot, mainly because this provides for some really really easy ways to navigate around colors.  It started out with me finding a library to convert HSV to RGB, and then as I tend to do, optimizing it to get better performance out of it.

Mark started working with me on this, and he started out by tuning the HSV to RGB conversion to be even smaller and faster.  Then it turned into a game, with each of us making tweaks to squeeze a few more clock cycles and bytes out of the thing (he won – at just about 1µs to convert HSV to RGB on the teensy 3).  However, that wasn’t enough for him.  Mark dug into the theory behind the HSV color space, as well as the theory and biology behind the human eye’s perception of color and brightness.  He tweaked and massaged the library further, keeping it fast but tweaking it around so that the colors looked smoother, and more natural, and better fit to our eyes.

In fact, it looks so good, and so smooth, our default color conversion uses this more natural, more balanced, system.  Of course, since we believe in choices, and speed, if you want, for your project, to use the raw, uncorrected conversion as fast as you can get it, we’ll give you that too.
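For readers who have not met the conversion before, here is a minimal sketch of the plain, uncorrected kind of HSV to RGB conversion, using integer math only.  This is the common textbook six-sector approach, not Mark's tuned version, and the names RGB8 and hsvToRgbSketch are made up for this illustration.  Hue, saturation, and value are each assumed to run from 0 to 255.

```cpp
#include <cstdint>

struct RGB8 { uint8_t r, g, b; };

// Textbook sector-based HSV -> RGB; h, s and v each span 0..255.
// The hue circle is split into six sectors of 43 hue steps each.
RGB8 hsvToRgbSketch(uint8_t h, uint8_t s, uint8_t v) {
    if (s == 0) return {v, v, v};          // no saturation: plain grey

    uint8_t region = h / 43;               // which sector, 0..5
    uint8_t rem = (h - region * 43) * 6;   // position inside the sector, 0..252

    // p: the weakest channel; q and t: the falling and rising channels
    uint8_t p = (v * (255 - s)) >> 8;
    uint8_t q = (v * (255 - ((s * rem) >> 8))) >> 8;
    uint8_t t = (v * (255 - ((s * (255 - rem)) >> 8))) >> 8;

    switch (region) {
        case 0:  return {v, t, p};
        case 1:  return {q, v, p};
        case 2:  return {p, v, t};
        case 3:  return {p, q, v};
        case 4:  return {t, p, v};
        default: return {v, p, q};
    }
}
```

Walking h from 0 to 255 with s and v held at 255 sweeps once around the color wheel, which is what makes HSV so convenient for fades and rainbows.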

Did Mark stop there, though?  No!  Of course not.

Lib8tion, or “pour yourself another”

One thing I discovered in doing LED projects with a variety of chipsets from a variety of suppliers is that every supplier has a different opinion on what the proper order of “RGB” is.  Some think it should be BGR, some think it should be GRB, one even thought it should be ЯᗺG (thankfully, no one thought it should be B0RG, or it’d all be futile!).  Previously, this was dealt with by people changing the order of R, G, and B in their struct.  However, as mentioned above, we have these lovely hsv(ish) to RGB converters.  Those work a lot better if they know what the RGB type is and the proper ordering.

To handle this, and also make it easier for people to pair up strips with different orderings on different pins, we moved specification of a strip’s RGB ordering to the point where you add a controller, and made a standardized CRGB class.  Of course, since we’re playing in C++, we can put methods on the class, and oh did we.  The full documentation is coming up, but you can do things like add rgb colors, multiply them to shift scale, and more!  Of course, as with all things, we want this to be fast, and did Mark ever deliver.  Since RGB values are 8 bit, he pulled together optimized 8 bit versions of a number of operations (multiply, divide, scaling, saturating add (e.g. add two numbers and the value is 255 if the sum would be > 255, useful when dealing with RGB values!)), and added those to the class and world.
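To make the 8 bit operations concrete, here is a rough sketch of a saturating add and a scaling multiply in portable C++.  The function names are invented for this example, and the library's own versions are hand-optimized per platform, so treat this as a description of the behavior rather than of the implementation.

```cpp
#include <cstdint>

// Saturating 8-bit add: clamps at 255 instead of wrapping around to 0.
inline uint8_t saturatingAdd8(uint8_t a, uint8_t b) {
    unsigned sum = static_cast<unsigned>(a) + b;
    return static_cast<uint8_t>(sum > 255u ? 255u : sum);
}

// Scale a value by scale/256 with one multiply and one shift, no division.
inline uint8_t scaleBy8(uint8_t value, uint8_t scale) {
    return static_cast<uint8_t>((static_cast<unsigned>(value) * scale) >> 8);
}
```

With these, saturatingAdd8(200, 100) gives 255 rather than the wrapped-around 44, and scaleBy8(255, 128) gives 127, roughly half brightness.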

Of course, we immediately looked for other places to put this.  One is, his scaling function is fast enough that the FastSPI_LED2 show functions now all take an optional brightness argument and will scale the brightness as it writes to the led strip with no increased overhead (except for one limited case, which we’ll document, but won’t affect most of you).  Let me say that again.  Global brightness FOR FREE.  Order  now, and we’ll throw global dimming for free too!

There’s more (you didn’t think we were done, did you?).  A whole library of interpolation, easing, random number generators, optimized memory operations, sin/cos.  You don’t need us to explicitly apply “fast” to all of those, do you?

You said fast, what about the leds?

You will be able to be fast with getting the colors that you want all of your leds to be.  Of course, what kind of leds are you thinking of?  We’ve got WS2801, LPD8806, WS2811/2, TM1809/4s, TM1803, UCS1903, SM16716, DMX controlled.  You want to preserve your hardware SPI port for your SD card reader?  Put your WS2801 on a different set of pins and the library will automatically fall back to high performance bit-banged output (2.5Mbps on 16MHz arduino, over 5Mbps on a 48MHz teensy 3.0).  Put multiple strips on your controller.  Of different types, even!  Also, you didn’t think we’d forget about the fast or the spi, did you?  On the teensy 3 you can push those LPD8806s at nearly 22Mbps.  If you don’t feel like playing with math at home, that’s 150 leds at over 6000 fps.  Let that sink in for a moment.  Sure, you want to do something other than push LED data, so let’s call it 3000 fps and you can spend the rest of the time thinking about what you want to show on the strip for 0.0003 seconds.
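If you do feel like playing with the math at home, the frame rate figure can be reproduced with a one-line calculation.  The helper below is a made-up name, and it assumes 24 bits per led and ignores any latch or reset time between frames, so real numbers will come in a little lower.

```cpp
#include <cstdint>

// Frames per second for a strip, ignoring latch/reset overhead between frames.
constexpr uint32_t framesPerSecond(uint32_t bitsPerSecond,
                                   uint32_t numLeds,
                                   uint32_t bitsPerLed = 24) {
    return bitsPerSecond / (numLeds * bitsPerLed);
}
```

For 150 leds at 22Mbps that is 22,000,000 / (150 * 24), or about 6111 frames per second.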

But wait, there’s more!

Alas, not in this release, there isn’t.  We do have a long list of things that will trickle out in updates over the next few months (after we take a bit of a break).  I have at least four more chipsets sitting on my desk at home, we’re investigating 16bit representations of rgb/hsv color data for insanely smooth color transitions, we’re going to push as much of the led writing to DMA on the platforms where we can (oh, hey, look, back to 6000 fps, and you get ALL of your cpu time back!), we’re going to add platforms (we’re looking at you due!), we’ve got some ideas for even more efficient support of multiple strips of certain types of leds, we’re going to add more code to support making things bright, pretty, flashy, and we’re going to keep the code as fast as we can to stay as much out of your way as possible.

And we haven’t even started talking about [REDACTED]!

As Mark said tonight, “Let a thousand pixels bloom”

The second preview release for FastSPI_LED2 is up!  Check it out on the downloads page – changes/updates include:

  • Renamed latch references to select (more accurate for what’s being done)
  • Renamed Pin class to FastPin (unlikely to be an issue unless you were directly referencing pins)
  • Add SM16716 support
  • Add pin definitions for the Arduino Mega, Teensy++2.0, and Arduino Leonardo
  • Add some useful warnings when definitions are missing

In addition – I’ve set up a google+ community for discussion/support and ongoing announcements and updates for the library.

There’s some more changes that I’d like to get put in and wrapped up before I move this library from preview to final releases, including:

  • Arduino Due support
  • DMA support on Teensy 3.0
  • Add APA102 chipset support
  • Provide more example code
  • Allow handling strips that confuse their RGB ordering

FastSPI_LED2 is a rebuild from the ground up using multiple layers of components that, it turns out, may have usages outside of just the LED library! I’ve already described the FastPin library that drives FastSPI_LED2 at its lowest levels. Now it is time to talk about the next layer of components in the library, FastSPI, the piece that gives the library the SPI part of its name and functionality.

As with the Pin class, this portion of the library at the moment is optimized/tuned for writing data. A near future revision will include support for reading as well. For now, though, if you want to pump data out over SPI, and not worry about the specifics of hardware vs. software bitbanging, or pin twiddling, and you want it to be as high performant as possible, this library may be useful to you. As an end user of the library, using the code is fairly simple. Here’s an example:

#include "FastSPI_LED2.h"
SPIOutput<11, 13, 10, 0> SPI;
void setup() { SPI.init(); }
void loop() { 
  // setup some data in a buffer (pData, nBufferLen), then push it out
  SPI.writeBytes(pData, nBufferLen);
}

Pretty straightforward, no?  The template parameters to the SPIOutput definition are the data pin, clock pin, latch pin, and speed identifier, which is a value from 0-255.  If you happened to select a set of pins that maps to the hardware SPI on the platform you are building for, the library will transparently use that, otherwise, the library will attempt to bitbang using the highest performing options available to it from the Pin library discussed in the FastPin post  (there is a third mode, that has been coded, but not yet tested/debugged, and is currently disabled, which is to abuse the USART that’s available on some platforms in SPI mode to give you a different set of pin options for hardware SPI output).  Of course, what it does under the hood involves a little bit more code, but code that attempts to pull the most out of the platform you’re building for, and again, attempting to save you the trouble of having to deal with most of it.

If you instantiate SPIOutput with a set of pins that lines up to a known set of pins with hardware SPI available under the hood, then what actually gets instantiated is a class called AVRSPIHardwareOutput.  This is a class whose init function does the AVR SPI initialization with the SPCR/SPSR registers.  Then, the writeBytes function is basically a tight loop over your bytes of data pushing out a byte at a time using the SPDR register.  The internal implementation also plays some games with delays to actually speed things up.  How does that work?  Simple, delaying a couple of cycles before checking the status register for writing out the next byte increases the odds that the check will be ready on the first time through the loop, rather than it becoming ready part-way through the loop, requiring one full cycle of the loop where it isn’t needed.  Some of this tuning allows getting over 6.5Mbps real world data output.  I suspect it would be possible to pull even more performance out playing some games with loop unrolling of various types.  Things that will be investigated with future library revisions.

Where things get interesting in the code is with the software bitbang output.  The current version of the code can get over 2.5 Mbps data pushed out using software.  Here’s a rough description of the games played in the library.  If the code knows for sure that no other output is going to be going on while writing out SPI data (which is the default assumption, for the moment), then there’s a handful of specific optimizations that we can make, that are slightly different based on whether the clock/data pins are on the same port or separate ports.

First, we’ll look at if they’re on separate ports.  Before we start writing data out, there’s 6 pieces of information that we cache.  The port address for the data pin (dataPort), the port address for the clock pin (clockPort), the value we want to write to the data port for a hi data value (dataHi), the value we want to write to the data port for a lo data value (dataLo), the value we want to write to the clock port for a hi clock value (clockHi), and the value we want to write to the clock port for a low clock value (clockLo).  Once we have that, we will iterate over all the bytes, and for each bit, in each byte, there’s code that looks like the following:

		if(b & (1 << BIT)) {
			Pin<DATA_PIN>::fastset(dataPort, dataHi);
		} else { 
			Pin<DATA_PIN>::fastset(dataPort, dataLo);
		}
		Pin<CLOCK_PIN>::fastset(clockPort, clockHi);
		Pin<CLOCK_PIN>::fastset(clockPort, clockLo);

So, for each bit, we do a check to see whether the bit is hi or lo. Then there is a write to the data pin, using the appropriate precached hi/lo value. Then a pair of writes to the clockPort for the hi/lo pin values. Now, recall that the Pin library goes for direct port access when it can, allowing us to use the 1 cycle out opcode instead of the 2 cycle store opcode. If the Pin library can, then the value for the data/clockPort fields is silently ignored (and the compiler will take care of removing all, now unnecessary, references to it). So, best case, for each bit we have the cost of the “is the bit set or not” check, and then 3 write operations, either at 3 cycles if we can use out, or 6 cycles if we’re using st.

Why the dataHi/dataLo values? Well, the normal way to do something like this would be to say something like “dataPort |= Pin<DATA_PIN>::mask()” – basically or’ing hi the specific bit that maps to our pin. However, that operation either requires a 2 cycle load, the or, and a 2 cycle store, or if we have direct out registers a 1 cycle load, the or, and a 1 cycle store. For each bit. Ew. However, remember above when I made the assumption that no other pins were going to be getting written? If that is the case, and data and clock are on different ports, then I can pre-cache the values for |= (to set hi) and &=~ (to set lo). That gets rid of the need to do the load and the or for each bit. This dance is a little bit easier on the ARM platform where, in addition to the equivalent to AVR’s PORT register, there’s also specific SET/CLEAR ports, where you can write a mask value to those ports, and only the 1’s in the mask have their state set/cleared, the other pins get ignored, saving the need to do the load and the and/or.
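The pre-caching idea can be sketched in a few lines of ordinary C++.  The struct and function names below are hypothetical, and a plain byte stands in for the hardware port register so the sketch can be run on a desktop.

```cpp
#include <cstdint>

// The two values to write for one pin, computed once before the bit loop.
struct PortValues {
    uint8_t hi;  // port value that drives the pin high
    uint8_t lo;  // port value that drives the pin low
};

// Do the load and the or/and a single time, up front.
// Only valid while nothing else writes to the same port.
inline PortValues precache(uint8_t portState, uint8_t pinMask) {
    return { static_cast<uint8_t>(portState | pinMask),
             static_cast<uint8_t>(portState & ~pinMask) };
}
```

Inside the loop each bit then costs a single store of hi or lo, instead of a load, an or/and, and a store.  The price is the stated assumption: if an interrupt handler changed another pin on that port mid-transfer, the next cached write would undo it.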

So, we’re down to, best case, 2 clocks for the check/jump, and then 3 clocks for actually setting the pin values. However, this library isn’t called FastSPI for nothing. There’s a way we can still go faster, peeling off another cycle or 2. How? Well, if both the clock and data pins are on the same port then we only need to precompute/stash in registers, five values:

  • the port register for the data/clock pins (dataPort)
  • the value for the register with the data pin hi and the clock pin hi (dataHiClockHi)
  • the value for the register with the data pin hi and the clock pin lo (dataHiClockLo)
  • the value for the register with the data pin lo and the clock pin hi (dataLoClockHi)
  • the value for the register with the data pin lo and the clock pin lo (dataLoClockLo)

(Beginning to see where this is going to go?)  Now, with these five values stashed away in registers, we have code that looks like this:

		if(b & (1 << BIT)) {
			Pin<DATA_PIN>::fastset(dataPort, dataHiClockHi);
			Pin<DATA_PIN>::fastset(dataPort, dataHiClockLo);
		} else { 
			Pin<DATA_PIN>::fastset(dataPort, dataLoClockHi);
			Pin<DATA_PIN>::fastset(dataPort, dataLoClockLo);
		}

Now, for each bit we have our couple cycles for whether the bit is set or not, then only -two- writes total. One that sets the data pin hi/lo appropriately and the clock hi, and one that keeps the data pin set appropriately and drops the clock lo.  Combine some creative (ab)use of templates and loop unrolling to try and minimize as much overhead as we can, and we get quite a bit of performance squeezed out of the world here.  There is still some room for me to squeeze even more performance out of this.  However, 2.7Mbps at the moment is a nice level to have hit, and there were other pieces in the library that needed love.
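To see the two-writes-per-bit scheme working end to end, here is a host-side simulation, offered as a sketch rather than the library's code.  Port writes are appended to a log instead of touching hardware, the pin bit positions are made up, and sending the most significant bit first is an assumption.  A small decoder then reads the log back the way a receiver would, sampling the data bit on every write where the clock is high.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical bit positions of the data and clock pins within one port.
constexpr uint8_t DATA_MASK  = 0x08;
constexpr uint8_t CLOCK_MASK = 0x20;

// Returns the sequence of values written to the port while sending one byte.
std::vector<uint8_t> bitbangByte(uint8_t b, uint8_t portState) {
    // Precompute the four combined values once; other pins keep their state.
    const uint8_t base = static_cast<uint8_t>(portState & ~(DATA_MASK | CLOCK_MASK));
    const uint8_t dataHiClockHi = base | DATA_MASK | CLOCK_MASK;
    const uint8_t dataHiClockLo = base | DATA_MASK;
    const uint8_t dataLoClockHi = base | CLOCK_MASK;
    const uint8_t dataLoClockLo = base;

    std::vector<uint8_t> writes;
    for (int bit = 7; bit >= 0; --bit) {   // most significant bit first
        if (b & (1 << bit)) {
            writes.push_back(dataHiClockHi);
            writes.push_back(dataHiClockLo);
        } else {
            writes.push_back(dataLoClockHi);
            writes.push_back(dataLoClockLo);
        }
    }
    return writes;
}

// Rebuild the byte by sampling the data pin on each clock-high write.
uint8_t decode(const std::vector<uint8_t>& writes) {
    uint8_t value = 0;
    for (uint8_t w : writes) {
        if (w & CLOCK_MASK) {
            value = static_cast<uint8_t>((value << 1) | ((w & DATA_MASK) ? 1 : 0));
        }
    }
    return value;
}
```

Each byte produces exactly sixteen port writes, and decoding the log gives back the byte that went in.  In the real library the loop is unrolled and the writes go straight to the port register, so none of the vector overhead exists there.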

Not to worry though, I will come back to this and make this process even more efficient!  (I already have a plan for how to do this, however executing it efficiently will require working around the fact that gcc is flat out stupid when it comes to generating code for switch statements, so there’s some more asm level work I need to shake out to make this fully happy).

Enjoy! Next up

FastSPI_LED2 is a rebuild from the ground up using multiple layers of components that, it turns out, may have usages outside of just the LED library!  The lowest level of these is the Pin access library.  This is designed to allow me to write higher level code accessing pins, using the fastest mechanisms available on known platforms, and falling back to workable methods on arduino platforms that don’t have all the information in place for Mostest Speed[tm].  The goal with the Pin class is to make pin access as easy/trivial as possible, and also as portable as possible.  For example, here’s a simple bit of code to blink a pin:

#include "FastSPI_LED2.h"

void setup() { Pin<13>::setOutput(); }
void loop() { Pin<13>::hi(); delay(200); Pin<13>::lo(); delay(200); }
// even shorter: void loop() { Pin<13>::toggle(); delay(200); }

and that’s it!  What you don’t see, though, is the mechanism that is used under the hood.  Most introductory arduino code recommends using methods like digitalWrite – which has a lot of overhead for turning a pin on or off.  While that may not be important for a simple example like this, with something like FastSPI_LED2, high performance is part of the name and the game!  The Pin class/library, under the hood, tries to use the most efficient/speedy method for twiddling the pin that it can.

For example, on an AVR, the hi() call compiles down to a single avr instruction (sbi, which takes two clock cycles on the classic ATmega parts), seen in the below disassembly:

000000aa <loop>:
  aa:   2d 9a           sbi     0x05, 5 ; 5

On an arduino where I haven’t yet defined pin mappings, this code looks more like:

00000100 <loop>:
 100:   e0 91 0a 01     lds     r30, 0x010A
 104:   f0 91 0b 01     lds     r31, 0x010B
 108:   80 81           ld      r24, Z
 10a:   90 91 09 01     lds     r25, 0x0109
 10e:   89 2b           or      r24, r25
 110:   80 83           st      Z, r24

A few more instructions, part of why I’m working on making sure I get pin definitions in for as many platforms as possible.  Still though, better than using digitalWrite which becomes:

00000100 <loop>:
 100:   8d e0           ldi     r24, 0x0D       ; 13
 102:   61 e0           ldi     r22, 0x01       ; 1
 104:   0e 94 b5 01     call    0x36a   ; 0x36a <digitalWrite>

and that digitalWrite code? Well, let’s take a look at the disassembly of digitalWrite:

0000036a <digitalWrite>:
 36a:   48 2f           mov     r20, r24
 36c:   50 e0           ldi     r21, 0x00       ; 0
 36e:   ca 01           movw    r24, r20
 370:   82 55           subi    r24, 0x52       ; 82
 372:   9f 4f           sbci    r25, 0xFF       ; 255
 374:   fc 01           movw    r30, r24
 376:   24 91           lpm     r18, Z+
 378:   ca 01           movw    r24, r20
 37a:   86 56           subi    r24, 0x66       ; 102
 37c:   9f 4f           sbci    r25, 0xFF       ; 255
 37e:   fc 01           movw    r30, r24
 380:   94 91           lpm     r25, Z+
 382:   4a 57           subi    r20, 0x7A       ; 122
 384:   5f 4f           sbci    r21, 0xFF       ; 255
 386:   fa 01           movw    r30, r20
 388:   34 91           lpm     r19, Z+
 38a:   33 23           and     r19, r19
 38c:   09 f4           brne    .+2             ; 0x390 <digitalWrite+0x26>
 38e:   40 c0           rjmp    .+128           ; 0x410 <digitalWrite+0xa6>
 390:   22 23           and     r18, r18
 392:   51 f1           breq    .+84            ; 0x3e8 <digitalWrite+0x7e>
 394:   23 30           cpi     r18, 0x03       ; 3
 396:   71 f0           breq    .+28            ; 0x3b4 <digitalWrite+0x4a>
 398:   24 30           cpi     r18, 0x04       ; 4
 39a:   28 f4           brcc    .+10            ; 0x3a6 <digitalWrite+0x3c>
 39c:   21 30           cpi     r18, 0x01       ; 1
 39e:   a1 f0           breq    .+40            ; 0x3c8 <digitalWrite+0x5e>
 3a0:   22 30           cpi     r18, 0x02       ; 2
 3a2:   11 f5           brne    .+68            ; 0x3e8 <digitalWrite+0x7e>
 3a4:   14 c0           rjmp    .+40            ; 0x3ce <digitalWrite+0x64>
 3a6:   26 30           cpi     r18, 0x06       ; 6
 3a8:   b1 f0           breq    .+44            ; 0x3d6 <digitalWrite+0x6c>
 3aa:   27 30           cpi     r18, 0x07       ; 7
 3ac:   c1 f0           breq    .+48            ; 0x3de <digitalWrite+0x74>
 3ae:   24 30           cpi     r18, 0x04       ; 4
 3b0:   d9 f4           brne    .+54            ; 0x3e8 <digitalWrite+0x7e>
 3b2:   04 c0           rjmp    .+8             ; 0x3bc <digitalWrite+0x52>
 3b4:   80 91 80 00     lds     r24, 0x0080
 3b8:   8f 77           andi    r24, 0x7F       ; 127
 3ba:   03 c0           rjmp    .+6             ; 0x3c2 <digitalWrite+0x58>
 3bc:   80 91 80 00     lds     r24, 0x0080
 3c0:   8f 7d           andi    r24, 0xDF       ; 223
 3c2:   80 93 80 00     sts     0x0080, r24
 3c6:   10 c0           rjmp    .+32            ; 0x3e8 <digitalWrite+0x7e>
 3c8:   84 b5           in      r24, 0x24       ; 36
 3ca:   8f 77           andi    r24, 0x7F       ; 127
 3cc:   02 c0           rjmp    .+4             ; 0x3d2 <digitalWrite+0x68>
 3ce:   84 b5           in      r24, 0x24       ; 36
 3d0:   8f 7d           andi    r24, 0xDF       ; 223
 3d2:   84 bd           out     0x24, r24       ; 36
 3d4:   09 c0           rjmp    .+18            ; 0x3e8 <digitalWrite+0x7e>
 3d6:   80 91 b0 00     lds     r24, 0x00B0
 3da:   8f 77           andi    r24, 0x7F       ; 127
 3dc:   03 c0           rjmp    .+6             ; 0x3e4 <digitalWrite+0x7a>
 3de:   80 91 b0 00     lds     r24, 0x00B0
 3e2:   8f 7d           andi    r24, 0xDF       ; 223
 3e4:   80 93 b0 00     sts     0x00B0, r24
 3e8:   e3 2f           mov     r30, r19
 3ea:   f0 e0           ldi     r31, 0x00       ; 0
 3ec:   ee 0f           add     r30, r30
 3ee:   ff 1f           adc     r31, r31
 3f0:   ee 58           subi    r30, 0x8E       ; 142
 3f2:   ff 4f           sbci    r31, 0xFF       ; 255
 3f4:   a5 91           lpm     r26, Z+
 3f6:   b4 91           lpm     r27, Z+
 3f8:   2f b7           in      r18, 0x3f       ; 63
 3fa:   f8 94           cli
 3fc:   66 23           and     r22, r22
 3fe:   21 f4           brne    .+8             ; 0x408 <digitalWrite+0x9e>
 400:   8c 91           ld      r24, X
 402:   90 95           com     r25
 404:   89 23           and     r24, r25
 406:   02 c0           rjmp    .+4             ; 0x40c <digitalWrite+0xa2>
 408:   8c 91           ld      r24, X
 40a:   89 2b           or      r24, r25
 40c:   8c 93           st      X, r24
 40e:   2f bf           out     0x3f, r18       ; 63
 410:   08 95           ret

Quite a bit of difference in generated code output, no?  The Pin library is for those times when you absolutely have to bitbang (either you can’t use the hardware SPI port, or you’re doing something with pins that doesn’t involve SPI or anything SPI like at all, I’m looking at you WS2811) but still want it to be as fast as possible.  Also – this library works on the teensy 3.0 arm platform as well, reducing the hi/lo calls to just a load and a write.  (The load is required because the GPIO registers are in a high block of memory, you need a full 32 bits to represent them, so the address to the GPIO location for a pin needs to be loaded into a register, then you can push a pin value into it).

Right now, the Pin class is written and tuned for high performance output.  Setting pins, toggling pins, with a variety of support functions to help achieve higher performance, even in environments where the pin->GPIO port mapping can’t happen at compile time.  A future post will detail how to use some of these other methods in the Pin class to squeeze the most performance out of your bit twiddling code (for example, when bitbanging SPI output, the inner loop of the fast SPI code can push out a bit every 4 cycles – 2 to determine if the current bit is hi or lo, 2 to set the data line appropriately and strobe the clock)!  In addition, a future revision of the FastSPI_LED2 library will update the Pin class to support reading data/values as well as writing them.
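For the curious, here is a rough sketch of how a template keyed on the pin number lets the compiler resolve the port and bit mask at compile time.  Everything in it is hypothetical: the names are invented, and two plain bytes stand in for the memory-mapped port registers so that it runs on a desktop.  Unlike the real Pin class, this sketch has no slower fallback, so using a pin with no definition simply fails to compile.

```cpp
#include <cstdint>

// Mock "port registers"; on real hardware these would be memory-mapped I/O.
static uint8_t mockPortB = 0;
static uint8_t mockPortD = 0;

// Primary template: one specialization per known pin, chosen at compile time.
template <uint8_t PIN> struct PinSketch;

// Helper macro that stamps out the specialization for a pin/port/bit triple.
#define DEFINE_PIN_SKETCH(PIN, PORT, BIT)                                   \
    template <> struct PinSketch<PIN> {                                     \
        static constexpr uint8_t mask() { return 1 << (BIT); }              \
        static void hi()     { PORT |= mask(); }                            \
        static void lo()     { PORT &= static_cast<uint8_t>(~mask()); }     \
        static void toggle() { PORT ^= mask(); }                            \
    };

DEFINE_PIN_SKETCH(13, mockPortB, 5)   // Arduino Uno pin 13 is PORTB bit 5
DEFINE_PIN_SKETCH(2,  mockPortD, 2)   // Arduino Uno pin 2 is PORTD bit 2
```

Because the port and mask are compile-time constants, a call like PinSketch<13>::hi() gives the compiler everything it needs to emit a single bit-set instruction on a real AVR, which matches the sbi 0x05, 5 seen in the disassembly earlier.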

FastSPI_LED2, The Introducing

The FastSPI_LED library has been growing a little long in the tooth.  It is monolithic, fairly tied to the arduino platform, has a number of design decisions that reflect its scattershot growth and direction, and more.  In spite of that, it has been downloaded over five thousand times (damn, what are you all doing with it?  I’d love to see the projects!).  However, as bug reports, feature requests (eg playing nicely with other SPI devices), new led chipsets, new host chipsets (teensy 3.0, chipkit, msp430, etc..) come in it has become increasingly clear that what was needed was a cleaning of house.  So I stepped back and thought about the evolution of the code and the library over the past two years and where I wanted it to go, and what features were important going forward.  Then I sat down and sketched out a rough structure for the library going forward, one that focused on easy extensibility, easy portability, and of course, high performance squeezed out of small platforms.

After that, and a few weekends/weeks of grabbing blocks of time here and there to work out code and ideas, I now have code!  Code that appears to work on arduino, teensy 2.0 and teensy 3.0!  Code that appears to be speedy!  Code that makes things glow!  Since I have code that means that you too, shall shortly have code.  First, some details about FastSPI_LED2 as people will be able to use it out of the gate:

  • Initial (tested) platform support: Arduino (uno/nano), Teensy 2.0, and Teensy 3.0 (yes, ARM!)
  • Initial (tested) chipset support: ws2801, lpd8806, ucs1903, tm1809/1804, tm1803, ws2811
  • Silent switching between hardware SPI (6.6+ Mbps) usage and high performance software SPI (2.7Mbps on 16MHz arduino!) based on selected pins
  • Software based SPI increases ability to play nicely with other SPI libraries/data
  • Ability to run multiple sets of led strips in parallel on different pins
  • Small when compiled, depending on chipsets/options, as little as a few hundred bytes (compared with over 12kb in the old library)
  • Support for using ARGB data structures (up to you to do your own A based blending, however!).  Not yet supported on all chipsets.

In addition, there’s some other pieces of the library and code that I want to call out, and will go into more detail about in subsequent posts:

  • High performance pin access library exposed for use (write only, for now)
  • High performance SPI library available for direct use (write only, for now)
  • Rapid ability to add support for new LED chipsets going forward
  • (Hopefully) Rapid ability to port the codebase to new platforms

Of course, now I imagine there are two questions being asked.  The first being where one can obtain the code, the second being how to use the code.  This being a preview release, the interface to the library is still a little bit on the raw side, and there will need to be some minor changes to your setup and show code to start using the new library.   So, grab the code from the googlecode site, and settle in for some notes on porting code over to the new library.  (This information, as it gets fleshed out, will find its way into in-library documentation files).

Porting Code

The old code involved calling a number of functions to set up the various parameters for the LED library, and then you would get a pointer to the block of memory that the library used for writing from.  This brings us to the first major set of changes to the library.  This version of the library focuses on the LED controllers, and -not- on managing your LED data itself.  So, the first major change is that responsibility for setting up the array of RGB data structures moves from the FastSPI_LED library to you, and when it comes time to call show on the controller, you give the show method a pointer to where your rgb data is.  The second major change is you have to declare your LED controller object – one object per set of leds you want to control.  So, for example, here’s some sample code for making all of your leds blink red:

#include "FastSPI_LED2.h"
#define NUM_LEDS 32

// Simple rgb data structure
struct CRGB { byte r; byte g; byte b; };
struct CRGB ledData[NUM_LEDS];

// Setup/define the Led controller with data pin 11, clock pin 13, and latch pin 10 
// this will trigger use of the hardware SPI support on the arduino uno
LPD8806Controller<11, 13, 10> LedController;

void setup() { 
  // zero out all the leds
  memset(ledData, 0, sizeof(struct CRGB) * NUM_LEDS); 

  // initialize the controller
  LedController.init();
}

void loop() { 
  // set all the LEDs to red
  for(int i = 0; i < NUM_LEDS; i++) { ledData[i].r = 255; }
  LedController.showRGB((byte*)ledData, NUM_LEDS);
  delay(200);  // hold long enough for the blink to be visible

  // zero out all the leds
  memset(ledData, 0, sizeof(struct CRGB) * NUM_LEDS);
  LedController.showRGB((byte*)ledData, NUM_LEDS);
  delay(200);
}

Pretty straightforward, no?  Apologies for the need to cast to a byte* – it is something that should get fixed up fairly quickly.  Also I need to add controller definitions and options for various predefined pin setups (e.g. make it easier to specify hardware SPI ports for a platform that has them).  All of the LED controller classes derive from a base class called CLEDController, so if you want to pass your controller objects around between functions, you can just pass a reference/pointer to an object of type CLEDController, and the Right Things[tm] will happen. For now, take a peek in the FastSPI_LED2.h header file for the supported chipset classes, or look at the code sample in examples for a variety of (commented out) instantiations for various chipsets.
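To make the "pass a CLEDController around" idea concrete, here is a minimal desktop sketch. Everything in it is a stand-in: the CLEDController below is a reduced mock with only the showRGB call used in the sample above, and RecordingController and flashRed are hypothetical names invented for illustration, not part of the library.

```cpp
#include <cstdint>
#include <vector>

// Reduced stand-in for the library's base class (assumption: only the
// showRGB call from the sample above is modelled here).
class CLEDController {
public:
    virtual ~CLEDController() {}
    virtual void showRGB(uint8_t *data, int nLeds) = 0;
};

// Hypothetical controller that records each frame instead of driving hardware.
class RecordingController : public CLEDController {
public:
    std::vector<std::vector<uint8_t> > frames;
    void showRGB(uint8_t *data, int nLeds) override {
        frames.push_back(std::vector<uint8_t>(data, data + nLeds * 3));
    }
};

struct CRGB { uint8_t r; uint8_t g; uint8_t b; };

// Works with any controller type because it only sees the base class.
void flashRed(CLEDController &controller, CRGB *leds, int nLeds) {
    for (int i = 0; i < nLeds; i++) { leds[i].r = 255; leds[i].g = 0; leds[i].b = 0; }
    controller.showRGB(reinterpret_cast<uint8_t *>(leds), nLeds);
    for (int i = 0; i < nLeds; i++) { leds[i].r = 0; leds[i].g = 0; leds[i].b = 0; }
    controller.showRGB(reinterpret_cast<uint8_t *>(leds), nLeds);
}
```

The point of the design is that flashRed never needs to know which chipset it is talking to; swapping the controller object is the only change a caller makes.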

What’s next?

As mentioned multiple times, this is a preview release of the library.  I’m about to take off on some traveling and wanted to get this out there for people to start looking at.  There may be bugs.  Not all the platform combinations are complete, and there are more platforms I want to fully support.  The good news is that any arduino based environment not explicitly mentioned above should work out of the box, albeit a little slower, thanks to various rounds of graceful fallback in the Pin and FastSPI portions of the library.  However, as with any preview type release, I already have a handful of known issues and specific things that I want to fix before a full release or immediately after the initial release, including:

  • Software SPI optimizations assume no other pin writes will occur while outputting led data.  There is a fallback, slightly slower, but definitely safer, when there are other writes occurring.  I simply need a way to expose this to the user
  • Cleaner high level interface to the controllers, including the ability to manage multiple controllers
  • Documentation and sample/example code
  • Wider arduino platform support
  • Add ability to specify select pins to the non-SPI based chipsets (my test harness will make use of this to use a mux to switch between chipsets under test)
  • Add the ability to specify a -set- of pins to be triggered as a select for a controller (again, for use with abovementioned mux)
  • More performance tuning/code cleanup
  • Getting rid of the need to explicitly cast to byte* when calling show
  • Support for specifying the RGB ordering of bytes on a controller by controller basis
  • Support for ‘background output’ of LED data (teensy 3.0 has a DMA subsystem!)
  • Support for arduino-less AVR environments
  • Support for a few more led chipsets!
  • Some support library surprises
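The first item in the list above (software SPI assuming nobody else writes to the port) is easier to see in code. What follows is only one plausible reading of that trade-off, run against a simulated port register; simPort, the two masks, and both function names are assumptions made up for this sketch, not the library's internals.

```cpp
#include <cstdint>

static volatile uint8_t simPort = 0;  // simulated output port register

const uint8_t DATA_MASK = 0x01;   // hypothetical data pin, bit 0
const uint8_t CLOCK_MASK = 0x02;  // hypothetical clock pin, bit 1

// Fast path: read the port once, then write precomputed values.
// Any change another writer makes to the port mid-transfer is overwritten.
void writeByteCached(uint8_t b) {
    const uint8_t base = simPort & static_cast<uint8_t>(~(DATA_MASK | CLOCK_MASK));
    for (int bit = 7; bit >= 0; bit--) {
        const uint8_t data = ((b >> bit) & 1) ? DATA_MASK : 0;
        simPort = base | data;               // data valid, clock low
        simPort = base | data | CLOCK_MASK;  // rising clock edge latches the bit
    }
    simPort = base;                          // leave both lines low
}

// Safe path: read-modify-write on every edge, so other bits are preserved
// even if something else changes them between edges.
void writeByteSafe(uint8_t b) {
    for (int bit = 7; bit >= 0; bit--) {
        if ((b >> bit) & 1) simPort = simPort | DATA_MASK;
        else simPort = simPort & static_cast<uint8_t>(~DATA_MASK);
        simPort = simPort | CLOCK_MASK;
        simPort = simPort & static_cast<uint8_t>(~CLOCK_MASK);
    }
}
```

The cached version does one port read per byte instead of several per bit, which is where the speed would come from; the cost is that an interrupt handler toggling another pin on the same port would have its write undone.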

It’s been a fun 2 and a half years making and growing this library.  I’m really happy to finally be dragging this library kicking and screaming into 2013, and I’m looking forward to the next few years of this library and the things you all do with it!


Also, a shout out to blog posts that contributed to various directions my brain went into and/or information used in refining the code in this library:

Upcoming major library revision

I’ve finally gotten to dig in to a major rewrite that the FastSPI_LED library has been needing for a while.  The goal is to simultaneously make the library smaller, faster, more flexible, more portable, more maintainable, more easily extensible, and more.

I know at one point I railed against software bitbanging, and for the highest performance, that is still the case.  The new version of the FastSPI_LED library pushes the hardware SPI port up to over 6.6Mbps (up from 5Mbps).  However, if you want to use the SPI port for something else, the library will silently fall back to a bitbanging mode.  Of course, this wouldn’t be my library if it wasn’t fast – and it can push data out at 3.1Mbps when bitbanging.  This is faster than some other code can manage their hardware SPI!

The new library is also more split out into components.  Doing something where you’re writing data out to SPI, but not using LEDs?  You will be able to use the high performance SPI code independently of the rest of the library for your SPI data output (I will add support for reading over SPI in time), and this version of the SPI library will use channel select lines (though you may need to add some hardware to make it all work).

Yes – that was lines plural, up there.  Another feature of the new library is the ability to support multiple chipsets and multiple sets of output simultaneously.  Want to drive a WS2811 strip and a WS2801 strip on hardware SPI and an LPD8806 on software SPI at the same time?  You can.

The chipsets that have no clock line (TM1809, WS2811, UCS1903, etc…) have also gotten some love.  The timing for these chips is now rock solid and consistent.  Even better, I’ve made it dead simple to add support for new chipsets in this style in the future.  I simply instantiate a template with 3 pieces of timing info in nanoseconds et voilà!  Support for a new chipset.  By using ns timing values for tuning the code, it also means supporting different clock rates is automatic (AVRs run at 8MHz, 16MHz, or 20MHz, and the teensy 3.0 can run at 24MHz, 48MHz, or even 96MHz – now, I don’t need to change code at all to support a new clock speed with a new chipset!)
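As a rough sketch of why nanosecond timings make clock speed a non-issue: the conversion to cycles can be done entirely at compile time. The names nsToCycles and ClocklessTiming are hypothetical, and the 250/500/500 ns figures are illustrative numbers only, not any real chipset's datasheet values.

```cpp
#include <cstdint>

// Round a duration in nanoseconds up to whole CPU clock cycles:
// cycles = ceil(ns * hz / 1e9), computed in 64-bit integers.
constexpr uint32_t nsToCycles(uint64_t ns, uint64_t cpuHz) {
    return static_cast<uint32_t>((ns * cpuHz + 999999999ULL) / 1000000000ULL);
}

// Hypothetical three-value timing description for a clockless chipset.
template <uint32_t T1_NS, uint32_t T2_NS, uint32_t T3_NS, uint64_t CPU_HZ>
struct ClocklessTiming {
    static constexpr uint32_t t1 = nsToCycles(T1_NS, CPU_HZ);
    static constexpr uint32_t t2 = nsToCycles(T2_NS, CPU_HZ);
    static constexpr uint32_t t3 = nsToCycles(T3_NS, CPU_HZ);
};

// Same nanosecond values, two different clock speeds.
typedef ClocklessTiming<250, 500, 500, 16000000ULL> ExampleAt16MHz;
typedef ClocklessTiming<250, 500, 500, 96000000ULL> ExampleAt96MHz;

static_assert(ExampleAt16MHz::t1 == 4, "250 ns is 4 cycles at 16 MHz");
static_assert(ExampleAt96MHz::t1 == 24, "250 ns is 24 cycles at 96 MHz");
```

Rounding up rather than down is a choice made for this sketch; it errs on the side of holding a line slightly too long instead of too short.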

The library also includes a high-performance pin access library.  Set a pin in as few as 2 clocks (1 clock if you do some setup work beforehand, useful in tight loops, like, say, when bitbanging SPI output!).  Here’s what a simple blink app will look like:

void setup() { Pin<13>::setOutput(); } 
void loop() { Pin<13>::hi(); delay(20); Pin<13>::lo(); delay(20); }
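For anyone wondering how a templated pin class can be that cheap, here is a small model of the idea that runs on a desktop. It is a guess at the general technique, not the library's Pin class: SimPin, fakeDDR, and fakePORT are invented for this sketch, and the registers are plain variables standing in for memory-mapped I/O.

```cpp
#include <cstdint>

// Simulated 8-bit port registers; on real hardware these would be the
// memory-mapped I/O registers for the port.
static volatile uint8_t fakeDDR = 0;   // direction register
static volatile uint8_t fakePORT = 0;  // output register

// The pin number is a template parameter, so the bit mask is a
// compile-time constant and no lookup happens at run time.
template <uint8_t BIT>
class SimPin {
    static_assert(BIT < 8, "this sketch models a single 8-bit port");
public:
    static constexpr uint8_t mask = static_cast<uint8_t>(1u << BIT);
    static void setOutput() { fakeDDR = fakeDDR | mask; }
    static void hi() { fakePORT = fakePORT | mask; }
    static void lo() { fakePORT = fakePORT & static_cast<uint8_t>(~mask); }
    static bool isHi() { return (fakePORT & mask) != 0; }
};
```

Because port and mask are both known at compile time, a compiler targeting AVR can typically reduce calls like hi() and lo() to single bit-set or bit-clear instructions, which is consistent with the cycle counts quoted above.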

The core code is written and undergoing testing now on the various arduinos I have access to (uno, mega, nano, pro-mini), as well as some arduino derived platforms (have both a teensy 2.0 and teensy 3.0 handy).  I still have some high level support code to write to make things easier to work with, and then it should be all wrapped up and ready to go.

In addition, over the next few weeks I hope to write up a number of posts describing the various ways that I squeezed nearly every last clock cycle out of the arduino to give you more CPU time to spend coming up with interesting things to paint your LEDs with.

Arbitrary delay looping…

While poking around online today and working on a major rewrite of FastSPI_LED – I discovered that there are apps online for generating/computing delay loop asm code for AVR and friends.  I figured, why not let the compiler do this for me?

Here’s the code snippet that I’m using to generate exacting delay loops up to 767 cycles – useful when doing things that require you to delay for an exact number of cycles:

#if defined(__arm__)
# define NOP __asm__ __volatile__ ("nop\n");
#else
# define NOP __asm__ __volatile__ ("cp r0,r0\n");
#endif

// predeclaration to not upset the compiler
template<int CYCLES> inline void delaycycles();

// worker template - this will nop for LOOP * 3 + PAD cycles total
template<int LOOP, int PAD> inline void _delaycycles() {
    // burn the PAD leftover cycles first (0, 1, or 2 NOPs)
    delaycycles<PAD>();
    // the loop below is 3 cycles * LOOP. the LDI is one cycle,
    // the DEC is 1 cycle, the BRNE is 2 cycles if looping back and
    // 1 if not (the LDI balances out the BRNE being 1 cycle on exit)
    __asm__ __volatile__ (
        "      LDI R16, %0\n"
        "L_%=: DEC R16\n"
        "      BRNE L_%=\n"
        : /* no outputs */
        : "M" (LOOP)
        : "r16"
    );
}

// usable definition
template<int CYCLES> inline void delaycycles() {
    _delaycycles<CYCLES / 3, CYCLES % 3>();
}

// pre-instantiations for values small enough to not need the loop
template<> inline void delaycycles<0>() {}
template<> inline void delaycycles<1>() {NOP;}
template<> inline void delaycycles<2>() {NOP;NOP;}
template<> inline void delaycycles<3>() {NOP;NOP;NOP;}
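The 767 cycle ceiling falls out of the arithmetic: the loop counter lives in one 8-bit register, so at most 255 loop iterations of 3 cycles each, plus up to 2 padding NOPs. Here is a portable model of that accounting with no asm in it. DelayPlan is a hypothetical helper for checking the numbers, not part of the snippet above.

```cpp
// Portable model of the cycle accounting in the delay templates (no asm).
template <int CYCLES>
struct DelayPlan {
    static constexpr int loops = CYCLES / 3;       // iterations of the 3-cycle loop
    static constexpr int pad = CYCLES % 3;         // leftover single-cycle NOPs
    static constexpr int total = loops * 3 + pad;  // should equal CYCLES
    static_assert(loops <= 255, "loop counter must fit in one 8-bit register");
};

// 767 = 255 * 3 + 2 is the largest delay the 8-bit counter can express.
static_assert(DelayPlan<767>::loops == 255 && DelayPlan<767>::pad == 2, "upper bound");
static_assert(DelayPlan<10>::total == 10, "3 loops plus 1 NOP");
```

Asking for DelayPlan<768> would trip the static_assert at compile time, which mirrors how the "M" constraint in the asm rejects a loop count above 255.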

Now with arms!

Well, not really arms. I’ve just put up a new version of the library that provides preliminary support for the teensy 3.0 arm based platform. What does preliminary mean? Well:

  • Chipsets that required the timer are not supported on arm, this means the *595, hl1606, and lpd6803 chipsets. I may revisit providing support for them, though some of these chipsets also involved some asm work to make things happy.
  • The TM1809, TM1803, and UCS1903 chipsets are not yet supported. I will be working on adding support for those chipsets next.
  • SPI support, at the moment, is using a compatibility library, so it is not quite as fast as it could be. Still, much faster than AVR based arduino systems, though!

I just want to say, if you haven’t checked out the Teensy 3.0 yet, you should! This chipset smokes, performance wise. You can read a bunch about the platform here. Why am I really excited about this platform for future led projects?

  • Clock speeds up to 96Mhz
  • Ability to use DMA to drive certain classes of SPI chips (aka more cpu time for what you want to do!)
  • More ram – 16kb of ram vs. 2kb or 8kb on the avr based arduinos
  • 32 bit math operations – no more burning extra clock cycles because you want numbers > 255
  • floating point support
  • and more…

I’m looking forward to making use of a variety of these features going forward, working to keep this library fast, high performance, and let you focus on thinking about what you want your lights to do, not how you’re going to get them to do it!

Breaking out

This post is going to be a slight digression from optimization stuff (which I have a few backlogged posts that I need to get out) and LEDs (which, admittedly, i’ve been fairly focused on lately).

This past week I went out to DefCon 20 – always a good time, I love going out there.  This year, Ninja Networks outdid themselves for their party badges.  They brought in their own cellphone network, NinjaTel, and provided phones to some number of us – that would play on this cellphone network.

It was a lot of fun to mess around with!  However, they had a fairly limited launcher on there for the apps that they were providing on the device.  I wanted to get/launch other apps on it.  First app that I wanted to get was the camera.

Getting what I wanted to run…

The first thing that I played around with was just getting the apps I wanted to run.  This is actually fairly easy to do with android – if you can use adb with the device.  There are commands you can use to start an app if you know its package name:

adb shell am start -a android.intent.action.MAIN

This was certainly fun to mess around with for a bit.  Also, you can get a list of installed packages equally easily:

adb shell pm list package

So – I blew some time seeing what packages were available on the device, and what I could get to run.  It looked like most everything you would expect on an android device was here, including the camera, maps, and the play store.

Some of the apps needed network connectivity, but since I could now launch arbitrary apps getting onto a different wifi network was as simple as starting🙂

…when I want it to run

Unfortunately, this isn’t exactly an ideal way to run apps.  You’d have to stay tethered to the computer and that’s no fun!  Defeats the whole point of having a phone in the first place, no?

First thought was writing a simple replacement launcher – more as a proof of concept, since in reality, I didn’t want to lose the ninja phone launcher.  Threw something together, installed it on the device, hit the home button and … nothing.  The default launcher stayed in place, no option to switch who responded to the home button.

Did some digging and it turned out the ninja folks had hard coded the home button to launch the component – making swapping out a launcher tough without removing their launcher.  As mentioned, this isn’t something that I wanted to do.

So I did some more poking around.  I didn’t want to replace any of their apps on the device, so I couldn’t easily repurpose one of their apps to launch into my launcher.

And I was reminded of the words of Socrates…

I dialed what?

Android allows you to intercept outgoing calls.  Instead of writing a replacement launcher, I could write an auxiliary launcher, or a side launcher.  This could be fun.  I threw something together so that going into the phone app and dialing “##cam” would launch the camera.  I liked that, so I added some more.  ##set gave me settings, ##play gave me the play store, and ##www gave me the browser (and later, chrome).

Obviously, ##ninja brought up fruit ninja🙂

Now, though, I was running out of time.  In an ideal world, I wanted to have something where a user could easily assign ## codes to any apps they had installed on the device.  Even better would be something that caught downloads and gave users the option to assign a dial code for them.  Alas, I didn’t get my ninja phone until Saturday afternoon – and while I enjoyed playing with it, that wasn’t the only thing that I was doing that day.  Even worse (well, not really ‘worse’, but you know what I mean), the time for the Ninja party was fast approaching, and I wanted to have something that I could show off there.

Luckily, the android sdk has, as one of its example apps, a simple launcher.  I grabbed that, and installed it onto the device and assigned it ##apps – this way I had a way to get at other things that I had added without having to rebuild the app.

And how…

So, without further ado – here’s (most of) the code for my ninja phone side launcher.  First off, there’s the relevant AndroidManifest.xml entries to set things up:

<?xml version="1.0" encoding="utf-8"?>
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    package="net.tapjam.sidelauncher">
    <uses-permission android:name="android.permission.PROCESS_OUTGOING_CALLS"/>
    <application android:label="@string/app_name" android:icon="@drawable/ic_launcher">

        <activity android:name=".SideLauncher">
            <intent-filter>
                <action android:name="android.intent.action.MAIN" />
                <category android:name="android.intent.category.HOME"/>
                <category android:name="android.intent.category.LAUNCHER" />
                <category android:name="android.intent.category.DEFAULT" />
            </intent-filter>
        </activity>
        <receiver android:exported="true" android:name=".KeypadLauncher" android:permission="android.permission.PROCESS_OUTGOING_CALLS">
            <intent-filter android:priority="0">
                <action android:name="android.intent.action.NEW_OUTGOING_CALL"/>
                <action android:name="android.intent.action.PHONE_STATE"/>
            </intent-filter>
        </receiver>
    </application>
</manifest>

And the actual receiver:

package net.tapjam.sidelauncher;

import android.content.BroadcastReceiver;
import android.content.ComponentName;
import android.content.Context;
import android.content.Intent;
import android.util.Log;

import java.util.HashMap;

/**
 * Created by IntelliJ IDEA.
 * User: dgarcia
 * Date: 7/28/12
 * Time: 7:12 PM
 * To change this template use File | Settings | File Templates.
 */
public class KeypadLauncher extends BroadcastReceiver {
    private static HashMap<String, String> numToPackage = new HashMap<String, String>();
    private static HashMap<String, String> numToActivity = new HashMap<String, String>();
    static { 
        numToPackage.put(getNumber("##cam"), "");
        numToPackage.put(getNumber("##set"), "");
        numToPackage.put(getNumber("##www"), "");
        numToActivity.put(getNumber("##www"), "");
        numToPackage.put(getNumber("##ninja"), "com.halfbrick.fruitninjafree");
        numToActivity.put(getNumber("##ninja"), "com.halfbrick.fruitninja.FruitNinjaActivity");
        numToPackage.put(getNumber("##apps"), "");
        numToPackage.put(getNumber("##mp3"), "");
        numToPackage.put(getNumber("##play"), "");
        numToActivity.put(getNumber("##play"), "");
    }

    private static String getNumber(String aString) { 
        String working = aString.toLowerCase();
        StringBuffer sb = new StringBuffer();

        for(char c : working.toCharArray()) {
            char out = c;
            switch(c) {
                case 'a': case 'b': case 'c': out='2'; break;
                case 'd': case 'e': case 'f': out='3'; break;
                case 'g': case 'h': case 'i': out='4'; break;
                case 'j': case 'k': case 'l': out='5'; break;
                case 'm': case 'n': case 'o': out='6'; break;
                case 'p': case 'q': case 'r': case 's': out='7'; break;
                case 't': case 'u': case 'v': out='8'; break;
                case 'w': case 'x': case 'y': case 'z': out='9'; break;
            }
            // digits and symbols such as '#' pass through unchanged
            sb.append(out);
        }
        // Log.d("SideLauncher", "Mapping " + aString + " to " + sb.toString());

        return sb.toString();
    }

    public void onReceive(Context context, Intent intent) {
        if (intent.getAction().equals(Intent.ACTION_NEW_OUTGOING_CALL)) {
            String number = intent.getStringExtra(Intent.EXTRA_PHONE_NUMBER);
            if (numToPackage.containsKey(number)) {
                try { 
                    String appPackage = numToPackage.get(number);
                    Intent i = new Intent("android.intent.action.MAIN");
                    if(numToActivity.containsKey(number)) { 
                        String component = numToActivity.get(number);
                        ComponentName cn = new ComponentName(appPackage,component);
                        // ...
                    }
                    // ...
                } catch (Exception e) { 
                    Log.e("SideLauncher", "Exception trying to launch app", e);
                }
            }
        }
    }
}

and then…

Now that I’m back from defcon, there’s the usual fires at the day job from having been out of the office for a few days, so I haven’t had a chance to come back to this device yet. However, I still want to make something that, using dial codes, will work for launching everything, including letting me add new launch codes without having to build a new version of the app.

Oh, and in case you were wondering – yes, there was LED playing with the DefCon 20 badge. More on that in a future post, however – as I got sidetracked by the phone😉

Another quick library update

Been quiet for a while – but just updated the library to include support for the LPD8806 chipset – which is what the strip currently being sold by Adafruit Industries uses.

As always, the code can be found at the googlecode project site.