Thanks to Bing and Wikipedia, I found this code, which is meant to be a very portable, byte-level implementation. That's just what you want for a microcontroller. I tried it out first on my x86 machine using the demo given on the site. It worked great. Then I moved the code over to the MSP430. It didn't need any modification for CCS to compile it, and I could call the AES functions just like the demo code did. Looking at the buffers while debugging confirmed they were working great.
I was curious just how much time the encrypt and decrypt functions were taking on the MSP430 (specifically the MSP430G2553). CCS doesn't do code profiling for the MSP430 but it can count CPU cycles (while debugging go to Run->Clock->whatever). The problem with that is that it seriously slows down debugging and can take forever for very long routines. I had it counting with the encrypt function for many minutes and it still hadn't finished. For a few lines at a time the breakpoint overhead is about the same, so you won't really notice, and it does work well. After a little searching, one solution that seems obvious in hindsight was to use the built-in timer to count the cycles. Start the timer before the routine, stop it right after, and then check the register.
The MSP430 has, in general, a 16-bit timer, so it can count 2^16 or 65536 ticks before it rolls over. (I was really disappointed when I noticed my car's odometer had rolled past that point by a couple hundred miles. That would have been a great picture.) If you're sure your code takes less time than that, the register value is probably a close measure of how long it takes. If the timer overflows, then the register value alone is no good. I suspected overflow, so I had the timer's ISR count the overflows. That adds some additional latency, making the count inaccurate, but I wanted a ballpark estimate, not something exact. Sure enough, the timer was overflowing. It was overflowing 15 times! That's for either encryption or decryption. Fifteen overflows is about 983,000 cycles, so at 1 MHz the darn thing was taking about 1 full second! No freaking way I could use that for a radio packet encryption tool.
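The arithmetic behind that estimate can be sketched in a few lines of C. This is a minimal, host-runnable sketch rather than the code from the original test: the names total_cycles and cycles_to_us are made up for illustration, and it assumes the timer is clocked at the same rate as the CPU with no divider.

```c
#include <stdint.h>

/* One overflow of a 16-bit timer represents 2^16 = 65536 counts. */
#define TIMER_PERIOD 65536UL

/* Combine the overflow count kept by the ISR with the value left in the
 * timer register when the timer was stopped. (Hypothetical helper.) */
uint32_t total_cycles(uint16_t overflows, uint16_t timer_count)
{
    return (uint32_t)overflows * TIMER_PERIOD + timer_count;
}

/* Convert a cycle count to microseconds for a given clock frequency in Hz.
 * The 64-bit intermediate keeps the multiplication from overflowing. */
uint32_t cycles_to_us(uint32_t cycles, uint32_t clock_hz)
{
    return (uint32_t)(((uint64_t)cycles * 1000000ULL) / clock_hz);
}
```

With 15 overflows and a 1 MHz clock, total_cycles(15, 0) comes to 983,040 cycles, which cycles_to_us turns into 983,040 microseconds, or roughly 0.98 seconds. The ISR latency mentioned above is not modeled here, so treat the result as a ballpark.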
Going back to the CCS cycle counter, if the function takes 1 second and I was waiting over 3 minutes for CCS to count its cycles, that means there's a huge overhead involved. Like a couple of orders of magnitude. For things that take a few hundred cycles, that's fine. Debugging overhead is high anyway and you might not notice. A million cycles, though, and it's not worth it. If CCS seems stuck while counting cycles on a function you know doesn't take long, that's a sign the counter overhead is too high and you should probably use the timer instead.
The AES code has two ways of performing encryption and decryption. The first generates some needed numbers using functions and the other uses a couple of big look-up tables. The trade-off is processing time versus memory consumption. Everything above was tested using function calls. I switched the code to use the look-up tables and the program needed an additional 236 bytes of program space. Not terrible, but keep in mind the MSP430Gx series has at most 16 kB of flash and some parts have as little as 2 kB. That number could also change with optimization. When I timed it again using look-up tables there were zero overflows. The encryption and decryption functions had a speedup of about 50x! They were taking around 20 ms when running at 1 MHz. That's a little more manageable. I'm sure there's room for improvement, too. The compiler is also nice enough to give me some "Infos" on where code could run faster if it were rewritten a little. Not today, though.
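To make the trade-off concrete, here is a small sketch of the same idea on one AES building block: multiplication in GF(2^8). It is not taken from the AES code I used. The names gf_mul, mul9_table, mul9_table_init and mul9_lookup are made up for illustration, and the choice of the constant 9 is just an example. The computed version loops over the bits of one operand on every call, while the table version pays 256 bytes of memory to replace that loop with a single array read.

```c
#include <stdint.h>

/* Multiply two bytes in GF(2^8) using the AES reduction polynomial
 * x^8 + x^4 + x^3 + x + 1 (0x11B). Shift-and-add, one bit of b per pass. */
uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t product = 0;
    while (b) {
        if (b & 1) {
            product ^= a;                 /* add (XOR) current multiple of a */
        }
        /* Double a, reducing modulo the polynomial if the top bit falls off. */
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
        b >>= 1;
    }
    return product;
}

/* Table of x * 9 for every byte x. Filled at run time here to keep the
 * sketch short; on a microcontroller it would more likely be a const
 * array stored in flash. */
static uint8_t mul9_table[256];

void mul9_table_init(void)
{
    for (unsigned i = 0; i < 256; i++) {
        mul9_table[i] = gf_mul((uint8_t)i, 0x09);
    }
}

/* Same result as gf_mul(x, 0x09), but a single memory read. */
uint8_t mul9_lookup(uint8_t x)
{
    return mul9_table[x];
}
```

After calling mul9_table_init() once, mul9_lookup(x) returns the same byte as gf_mul(x, 0x09) for every x. How much time this saves on a real part depends on the compiler and on how many such operations each block needs, so the 50x figure above should not be read off this toy example.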
So, in summary, look-up tables can definitely be worth the additional memory cost. That's been known forever, but it was a real eye opener to learn that lesson personally. I wouldn't say I learned it the hard way because someone already wrote those tables for me. Thanks to Ilya O. Levin and Hal Finney for the AES code they wrote. Also, it's no wonder they put AES into instruction sets and make hardware accelerators. It can take a lot of cycles. If you look in the MSP430x5xx and MSP430x6xx Family User's Guide, for example, it says that its AES accelerator (for 128-bit keys) takes as few as 167 MCLK cycles. So using look-up tables was about 50 times faster than without (roughly 20,000 cycles instead of nearly a million), and a hardware accelerator at 167 cycles would be another couple of orders of magnitude faster than that. Wow.
And that's one more thing...
(Yeah, it's not on an image. But it's fun anyway)