Optimizing a spin-delay function?
Yeah, that hits close to home. We make a signal aggregator for telemetry that can capture up to 512 channels of data at 30 kS/s, so its cycle time is just over 33 microseconds. A lot of filtering and aggregation has to happen in that window, so we absolutely cannot rely on a fixed execution time or a constant execution path. I think it does use SIMD instructions on the Intel side, though. And yes, it has ASICs and FPGAs in it too.
One of the Easter eggs in it is that the error code it sends to the downstream recorder, in the event it falls behind and has to drop data, is a numeric "1202." I'm constantly amazed at how far ahead of its time the AGC seems to be.
And in general, you can usually make a much bigger difference by changing the approach at a higher level than by micro-optimizing every little piece of code. Choosing a more appropriate data structure or algorithm will beat any amount of time spent tuning a linear search of an unsorted list.
All the guys who work on that signal aggregator used to be games programmers. Some of them go back to the cartridge-game days, when everything was limited and you really had to know your stuff at every level. I love listening to these guys talk about all the innovative, crazy stuff they had to do back in the day.