Another thing to consider: there is often much more to gain by refining algorithms than there is by refining the implementation of a particular algorithm.
In general I agree with you, but I keep running into special cases where I really do have to optimize some small piece of heavily executed code. Maybe it's because I work in communications and do a lot of signal processing. Or maybe it's because I enjoy the challenge, so I tend to find problems like that.
I claim the record for the fastest Viterbi decoder in software, using vector instructions on both the PowerPC and Intel x86. I worked on that code for years, revisiting it every time Intel or AMD released a new version of SSE. I did this on my own for satellite and ham radio use, and the NOAA space weather group uses it in production to decode telemetry from the ACE and STEREO spacecraft.
Ten years ago at work, I learned of a group implementing a proprietary DCT-based HDTV codec in some rather expensive custom hardware. Because I had been doing a lot of DSP on desktop computers in ham radio, I noticed that the new Intel P4's SSE2 vector instruction set looked like it was intended for video compression -- because it was. So I casually remarked to my boss that I could probably do our codec entirely in software on these cheap, general purpose machines.
He called my bluff and killed the hardware project. I spent days staring at the same screen of x86 assembler code, trying to shave single clock cycles off my vectorized DCT. That one screen of code kept four CPUs pretty much busy. If I could get rid of just a few more clocks, I could drop that to three CPUs...
It used to be easy to predict how fast your program would run. You grabbed your dog-eared copy of the Intel or Zilog (remember them?) programmer's manual and added up the clock counts. Those days are long gone. On a modern CPU you can only run it and see, and the results can vary enormously from run to run. CPU clock speeds blew past those of RAM many years ago and that gap is only increasing, so there are two and often three layers of memory cache with zillions of model-dependent sizes and configurations.
Caching is responsible for much but not all of this unpredictability. Every modern CPU is pipelined out the wazoo because, in modern VLSI, a complex machine running at a low clock speed consumes much less power than a simple one running at high speed. Intel and AMD used to aggressively compete on clock speed above all else, and the resulting power and heat problems led them into a brick wall. Now they put 4, 8, or more CPU cores in a package and let us figure out how to use them all.
In pipelining, several instructions are in various stages of execution at once, just like an assembly line in a factory. Some machines have multiple execution units, their number and design highly model dependent, working on different segments of the instruction stream, looking for those that don't depend on others that haven't finished yet. These units often have to share certain CPU elements, waiting if necessary. Some instructions are "speculatively" executed before it's known if their results will be needed or even if the program flow would have reached them. If not, their results are thrown away.
Complex hardware interlocks ensure correct results regardless of the actual order of instruction execution. Somehow, it all magically behaves just like a single, very fast computer methodically stepping through instructions one by one. Except for timing.
Many things I expected to matter, didn't. Instruction count, clocks per instruction, and even some new instructions for cache management often made little or no difference. But I found one thing that consistently did, often dramatically: branch prediction. This is important because pipelining is so extensive. When a pipelined CPU hits a mispredicted branch, everything comes to a screeching halt as the pipeline is flushed and a new instruction stream is fetched from the new location. It's like shutting down an auto assembly line to switch to a different model. So modern CPUs have elaborate schemes to predict which branches will be taken, and I found that it pays to help them succeed.
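If you want to see the effect for yourself, here's a toy benchmark sketch (illustrative only; the exact numbers depend on your machine, and an aggressive optimizer may turn the branch into a conditional move and hide the effect entirely). It runs the same branchy loop first over random data, where the branch is a coin flip, and then over the same data sorted, where the branch becomes almost perfectly predictable:

    /* Toy demo: the same data-dependent branch, unpredictable vs predictable. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 20)

    static long branchy_sum(const int *v, int n)
    {
        long sum = 0;
        for (int i = 0; i < n; i++)
            if (v[i] >= 128)          /* taken about half the time on random bytes */
                sum += v[i];
        return sum;
    }

    static int cmp_int(const void *a, const void *b)
    {
        return *(const int *)a - *(const int *)b;
    }

    int main(void)
    {
        static int v[N];
        for (int i = 0; i < N; i++)
            v[i] = rand() & 255;      /* random values 0..255 */

        clock_t t0 = clock();
        long s1 = branchy_sum(v, N);  /* branch is effectively random */
        clock_t t1 = clock();

        qsort(v, N, sizeof v[0], cmp_int);

        clock_t t2 = clock();
        long s2 = branchy_sum(v, N);  /* same work, but now predictable */
        clock_t t3 = clock();

        printf("random: sum=%ld %.3fs  sorted: sum=%ld %.3fs\n",
               s1, (double)(t1 - t0) / CLOCKS_PER_SEC,
               s2, (double)(t3 - t2) / CLOCKS_PER_SEC);
        return 0;
    }

The two passes do exactly the same arithmetic; any difference you measure is the branch predictor at work.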
Ordinary loops are not a problem; you simply predict that the branch at the end is taken and you'll be right for every iteration but the last. But data-dependent branching is often effectively random, and that can kill performance. A simple sequence like
if(a == 5)
    b = 1;
else
    b = 0;
or equivalently,
b = (a == 5) ? 1 : 0;
can really slow things down if a is just as likely to be 5 as not. This is so common that "conditional assignment" instructions were introduced and all modern C compilers quickly used them. They emit code like
xorl %eax,%eax   ; set b to 0 (also clears the upper bits of %eax)
cmpl $5,%edi     ; compare a with 5
sete %al         ; set the low byte of b to 1 if the equal flag is set, 0 otherwise
which avoids the branch.
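The same idea generalizes to selecting between two arbitrary values. As a rough sketch (not compiler output, just an illustration of the idiom), you can build an all-ones or all-zeros mask from the comparison and blend with it, so there's no jump for the predictor to guess:

    /* Sketch: return x when a == 5, otherwise y, with no conditional jump.
     * -(a == 5) is all ones when the test is true and all zeros when false. */
    static unsigned select_if_eq5(unsigned a, unsigned x, unsigned y)
    {
        unsigned mask = -(unsigned)(a == 5);
        return (x & mask) | (y & ~mask);
    }

In practice a plain ternary usually gets you the same conditional-move code from any modern compiler, so reach for tricks like this only when the generated code says you need them.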
Branch prediction is so important and pervasive that it should be a consideration in algorithm selection. One that seems simpler can actually be slower if it has a lot of unpredictable branching. A good example from my field is sequential (Fano) vs Viterbi decoding, two algorithms for decoding convolutional error correction codes. Sequential decoding requires far fewer operations per bit than Viterbi, but many depend on previous data bits that by definition are unpredictable. (If you could predict your data, there'd be no need to transmit it.) Branches are often mispredicted and there's little opportunity for parallel execution.
Viterbi decoding requires many more operations per bit but there are no data-dependent branches. Its basic operations are simple and independent and can easily be executed in parallel on vector machines for an enormous (typically 10-15x) speedup over nonparallel code.
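To see why Viterbi fits this kind of hardware so well, here's a stripped-down sketch of its inner add-compare-select (ACS) step (simplified for illustration; a real decoder does this for every state, and the vector versions do many states at once):

    /* Sketch of one add-compare-select butterfly (minimum-metric convention).
     * m0, m1: path metrics of the two predecessor states.
     * b0, b1: branch metrics for the transitions into this state. */
    static inline void acs(unsigned m0, unsigned b0,
                           unsigned m1, unsigned b1,
                           unsigned *new_metric, unsigned *decision)
    {
        unsigned c0 = m0 + b0;           /* candidate path through predecessor 0 */
        unsigned c1 = m1 + b1;           /* candidate path through predecessor 1 */
        unsigned take1 = (c1 < c0);      /* 0 or 1; no data-dependent jump */
        *new_metric = take1 ? c1 : c0;   /* survivor; compiles to a conditional move or vector min */
        *decision = take1;               /* one bit saved for the traceback */
    }

Every butterfly in a given bit interval is independent of the others, which is what lets vector instructions do many of them at once.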
The bottom line is that yes, algorithm selection is important. The order of the algorithm, e.g., O(n²) vs O(n log n), often trumps everything else, especially when n is large. But in the real world, you often encounter cases where n is small and an O(n) algorithm is faster than an O(log n) algorithm because the former can take extensive advantage of hardware parallelism when the latter can't.
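A concrete (hypothetical) example of that last point: finding where a key belongs in a small sorted array. The branch-free O(n) scan below does more work on paper than an O(log n) binary search, but every step is an independent compare-and-add the compiler can vectorize, while each binary-search step is an unpredictable branch that depends on the one before it:

    /* Sketch: index where key would be inserted in a small sorted array.
     * Counting elements below the key gives the same answer a binary
     * search would, with no data-dependent branches at all. */
    static int lower_bound_small(const int *v, int n, int key)
    {
        int idx = 0;
        for (int i = 0; i < n; i++)
            idx += (v[i] < key);   /* comparison yields 0 or 1 */
        return idx;
    }

Past some n the log factor wins, of course; the crossover is something you measure, not something you can read off the big-O.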
This is what engineering is all about, the art of making tradeoffs.