Another thing to consider: there is often much more to gain by refining algorithms than there is by refining the implementation of a particular algorithm.
In general I agree with you, but I keep running into special cases where I really do have to optimize some small piece of heavily executed code. Maybe it's because I work in communications and do a lot of signal processing. Or maybe it's because I enjoy the challenge, so I tend to find problems like that.
I claim the record for the fastest Viterbi decoder in software, using vector instructions on both the PowerPC and Intel x86. I worked on that code for years, revisiting it every time Intel or AMD released a new version of SSE. I did this on my own for satellite and ham radio use, and the NOAA space weather group uses it in production to decode telemetry from the ACE and STEREO spacecraft.
Ten years ago at work, I learned of a group implementing a proprietary DCT-based HDTV codec in some rather expensive custom hardware. Because I had been doing a lot of DSP on desktop computers in ham radio, I noticed that the new Intel P4's SSE2 vector instruction set looked like it was intended for video compression -- because it was. So I casually remarked to my boss that I could probably do our codec entirely in software on these cheap, general purpose machines.
He called my bluff and killed the hardware project. I spent days staring at the same screen of x86 assembler code, trying to shave single clock cycles off my vectorized DCT. That one screen of code kept four CPUs pretty much busy. If I could get rid of just a few more clocks, I could drop that to three CPUs...
It used to be easy to predict how fast your program would run. You grabbed your dog-eared copy of the Intel or Zilog (remember them?) programmer's manual and added up the clock counts. Those days are long gone. On a modern CPU you can only run it and see, and the results can vary enormously from run to run. CPU clock speeds blew past those of RAM many years ago and that gap is only increasing, so there are two and often three layers of memory cache with zillions of model-dependent sizes and configurations.
Caching is responsible for much but not all of this unpredictability. Every modern CPU is pipelined out the wazoo because, in modern VLSI, a complex machine running at a low clock speed consumes much less power than a simple one running at high speed. Intel and AMD used to aggressively compete on clock speed above all else, and the resulting power and heat problems led them into a brick wall. Now they put 4, 8, or more CPU cores in a package and let us figure out how to use them all.
In pipelining, several instructions are in various stages of execution at once, just like an assembly line in a factory. Some machines have multiple execution units, their number and design highly model dependent, working on different segments of the instruction stream, looking for those that don't depend on others that haven't finished yet. These units often have to share certain CPU elements, waiting if necessary. Some instructions are "speculatively" executed before it's known if their results will be needed or even if the program flow would have reached them. If not, their results are thrown away.
Complex hardware interlocks ensure correct results regardless of the actual order of instruction execution. Somehow, it all magically behaves just like a single, very fast computer methodically stepping through instructions one by one. Except for timing.
Many things I expected to matter, didn't. Instruction count, clocks per instruction, and even some new instructions for cache management often made little or no difference. But I found one thing that consistently did, often dramatically: branch prediction. This is important because pipelining is so extensive. When a pipelined CPU hits a mispredicted branch, everything comes to a screeching halt as the pipeline is flushed and a new instruction stream is fetched from the new location. It's like shutting down an auto assembly line to switch to a different model. So modern CPUs have elaborate schemes to predict which branches will be taken, and I found that it pays to help them succeed.
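If you want to see the effect for yourself, here's a toy benchmark sketch (illustrative only; the exact numbers depend on your machine, and an aggressive optimizer may turn the branch into a conditional move and hide the effect entirely). It runs the same branchy loop first over random data, where the branch is a coin flip, and then over the same data sorted, where the branch becomes almost perfectly predictable:

    /* Toy demo: the same data-dependent branch, unpredictable vs predictable. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 20)

    static long branchy_sum(const int *v, int n)
    {
        long sum = 0;
        for (int i = 0; i < n; i++)
            if (v[i] >= 128)          /* taken about half the time on random bytes */
                sum += v[i];
        return sum;
    }

    static int cmp_int(const void *a, const void *b)
    {
        return *(const int *)a - *(const int *)b;
    }

    int main(void)
    {
        static int v[N];
        for (int i = 0; i < N; i++)
            v[i] = rand() & 255;      /* random values 0..255 */

        clock_t t0 = clock();
        long s1 = branchy_sum(v, N);  /* branch is effectively random */
        clock_t t1 = clock();

        qsort(v, N, sizeof v[0], cmp_int);

        clock_t t2 = clock();
        long s2 = branchy_sum(v, N);  /* same work, but now predictable */
        clock_t t3 = clock();

        printf("random: sum=%ld %.3fs  sorted: sum=%ld %.3fs\n",
               s1, (double)(t1 - t0) / CLOCKS_PER_SEC,
               s2, (double)(t3 - t2) / CLOCKS_PER_SEC);
        return 0;
    }

The two passes do exactly the same arithmetic; any difference you measure is the branch predictor at work.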
Ordinary loops are not a problem; you simply predict that the branch at the end is taken and you'll be right for every iteration but the last. But data-dependent branching is often effectively random, and that can kill performance. A simple sequence like
if(a == 5)
    b = 1;
else
    b = 0;
or equivalently,
b = (a == 5) ? 1 : 0;
can really slow things down if a is just as likely to be 5 as not. This is so common that "conditional assignment" instructions were introduced and all modern C compilers quickly used them. They emit code like
xorl %eax,%eax   ; set b to 0 (also clears the upper bits of %eax)
cmpl $5,%edi     ; compare a with 5
sete %al         ; set the low byte of b to 1 if the equal flag is set, 0 otherwise
which avoids the branch.
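The same idea generalizes to selecting between two arbitrary values. As a rough sketch (not compiler output, just an illustration of the idiom), you can build an all-ones or all-zeros mask from the comparison and blend with it, so there's no jump for the predictor to guess:

    /* Sketch: return x when a == 5, otherwise y, with no conditional jump.
     * -(a == 5) is all ones when the test is true and all zeros when false. */
    static unsigned select_if_eq5(unsigned a, unsigned x, unsigned y)
    {
        unsigned mask = -(unsigned)(a == 5);
        return (x & mask) | (y & ~mask);
    }

In practice a plain ternary usually gets you the same conditional-move code from any modern compiler, so reach for tricks like this only when the generated code says you need them.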
Branch prediction is so important and pervasive that it should be a consideration in algorithm selection. One that seems simpler can actually be slower if it has a lot of unpredictable branching. A good example from my field is sequential (Fano) vs Viterbi decoding, two algorithms for decoding convolutional error correction codes. Sequential decoding requires far fewer operations per bit than Viterbi, but many depend on previous data bits that by definition are unpredictable. (If you could predict your data, there'd be no need to transmit it.) Branches are often mispredicted and there's little opportunity for parallel execution.
Viterbi decoding requires many more operations per bit but there are no data-dependent branches. Its basic operations are simple and independent and can easily be executed in parallel on vector machines for an enormous (typically 10-15x) speedup over nonparallel code.
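To see why Viterbi fits this kind of hardware so well, here's a stripped-down sketch of its inner add-compare-select (ACS) step (simplified for illustration; a real decoder does this for every state, and the vector versions do many states at once):

    /* Sketch of one add-compare-select butterfly (minimum-metric convention).
     * m0, m1: path metrics of the two predecessor states.
     * b0, b1: branch metrics for the transitions into this state. */
    static inline void acs(unsigned m0, unsigned b0,
                           unsigned m1, unsigned b1,
                           unsigned *new_metric, unsigned *decision)
    {
        unsigned c0 = m0 + b0;           /* candidate path through predecessor 0 */
        unsigned c1 = m1 + b1;           /* candidate path through predecessor 1 */
        unsigned take1 = (c1 < c0);      /* 0 or 1; no data-dependent jump */
        *new_metric = take1 ? c1 : c0;   /* survivor; compiles to a conditional move or vector min */
        *decision = take1;               /* one bit saved for the traceback */
    }

Every butterfly in a given bit interval is independent of the others, which is what lets vector instructions do many of them at once.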
The bottom line is that yes, algorithm selection is important. The order of the algorithm, e.g., O(n²) vs O(n log n), often trumps everything else, especially when n is large. But in the real world, you often encounter cases where n is small and an O(n) algorithm is faster than an O(log n) algorithm because the former can take extensive advantage of hardware parallelism when the latter can't.
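A concrete (hypothetical) example of that last point: finding where a key belongs in a small sorted array. The branch-free O(n) scan below does more work on paper than an O(log n) binary search, but every step is an independent compare-and-add the compiler can vectorize, while each binary-search step is an unpredictable branch that depends on the one before it:

    /* Sketch: index where key would be inserted in a small sorted array.
     * Counting elements below the key gives the same answer a binary
     * search would, with no data-dependent branches at all. */
    static int lower_bound_small(const int *v, int n, int key)
    {
        int idx = 0;
        for (int i = 0; i < n; i++)
            idx += (v[i] < key);   /* comparison yields 0 or 1 */
        return idx;
    }

Past some n the log factor wins, of course; the crossover is something you measure, not something you can read off the big-O.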
This is what engineering is all about, the art of making tradeoffs.