11. How do computer languages work?

We've already discussed how programs are run. Every program ultimately has to execute as a stream of bytes that are instructions in your computer's machine language. But human beings don't deal with machine language very well; doing so has become a rare, black art even among hackers.

Almost all Unix code, except for a small amount of direct hardware-interface support in the kernel itself, is nowadays written in a high-level language. (The `high-level' in this term is a historical relic meant to distinguish these from `low-level' assembler languages, which are basically thin wrappers around machine code.)

There are several different kinds of high-level languages. In order to talk about these, you'll find it useful to bear in mind that the source code of a program (the human-created, editable version) has to go through some kind of translation into machine code that the machine can actually run.

11.1. Compiled languages

The most conventional kind of language is a compiled language. Compiled languages get translated into runnable files of binary machine code by a special program called (logically enough) a compiler. Once the binary has been generated, you can run it directly without looking at the source code again. (Most software is delivered as compiled binaries made from code you don't see.)

Compiled languages tend to give excellent performance and have the most complete access to the OS, but they also tend to be difficult to program in.

C, the language in which Unix itself is written, is by far the most important of these (with its variant C++). FORTRAN is another compiled language still used among engineers and scientists, but it dates from the 1950s and is much more primitive. In the Unix world no other compiled languages are in mainstream use. Outside it, COBOL is very widely used for financial and business software.

There used to be many other compiled languages, but most of them have either gone extinct or are strictly research tools. If you are a new Unix developer using a compiled language, it is overwhelmingly likely to be C or C++.

11.2. Interpreted languages

An interpreted language depends on an interpreter program that reads the source code and translates it on the fly into computations and system calls. The source has to be re-interpreted (and the interpreter present) each time the code is executed.
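To make the idea concrete, here is a minimal sketch of an interpreter, written in Python. The three-command language it runs (`set`, `add`, `print`) and the function name `interpret` are invented for this illustration and do not correspond to any real tool. Notice that the interpreter works directly from the source text, line by line, every time it is called.

```python
def interpret(source):
    """Run a toy language one line at a time, straight from the source text."""
    variables = {}
    output = []
    for line in source.splitlines():
        words = line.split()
        if not words:
            continue  # skip blank lines
        command, args = words[0], words[1:]
        if command == "set":        # set NAME NUMBER
            variables[args[0]] = int(args[1])
        elif command == "add":      # add NAME NUMBER
            variables[args[0]] += int(args[1])
        elif command == "print":    # print NAME
            output.append(variables[args[0]])
        else:
            raise ValueError(f"unknown command: {command}")
    return output


program = """
set x 40
add x 2
print x
"""
print(interpret(program))  # prints [42]
```

Nothing is saved between runs: calling `interpret(program)` a second time reads and decodes every line of the source all over again.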

Interpreted languages tend to be slower than compiled languages, and often have limited access to the underlying operating system and hardware. On the other hand, they tend to be easier to program and more forgiving of coding errors than compiled languages.

Many Unix utilities, including the shell, bc(1), sed(1), and awk(1), are effectively small interpreted languages. BASICs are usually interpreted. So is Tcl. Historically, the most important interpreted language has been LISP (a major improvement over most of its successors). Today, Unix shells and the Lisp that lives inside the Emacs editor are probably the most important pure interpreted languages.

11.3. P-code languages

Since 1990 a kind of hybrid language that uses both compilation and interpretation has become increasingly important. P-code languages are like compiled languages in that the source is translated to a compact binary form which is what you actually execute, but that form is not machine code. Instead it's pseudocode (or p-code), which is usually a lot simpler but more powerful than a real machine language. When you run the program, you interpret the p-code.
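The two stages can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any real p-code system is built: the opcodes `PUSH`, `ADD` and `MUL`, the functions `compile_rpn` and `run`, and the postfix source notation are all made up for the example. The first function translates source text into a compact string of bytes once; the second is a small interpreter that executes those bytes without ever seeing the source.

```python
# Hypothetical opcodes for a toy stack machine.
PUSH, ADD, MUL = 0, 1, 2


def compile_rpn(source):
    """Translate postfix source such as "2 3 + 4 *" into bytes of p-code."""
    code = bytearray()
    for token in source.split():
        if token == "+":
            code.append(ADD)
        elif token == "*":
            code.append(MUL)
        else:
            value = int(token)
            if not 0 <= value <= 255:
                raise ValueError("toy p-code only holds operands 0..255")
            code += bytes([PUSH, value])  # opcode followed by its operand
    return bytes(code)


def run(code):
    """Interpret the p-code; the original source text is not needed here."""
    stack = []
    pc = 0  # index of the next byte to execute
    while pc < len(code):
        op = code[pc]
        pc += 1
        if op == PUSH:
            stack.append(code[pc])
            pc += 1
        elif op == ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == MUL:
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            raise ValueError(f"bad opcode: {op}")
    return stack.pop()


pcode = compile_rpn("2 3 + 4 *")
print(pcode.hex())  # prints 0002000301000402
print(run(pcode))   # prints 20
```

The bytes in `pcode` could be written to a file and run later by anyone who has the `run` interpreter, which is roughly the arrangement real p-code languages use.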

P-code can run nearly as fast as a compiled binary (p-code interpreters can be made quite simple, small and speedy), yet p-code languages can keep the flexibility and power of a good interpreter.

Important p-code languages include Python, Perl, and Java.
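You can watch this happen in Python itself. The sketch below uses the built-in `compile()` and `eval()` functions; the expression `x * 7` and the filename label `<example>` are arbitrary choices for the illustration. In the standard CPython implementation, the compiled code object carries its p-code (usually called bytecode) in the `co_code` attribute; the exact bytes differ between Python versions, so they are not shown here.

```python
source = "x * 7"

# Stage 1: translate the source text into a code object holding p-code.
code = compile(source, "<example>", "eval")
print(type(code.co_code))  # the p-code is stored as a bytes object

# Stage 2: hand the compiled object to the interpreter, supplying a value for x.
result = eval(code, {"x": 6})
print(result)  # prints 42
```

The standard library's `dis` module can print a human-readable listing of the same p-code if you want to see the individual instructions.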