
Computer languages are the real Tower of Babel

What gods or daemons forced 920 different tongues on us?
Fri Sep 19 2003, 14:41
COMPUTERS HAVE been with us since the 1960s, and one would think that after nearly forty years the industry would have matured and that errors would be a thing of the past. We should not still be in a situation where "computer error" is an acceptable excuse for delays or problems.

In this article I want to look at how operating systems and languages developed and how many of the problems in IT can be traced to their deficiencies or just woolly thinking.

In the beginning of computing we had machine code, where technicians would manually set banks of binary switches to define how the processing on their machine was to operate. From this came "assembler", the rather cryptic but syntactically structured language that simplified the control of computers yet remained unique to each processor because it translated into machine code very directly; these were the 2GLs, or second-generation languages.

Next came the birth of 3GLs where interpreters and compilers could take our input, which used syntactic elements approaching normal language, and convert it into assembler or even machine code.

Computers were easier to program with these 3GLs and their use really started to explode. Very quickly it was found that the few languages in existence at the time did not suit everybody and so new languages were created.

To date there have been over 900 languages (see here) and yet, despite all the attempts to produce bug-free and reliable code, we are not much closer to that goal than we were in those early days.

In the early days, Cobol was the first general language for business applications and Fortran the first for technical applications. The grammatical rules of these languages often reflected the limited memory and processing power of the early hardware: Fortran, for example, limited variable names to a maximum of six characters, and both languages had a variety of techniques for using GO TO statements that we would never dream of using today.

Cobol used English words in a syntax that approximated their normal use. It was "wordy" but for people in a commercial or business environment it was relatively easy to learn and understand and its popularity increased.

The name Fortran was derived from FORmula TRANslation, which gives you a strong indication of its function. It had the grammatical elements, technical functions and routines that made it far more useful for scientific and technical applications than Cobol.

Somewhere about the mid-1970s BASIC started to become popular. In those days it was an interpreted language, so it was the interpreter that executed on the machine, but only the interpreter was machine specific and BASIC was really the first standard language, one where a program could be moved from one of the growing range of hardware platforms to another and be executed immediately.

The interesting point for me is that these early languages were geared towards simplicity of understanding. They used line-based statements and their language constructs made it relatively easy to look at the source code for a program and deduce what each statement actually did.

I recall being taught Fortran as part of my university course in 1976 and that same year, when a friend said that he was learning Basic, I took a look at it. Within two weeks of just a few hours each day, I had taught myself probably 95% of the language. It really was that easy.

In 1977, Vax computers were released and, with their VMS operating system, they popularised the use of virtual memory. Apart from the direct benefits to the execution of programs, this also enabled the relaxing of various constraints on the existing languages because code no longer had to be shoe-horned into as little physical memory as possible.

Through the latter half of the 1970s there was also growth in Unix and C. Unix came about because the educational establishments were offering courses in "computer studies" (as it was often called) but the IT companies were loath to open the workings of their operating systems to students.

The creation of Unix solved this problem and by omitting certain features used in the real world, it was a very serviceable tool for teaching purposes. Unix took away complexity by having a very simple context in which a program would execute, by using flat text files, by omitting asynchronous traps and by having a very limited access control system which only recognised users or "super-users".

The philosophy of the C language followed similar ideas, some as a direct influence of Unix and others of its own doing. It was a language with a simple structure, one that was useful for teaching things like the processing within compilers. I also have first-hand experience of this because in my university course we used a slightly simplified version of C as our base and then, over the course of about six small projects, we created our own compiler that would output assembler code for an Interdata 7/16 machine.

C also had simple internal data structures, the most notable of which was the use of a contiguous sequence of bytes, ending with a byte of zero value, for its character strings. This was a break from the character "descriptors" used on the VMS operating system, where a descriptor held a 2-byte word defining the length of the string and a pointer to the first of the contiguous bytes. Descriptors incurred a tiny additional cost in processing effort but the commercial users raised no complaints.
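To make the contrast concrete, here is a rough sketch in C. The descriptor structure is only an illustration of the idea; the field names are my own, not the actual VMS definition:

    #include <stdio.h>
    #include <string.h>

    /* A rough sketch of a VMS-style descriptor. The field names are
       illustrative only, not the real VMS layout; the point is that the
       length travels with the pointer, so no terminating byte and no
       scanning are needed. */
    struct descriptor {
        unsigned short length;    /* 2-byte count of characters */
        const char    *pointer;   /* address of the first byte  */
    };

    int main(void)
    {
        /* C's native form: a contiguous run of bytes ended by a zero byte.
           The length is stored nowhere and must be rediscovered by scanning. */
        const char *c_string = "HELLO";        /* six bytes: 'H','E','L','L','O', 0 */
        printf("strlen scans to find %u characters\n", (unsigned)strlen(c_string));

        struct descriptor d = { 5, "HELLO" };  /* the length is carried explicitly */
        printf("the descriptor already knows %u characters\n", (unsigned)d.length);
        return 0;
    }

The descriptor costs a couple of extra bytes and a little housekeeping, but the length can never silently disagree with the data.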

One of the fundamentals of C was a belief in Edsger Dijkstra's statement that the GO TO statement was harmful. His seminal paper, "Go To Statement Considered Harmful", was published in 1968.

The title was not of his doing but came from the editor of the journal in which it first appeared. The text of the article makes it clear that he was not opposed to the GO TO statement on the basis of harm, but only because its use was an obstacle to his notion that we should be able to define the exact point that processing has reached at any given moment, so that we can repeat the process up to that point.

His entire argument against the GO TO statement was that "it becomes terribly hard to find a meaningful set of coordinates in which to describe the process progress." He declared that it was much easier to define a specific location in the processing by reference to the procedures, conditionals and loops that are defined in the source code.

His argument only holds water if his notion of mapping each processing step holds any validity and in my view that is doubtful. The positional reference that we use in debuggers is a line number within a routine or a breakpoint on the change of a specific value and these seem to be quite adequate. Dijkstra's desire to be able to stop processing at some specific code line is made difficult when multiple operations appear in the one source code statement or when an optimising compiler has removed the explicit execution of that line.

Despite its questionable validity this notion of banning GO TOs became a sacred chant to the Unix and C people. If you asked most of them to explain why this should happen, few could get beyond "Because they're bad" and those that did often stopped at "It creates 'spaghetti' code". I can only recall one "believer" ever declaring that GO TOs were an obstruction to the simple definition of a location in the code.

Even today many developers - and not just in C - prefer to write block-structured conditional code which might run to many, many lines and gain indent after indent as new conditionals and loops are added. This can be confusing enough, but it is quite common to find that a conditional statement has a one-line "else" clause tucked away in an easily-missed location.
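A contrived fragment in C shows the sort of thing I mean; the logic is meaningless and invented purely for illustration, but notice how easily the one-line else clauses are overlooked:

    #include <stdio.h>

    /* Invented logic, purely to show the shape of deeply nested conditionals. */
    int classify(int code, int flags, int retries)
    {
        int result = 0;

        if (code > 0) {
            if (flags & 0x01) {
                if (retries < 3) {
                    result = 1;
                } else {
                    if (flags & 0x02) {
                        result = 2;
                    } else result = 3;   /* one-line else, easy to miss */
                }
            } else
                result = -1;             /* and which 'if' does this one belong to? */
        }
        return result;
    }

    int main(void)
    {
        printf("%d\n", classify(1, 3, 5));   /* prints 2 */
        return 0;
    }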

Of course with good code layout it is possible to avoid both the over-use of GO TO and the over-use of indents, and that brings us to the matter of writing source code which is easy to understand.

In my early years with Fortran I was encouraged to write code using variable names that were indicative of their function and plenty of comments that were easy to find and to read. Statement numbers were also to be kept in sequential order, even if that meant renumbering existing statements in order to add a new one. I was also taught to avoid creating routines of more than 120 lines, because the fanfold 132-column paper in common use had 60 lines per sheet and 120 lines was all that could easily be viewed at one time.

These ideas, added to the syntactic elements of Fortran (or Basic) and the line-based structure, meant that it was easy, in theory at least, for other people to look at my code and have a reasonable idea of what was happening even if they had no great skills in the language.

In a single step C changed all this, because it was biased towards Dijkstra's misguided notions and because it was designed primarily for use with Unix in educational institutes.

C introduced statements that required an explicit terminator and could run to as many or as few physical lines as the writer desired. The block structures that C used to replace GO TOs were easy to over-use and the code easily became convoluted. Certainly with Fortran and Basic it was possible to write code that was very difficult to understand, but it was a rare day that code in those languages was more convoluted than badly written C.

C also moved away from the English language by introducing obscure symbols for the pre- and post-increment operators in the form of ++x and x++. It introduced array indexing that started at 0 instead of 1 as in other languages, calling these "offsets" rather than "element numbers". It introduced pointers, with their own special syntax for declarations, and it introduced new symbols or combinations of symbols for use in conditional statements and in operations on fields of bits.
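A small self-contained example of the notation in question; the values are arbitrary and chosen only to show what each symbol does:

    #include <stdio.h>

    int main(void)
    {
        int values[3] = { 10, 20, 30 };     /* indexing starts at 0: values[0] to values[2] */
        int i = 0;

        int a = i++;                        /* post-increment: a gets 0, then i becomes 1 */
        int b = ++i;                        /* pre-increment: i becomes 2 first, so b gets 2 */

        int *p = &values[1];                /* a pointer, with its own declaration syntax */
        unsigned bits = 0x05 & 0x03;        /* bitwise AND on a field of bits: result is 1 */
        int both = (a != b) && (bits != 0); /* symbol pairs where older languages used words */

        printf("%d %d %d %u %d\n", a, b, *p, bits, both);   /* prints 0 2 20 1 1 */
        return 0;
    }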

The use of C seriously reduced the ability for the non-expert to read the code and have a reasonable understanding of the processing algorithm.

It went further than that, because many C programmers wrote such convoluted source code that even they had trouble understanding it themselves. I recall a competition for writing obscure C code back in about 1980 in which the winning entry was just three physical lines of nested functions, conditionals and blocks. That winning entry was so good that five months later not even its author could decipher the processing.

Cobol, Fortran and Basic required considerable effort to write code, but for all that effort they had a syntax that was easy to comprehend and a defined layout, one which encouraged the use of comments. For at least some of these languages the compilers were very complex because the language syntax demanded it. In contrast, C added new symbols and statements that used words inconsistent with English, two factors which made it more difficult to understand. C also had various "shorthand" methods which made the code easier to write, and the syntax and data structures of C meant that its compilers were relatively simple.

This was a fundamental change in the approach to computer programming and, I believe, one that has handicapped us since that time.

In my opinion there was a shift from languages which were slow to write but relatively easy to comprehend to a language which was faster to write but more difficult to comprehend. In the former most of the complexity was in the data structures and the compiler but in C that complexity was shifted towards the source code and the compiler and data structures were simpler.

C was just the first popular block-structured language to endorse Dijkstra's credo, and a whole raft of languages followed, some of which were special-purpose languages for such applications as Artificial Intelligence, graphics and text processing.

More recently we have seen the introduction of special languages such as Perl and, specifically for the web, HTML, Javascript and Java. With its platform independence, made possible by its intermediate language, Java has almost turned us full circle, back to how Basic was in the mid-1970s.

The total number of languages now stands at about 920, each presumably invented either to suit the requirements of a particular machine and its compiler or because the existing languages were not ideal for the kind of processing that the new language would describe.

A few months ago I saw a comment that Microsoft were working on a new language called F# (F-sharp). Their intention appears to be to reduce the propensity for bugs in their .NET environment, a laudable enough notion but I suspect it is the same rationale as for many of the languages that have preceded it.

The problem is that we are not dealing with bugs in the syntax of the code, because these are easily identified by a compiler. The bugs that really matter are either where mis-typed source code is valid within the context of the program, or where faulty logic was used in the algorithm and errors result because the software failed to handle certain data conditions.
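The textbook example of the first kind in C is a single mistyped character that the compiler accepts without complaint:

    #include <stdio.h>

    int main(void)
    {
        int errors = 0;

        if (errors = 1)              /* intended: errors == 1; this assigns 1 and is always true */
            printf("always taken\n");

        if (errors == 0)             /* the comparison that was actually meant */
            printf("never reached\n");

        return 0;
    }

The statement is perfectly valid C; only a human reading the code, or a compiler warning if one happens to be switched on, will spot that the wrong thing is being tested.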

These should not occur but they are just about inevitable when software is difficult to understand. The number of "unhandled exception" errors, buffer overflows and memory leaks that are found by ordinary users is testament to failures in logic or in coding, and probably to a failure to understand the code or to test it thoroughly enough to discover its deficiencies. In the longer term these problems make maintenance and support far more complex and far more error-prone than they should be.
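A minimal sketch of the buffer overflow pattern, using nothing beyond the standard library; the function and the names are invented for illustration:

    #include <stdio.h>
    #include <string.h>

    /* The destination is a fixed-size array with no stored length, so nothing
       in the language stops a copy from running past its end. */
    static void store_name(const char *input)
    {
        char name[8];                /* room for seven characters plus the zero byte */
        strcpy(name, input);         /* copies until it finds a zero byte, however far away that is */
        printf("stored: %s\n", name);
    }

    int main(void)
    {
        store_name("short");         /* fits comfortably */
        /* store_name("a much longer visitor name"); would silently write past the
           end of the array, corrupting whatever happens to sit beside it in memory. */
        return 0;
    }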

It appears that no software solution will be found to correct the current problems. Despite efforts which I know date back to about 1980, it is still not possible to mathematically validate anything but the simplest programs. For the foreseeable future, human involvement will be necessary in the validation process, but when it is handicapped by poor source code we cannot expect much.

What this industry desperately needs is a new language, one that makes it easier to understand the source code and thus easier to avoid errors. With processor speeds now at least 2,000 times faster than the hardware of the mid-1970s, we can forget about any burden that we place on compilers and make that part of the business as complex as we like.

In the light of recent viruses it might also be worthwhile to create an interim solution to address the pressing problem of poor data structures, allowing us to take our time over that new language. Something as simple as adding descriptors to C ("New C"?), and modifying the compilers accordingly, would mean very few changes to existing source code but would give some protection against the buffer overflow problem.
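Purely by way of illustration, such a descriptor-carrying string might look something like the sketch below. No such extension to C exists and the names are my own invention; the point is simply that a string which knows its own capacity can refuse to be overrun:

    #include <string.h>

    struct str_desc {
        unsigned short capacity;     /* bytes available in the buffer */
        unsigned short length;       /* bytes currently in use        */
        char          *data;         /* the characters themselves     */
    };

    /* A bounded copy: because the destination carries its own capacity,
       the copy can reject over-long input instead of trusting the caller. */
    static int desc_copy(struct str_desc *dst, const char *src)
    {
        size_t n = strlen(src);
        if (n > dst->capacity)
            return -1;               /* refuse, rather than overflow */
        memcpy(dst->data, src, n);
        dst->length = (unsigned short)n;
        return 0;
    }

    int main(void)
    {
        char buffer[16];
        struct str_desc name = { sizeof buffer, 0, buffer };
        return desc_copy(&name, "HELLO");    /* succeeds; an over-long string would be refused */
    }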

One thing is for sure: while hardware has improved in speed and capability since the 1970s, our ability to create quality software and to understand its operation has changed very little. Until that changes we stand little chance of escaping the stream of patches and corrected software releases that are foisted upon us. µ

Further reading
A Brief History of Programming Languages
Programming Languages

 
