Between Lexics and Syntax: Whitespace, Layout, Comment Conventions

Robert D. Cameron
January 14, 2002

Fixed Format Languages

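Early programming languages, such as FORTRAN 66, were fixed-format: they specified fixed character positions at which particular components of program statements and declarations were to occur. The basic conventions of FORTRAN 66 are as follows.

  1. A C in column 1 marks the whole line as a comment.
  2. Columns 1 to 5 hold an optional numeric statement label.
  3. A character other than blank or zero in column 6 marks the line as a continuation of the previous statement.
  4. Columns 7 to 72 hold the statement itself.
  5. Columns 73 to 80 are ignored by the compiler and were typically used for card sequence numbers.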

BNF Grammars and Implicit Whitespace

Normally, BNF grammars of programming languages are written without explicit description of the language rules for comments or whitespace. Whitespace refers to the spacing and line break characters that separate programming language tokens. A general rule followed by many languages is that whitespace or comments may be included between any two tokens, but are only necessary between adjacent alphanumeric tokens (such as keywords, identifiers and numbers), which would otherwise run together into a single token. Languages that freely allow such whitespace insertion are called free format languages.
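
The rule can be illustrated with a small tokenizer sketch in Python. It is not taken from any particular language: the token classes and the `//` end-of-line comment syntax are assumptions chosen for the example. Whitespace and comments are recognized and then discarded, so the grammar never sees them.

```python
import re

# Hypothetical token classes for a small free-format language.
TOKEN = re.compile(r"""
    (?P<space>\s+)                 # whitespace: recognized, then skipped
  | (?P<comment>//[^\n]*)          # end-of-line comment: skipped
  | (?P<word>[A-Za-z_0-9]+)        # alphanumeric token (keyword, identifier, number)
  | (?P<punct>[^\sA-Za-z_0-9])     # any other single character
""", re.VERBOSE)


def tokens(source):
    """Return the tokens of source, dropping whitespace and comments."""
    return [m.group() for m in TOKEN.finditer(source)
            if m.lastgroup in ("word", "punct")]


# Whitespace is optional around punctuation ...
assert tokens("x=y+1") == tokens("x = y + 1  // add one")
# ... but necessary between adjacent alphanumeric tokens.
assert tokens("int x") == ["int", "x"]
assert tokens("intx") == ["intx"]
```

The last two assertions show why the space is required only between alphanumeric tokens: without it, the tokenizer has no way to tell where one token ends and the next begins.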

Whitespace is Significant

FORTRAN treated blanks as completely insignificant in the body of a statement. This could sometimes be convenient, e.g., to make numbers and variable names more readable.

Two Equivalent Forms
      HIVAL = 14713678923        
      HI VAL = 14 713 678 923    

However, ignoring blanks can sometimes cause problems. For example, consider the looping statement:

      DO 3 I = 1,3

Suppose this were mistakenly coded with a period instead of a comma.

Two Equivalent Forms
      DO 3 I = 1.3    
      DO3I = 1.3      

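Because blanks are ignored and the statement contains no comma, both forms would be interpreted as an assignment of the value 1.3 to a variable named DO3I.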

Unfortunately, exactly this kind of error reportedly occurred in the control program for a Mariner spacecraft to Venus, resulting in its loss (Annals of the History of Computing, 1984, 6(1), page 6).
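
The two readings of the DO statement can be reproduced with a short Python sketch. It is only an approximation of what a FORTRAN compiler did: the function names squeeze and classify are invented here, and the pattern covers just the simple DO form shown above, ignoring character constants and optional step values.

```python
import re


def squeeze(statement):
    """Remove every blank from a statement body, as blank-insensitive FORTRAN did."""
    return statement.replace(" ", "")


# DO <label> <variable> = <start> , <limit>   (matched after blanks are removed)
DO_LOOP = re.compile(r"DO(\d+)([A-Z][A-Z0-9]*)=([^,]+),(.+)$")


def classify(statement):
    """Classify a statement as a DO loop or an assignment (simplified sketch)."""
    text = squeeze(statement)
    if DO_LOOP.match(text):
        return "DO loop"
    if "=" in text:
        return "assignment to " + text.split("=", 1)[0]
    return "other"


assert squeeze("HI VAL = 14 713 678 923") == squeeze("HIVAL = 14713678923")
assert classify("DO 3 I = 1,3") == "DO loop"
assert classify("DO 3 I = 1.3") == "assignment to DO3I"
```

Only the comma distinguishes the loop from the assignment, so a single mistyped character changes the meaning of the statement without producing a syntax error.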

Comment Conventions

Programming languages have two basic conventions for comments.

  1. End-of-line comments. Comments continue from a comment begin token until the end of the input line.
  2. Bracketed comments. Comments are enclosed in "comment brackets" (e.g., /* and */) and may span multiple lines.

When bracketed comments are used, it is important to be aware of potential errors if the final delimiter is mistakenly omitted. In this case, the compiler treats everything up to the closing delimiter of the next comment as comment text, so language statements may be skipped without any error report.
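
The effect of an omitted closing delimiter can be seen in a minimal comment stripper, sketched here in Python. The function name strip_block_comments is hypothetical, the delimiters are assumed to be /* and */ without nesting, and string literals are not handled.

```python
def strip_block_comments(source):
    """Remove /* ... */ comments from source (no nesting, no string literals)."""
    out, pos = [], 0
    while True:
        start = source.find("/*", pos)
        if start == -1:                  # no more comments
            out.append(source[pos:])
            break
        out.append(source[pos:start])    # keep the code before the comment
        end = source.find("*/", start + 2)
        if end == -1:                    # never closed: rest of input is comment
            break
        pos = end + 2
    return "".join(out)


intended = "a = 1; /* set a */ b = 2; /* set b */ c = 3;"
mistaken = "a = 1; /* set a    b = 2; /* set b */ c = 3;"

assert "b = 2" in strip_block_comments(intended)
assert "b = 2" not in strip_block_comments(mistaken)   # silently swallowed
assert "c = 3" in strip_block_comments(mistaken)
```

In the mistaken version the first comment simply runs on to the closing delimiter of the second one. The result is still well-formed code, which is why no error is reported.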

Syntactic Comment Positioning

Although most languages allow comments to appear anywhere between tokens, it may be useful to define conventional syntactic positions for certain kinds of comments. For example, Java documentation comments appear immediately before class, interface, method, constructor or field declarations and serve to document those declarations. With conventions about the contents of these comments, documentation can be automatically generated.
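
As a rough illustration of how positional conventions enable automatic documentation, the following Python sketch pairs each /** ... */ comment in a Java-like source text with the declaration that immediately follows it. It is not how the real Javadoc tool works; the regular expression, the function name extract_docs and the sample class are assumptions made for the example.

```python
import re

# A documentation comment followed by the start of a declaration
# (everything up to the first '{', ';' or line break).
DOC_BEFORE_DECL = re.compile(
    r"/\*\*(?P<doc>.*?)\*/\s*(?P<decl>[^\n{;]+)",
    re.DOTALL,
)


def extract_docs(java_source):
    """Map each documented declaration to the text of its comment."""
    docs = {}
    for m in DOC_BEFORE_DECL.finditer(java_source):
        # drop the leading '*' that conventionally decorates each comment line
        lines = [ln.strip().lstrip("*").strip()
                 for ln in m.group("doc").splitlines()]
        docs[m.group("decl").strip()] = " ".join(ln for ln in lines if ln)
    return docs


sample = """
/** A point on the x axis. */
public class Point {
    /** Horizontal coordinate. */
    private int x;

    /**
     * Returns the distance from the origin.
     */
    public int norm() { return Math.abs(x); }
}
"""

for declaration, text in extract_docs(sample).items():
    print(declaration, "->", text)
```

Because the comment's position identifies the declaration it documents, the extractor needs no further markup to connect the two.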

Intelligent Layout: The Offside Rule

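Miranda and Haskell are functional programming languages that follow the offside rule of Landin: every token of a syntactic object (such as a definition or an expression) must be to the right of, or directly below, its first token. A token that appears further to the left is "offside" and ends the object.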

Python has adapted this concept in its indentation rules, under which the indentation of a statement determines the block to which it belongs.
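
One common way to implement layout-based block structure is to convert changes in indentation into explicit begin and end markers before parsing. The Python sketch below shows the idea under simplifying assumptions: only spaces count as indentation, blank lines are ignored, and the function name layout_tokens and the marker names INDENT and DEDENT are chosen for the example.

```python
def layout_tokens(source):
    """Turn leading indentation into INDENT/DEDENT markers (spaces only)."""
    stack = [0]      # indentation widths of the enclosing blocks
    result = []
    for line in source.splitlines():
        if not line.strip():
            continue                     # blank lines do not affect layout
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:            # deeper: a new block begins
            stack.append(width)
            result.append("INDENT")
        while width < stack[-1]:         # shallower: close enclosing blocks
            stack.pop()
            result.append("DEDENT")
        if width != stack[-1]:
            raise ValueError("inconsistent indentation: " + line.strip())
        result.append(line.strip())
    result.extend("DEDENT" for _ in stack[1:])   # close blocks still open
    return result


program = "if x:\n    y = 1\n    if z:\n        w = 2\nv = 3\n"
print(layout_tokens(program))
```

Once the markers are in place, the rest of the grammar can treat INDENT and DEDENT like ordinary block brackets, so the layout rule stays out of the BNF.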