1. Design Concepts of the Intermediate Representations in COINS

1.1. Preface

There are two intermediate representations in COINS, namely, High level intermediate representation HIR and low level intermediate representation LIR. Following documents give the general view of them: The IR is intended to be the super class of HIR and LIR. The interface of symbol table information supplements them. All compiler modules are linked with each other via these intermediate representations.
Above outline documents describe the design policies and intentions, but do not give the specifications of individual methods. Readers are asked to refer to the interface specifications for that purpose as explained later.

1.2. Design policy of the intermediate representations

1.2.1. Overall policy

1.2.1.1. The role of intermediate representations
In general, a compiler transforms a source program to an object program. The transformation is done in several phases such as syntax analysis, code optimization, code generation, etc. In this process, the source program is transformed to some intermediate representation that is easier to handle by compiler modules than source program. Ad hoc representations corresponding to combinations of neighboring modules will make the compiler complicated and difficult to be expanded. If we can settle an intermediate representation as an interface common between all modules of the compiler, then the compiler can be simplified and made easy to be expanded.
COINS is the compiler infrastructure that is intended to be expanded to build various compilers by adding or modifying compiler modules. It is intended to build compilers for many languages and also compilers for many machines. It is also intended to build compilers generating good object code and compilers having distinguished characteristics such as parallelization, etc.
How to define an intermediate representation that fits to such requirements is the problem.
1.2.1.2. Requirements for intermediate representations
In compiling a source program into object code, there is a sequence of transformations which are carried out by compiler modules such as syntax analyzer, flow analyzer, optimizer, parallelizer, register allocator, code generator, and so on. By adding or replacing some of them we can construct various compilers. If we can define an intermediate representation that has the role of common interface between compiler modules, then it will become easy to add, delete, or replace compiler modules to construct a new compiler. To realize such interface, requirements for the intermediate representation are as follows: As for COINS, more specifically speaking, are required.

1.2.2. The Levels of the Intermediate Representations

There might be several levels in defining intermediate representation, from high level near to source language to low level near to machine language.
In the course of program transformation, there are several processings that are suitable to be designed in a level near source language, and also there are several other processings that are suitable to be designed in a level near machine language.
Processings suitable for the level near source language are and so on.
Processings suitable for the level near machine language are and so on.
If we define an intermediate representation at only one level, then some processings are suitable for it but there will be some of other processings that are not suitable for it. It is better to define high level intermediate representation and low level intermediate representation. There might be an opinion that defining two levels of intermediate representation causes to increase the number of transformations in the course of compilation and makes it impossible to build one-pass compiler. It is true but one-pass compilers can not generate good code and it will be permissible to exclude such compilers from the covering range of COINS. Practical compilers have many passes to produce good code, and so, it will be permissible to insert a pass to convert intermediate representation from high level to low level.
There might be another opinion that two levels of intermediate representation requires two set of modules to manage intermediate representations and causes to increase the amount of development. The size of modules to manage intermediate representations will not be so large and its development is done only once by implementers of the infrastructure and if the complexity of some compiler modules decreases, then the total amount of development will be decreased since such decrease will be seen in many compilers developed by using the infrastructure, and it means that many users of the infrastructure will have benefit.
There will be another choice to define more than two levels of intermediate representations. Intermediate representation may have many attributes that are filled or deleted in the process of compilation and some fields may be replaced with different kind of items. For example, an operand of an expression may be changed from source program variable to abstract register and then changed to physical register without changing basic structure of the expression. This means that one intermediate representation can cover many phases of compiling process and requests for more than two levels of intermediate representation will not be so strong.
It may be necessary to define more than two levels of intermediate representations for some complicated languages. For example, C language has many complicated operators and assignment statement embedded in expressions, therefore it may be better to have another intermediate representation near C language. For another example, object oriented languages require complicated treatment of objects and methods, therefore it may be better to have another intermediate representation suitable to object oriented languages.
In designing intermediate representations of COINS, we decided to have two levels of intermediate representations, that is, high level intermediate representation HIR and
low level intermediate representation LIR
as ones applicable to all cases. HIR is an intermediate representation that can cover analysis and transformation on the level of many imperative languages. LIR is an intermediate representation that can cover analysis and transformation on the level of many target machines.
As for other levels of intermediate representation, it will be decided to be introduced only when its necessity become quite obvious for some language groups or for some architecture groups.

1.2.3. Treatment of dependencies

It is desirable that intermediate representations are independent of languages and machines. If it is possible, then many compiler modules can be shared between many compilers corresponding to various combinations of source languages and target machines, except for language analyzers corresponding to specific language and last phases of code generation corresponding to specific target machine. In rigorous meaning, it will be very difficult.
Some language dependencies will remain in intermediate representations. For example, a string constant may contain language specific mark or size information to delimit the sequence of characters contained in the string. Some machine dependencies will remain in intermediate representations. For example, the size of integer number may be 32 bits or 64 bits depending on target machine even for the same source program.
It can be noticed that such differences are one of variations for each kind of language items. We can define the grammar of an intermediate representation as one independent of languages and machines leaving possibility of language/machine dependencies for some symbols used in the representation. In terms of programming languages, the interface specifications are completely independent of source languages and target machines but their field values may have some dependencies on languages and machines, that is, the specifications of the intermediate representation is independent of languages and machines but its instance may have dependencies.
In this approach, most compiler modules can be implemented almost independent of languages and machines because the interface specifications of the intermediate representation are independent of languages and machines. Some small portions of compiler modules may have dependencies on languages or machines to handle the contents of fields having dependency. In this approach, the intermediate representation generated for some specific machine can not be applied to generate machine code of other architecture having different characteristics because the instance of the intermediate representation may have dependency on the target machine. It is the point that differs from the bytecode of Java.
We take above approach and do not take the approach of Java bytecode because above approach will be suitable for generating highly efficient object code compared to the approach of the bytecode.

1.2.4. IR, HIR, LIR

1.2.4.1. IR
In the first stage of COINS development, HIR and LIR are designed to be a subclass of an intermediate representation named IR in order to make some modules common between HIR and LIR.
    IR _____ HIR
        |
        |___ LIR
Actually, the first version of LIR was implemented as the subclass of IR and a flow analyzer has been built to be able to analyze both HIR and LIR. The second version of code generator has been implemented based on new implementation of LIR, although its basic design is the same to that of LIR version 1.
At present, the new LIR is not yet made a subset of IR, and so, the hierarchy of IR is
    IR _____ HIR

             LIR
It is a future work to make it possible to build modules that are applicable both to HIR and LIR.
1.2.4.2. HIR
If we pursuit what is essential in programming languages, we can get the image of an elementary programming language that is common to many imperative languages. HIR is based on such elementary language which contains most of high level language concepts common between imperative languages. The document
Outline of High level Intermediate Representation HIR
describes about the elementary language.
Too much decomposition of high level language concepts will make the intermediate representation not suitable for program analyses based on high level language concepts. Considering it, in HIR, array element expressions are not decomposed to pointer expressions. Compound expressions such as compound assignment in C are decomposed into elementary expressions because such expressions requires special treatment for C language in data flow analysis, etc. that is desirable to be language independent.
It will be necessary to expand current specifications of HIR when we want to apply COINS to languages based on different concepts. For example, object oriented languages may require special considerations in compilation. How to treat them in HIR is left as a future problem.
1.2.4.3. LIR
LIR is designed to make many compiler components common between various machines. Code optimizer is one of major components in practical compilers. For the code optimization, efficient use of registers is very important. Considering it, registers should be explicitly represented in LIR. In the bytecode of Java, portability of object code is a major concern and does not represent registers explicitly. In COINS, importance is attached to the code optimization and applicability to many machines is pursuit by increasing retargetability.
LIR is, there for, based on a model of abstract register machine. In this model, computation is carried on by evaluating expressions and setting results to variables. A variable has its value in memory or register. Expressions may stand for a constant, a variable, or an operation of arbitrary complexity using expressions as their operands. There are control operations such as jump, conditional branch, and call. Most computers have such elementary operations as add, subtract, multiply, divide, shift, compare, and so on. Operators in LIR represent such operations.
While elementary operations are similar between machines, the mechanism of call/return and procedures of entry/exit for subprograms differ greatly machine by machine. To deal with such variety, call, return, prolog, epilog operators are provided in LIR and their expansion to machine instruction sequence is left as the work of retargeting.
In currently prevailing architectures, data are represented as binary integer or floating point number. Character code and boolean value can be represented as a binary integer of small size. LIR representation of data values takes these forms. The length of data is represented by the number of bits required to represent it. The address of memory is represented in byte of 8 bit length. Data may be a scalar or an aggregate. For example, 64 bits data may represent 64 bit integer or 4 element array of 16 bit integer. In LIR, aggregate data is distinguished from scalar data.
LIR is defined as a complete low level programming language. It can be expressed in textual form as well as in internal representation suitable for processing by compiler modules. Usually, a program in LIR is constructed by the HIR-to-LIR translator but it can be input to or output from the compiler in textual form.

1.2.5. Retargeting

There will appear various kind of computer architectures. COINS has retargeting feature so that it can be applied to new computer architecture easily. COINS generates code generator for a new machine based on specifications named Target Machine Description (TMD). The code generator looks for the best fit combination of LIR subtree pattern and corresponding target machine code sequence to generate machine code.
There are many passes of LIR transformations such as optimizers and register allocator. In this process of transformation, the fields of LIR will change according to the target machine but the grammar of LIR is the same during these transformations. For example, an operand may be changed from variable to abstract register and then real register of the target machine keeping the structure defined by the LIR grammar. LIR subtrees representing an assignment statement will be decomposed into a sequence of small subtrees each of which corresponds to actual instruction of the target machine keeping the structure defined by the LIR grammar. This means that most of backend modules can be constructed machine independently except for the last phase that generates assembly language program of the target machine. Thus, control flow graph constructor, SSA optimizer, register allocator, instruction scheduler, etc. are made to be common between all target machines.
In many cases, code generators for new machines generating fairly good code have been developed in 3 to 6 months.

1.3. Symbol table

There are such symbols as variables, subprograms, and others in a program. Since symbols in character strings are not easy to handle, they are registered in the symbol table and are handled as symbol table entries in intermediate representations. Attributes of a symbol are attached to its entry in the symbol table. Many of these attributes are decided by declarative statements and the structure of the program, or set at the time of syntax analysis and semantic analysis.
In LIR, attributes of symbols are rather simple and their declarations are contained in LIR. In HIR, attributes of symbols are complex reflecting the complexity of declaration statements.
HIR does not hold declaration statements of symbols but attributes of symbols are retained in the symbol table. The symbol table of HIR retains information not only of variables and subprogram but also of constants, types, and all other symbols, including compiler generated temporary variables and labels, and these symbols are referred to during compilation. It may be thought that constants need not be included in the symbol table. A constant is, however, not always represented in one word, and the precision and the range a constant represents vary among different compiling environments and target machines. Complexity of compiler development will be a little lightened by including constants in the symbol table.
Symbols may have attributes that depend on source language or target machine. Large portions of compiler modules, however, do not distinguish such dependencies.
By hiding such dependencies to the symbol table, greater commonality of the compiler modules could be realized.

1.4. Links to detailed specifications

There are several other documents that explain internal representations as it is mentioned in Preface. The outline documents listed up in Preface do not explain details necessary for coding modules to be integrated in COINS. For that purpose, it is necessary to read interfaces of HIR, etc. The recommended order of reading documents is as follows.

(1) This document
(2) Other outline documents
(3) The Java interfaces for HIR
(4) The Java interfaces which are the subclasses of above Java interfaces

Interfaces that describes detailed specifications of HIR:
    HIR.java     interface specifications for HIR
    Sym.java     interface specifications for the symbol table
    Type.java    interface specifications for the types 
    Flow.java    interface specifications for the flow information
    BBlock.java  interface specifications for the basic blocks
                 to be used in flow analysis
    MachineParam.java    class for machine dependent information 
    SourceLanguage.java  class for source language dependent information
In developing modules using HIR, users need only to consult the interface specifications to understand, and need not to go down to the implementation details. In consulting the interface specifications, it is suggested to first consult the specifications in the uppermost classes listed above. Only when further specifics are required, user may look for appropriate lower classes. In other words, it is strongly recommended to try first and make most use of the applicable methods in the uppermost classes, and not to start from a subclass. If one start reading from lower subclasses, he/she may be misguided to wrong applications of methods due to the lack of the knowledge of underlying conventions.
The classes listed below are prepared for overall control of the compiler. They may be referred to in the intermediate representation specifications.
      Driver      compiler path control
      IoRoot      links to input/output related information
      SymRoot     links to symbol table related matters
      HirRoot     links to HIR information 
      LirRoot     links to LIR information
      FlowRoot    links to flow information
      Message     message output control
      Debug       debug output control

1.5. Example

In this chapter, intermediate representations for a small program are shown to give a concrete image for them. Explanation of each item in them will be given in further documents listed above and in interface modules.

1.5.1. Source program

/* tpsum1.c -- Summation */

  int  printf(char*, ...);
  void getData(int x[100]);
  int a[100];
  int sum, i;

int main()
{
  getData(a);
  sum = 0;
  for (i = 0; i < 100; i++) {
    sum = sum + a[i];
  }
  printf("sum = %d\n", sum);
  return 0;
}

1.5.2. HIR

 (prog     1
  <null 0 void>
  <nullNode 2>
  (subpDef  3 void
   <subp     4 <SUBP < )> false false int> main>
  <null 0 void>
   (labeldSt 5 void
    (list 6
     <labelDef 7 _lab1>)
    (block    8 void line 8
     (expStmt  9 void line 10
      (call     10 void
       (addr     11 <PTR <SUBP < <PTR int> )> false false void>>
        <subp     12 <SUBP < <PTR int> )> false false void> getData>)
       (list 13
        (decay    14 <PTR int>
         <var      15 <VECT 100 0 int> a>))))
     (assign   16 int line 11
      <var      17 int sum>
      <const    18 int 0>)
     (for      19 void line 12
      (block    20 void
       (assign   21 int
        <var      22 int i>
        <const    23 int 0>))
     <null 0 void>
      (labeldSt 24 bool
       (list 25
        <labelDef 26 _lab5>)
       (expStmt  27 bool
        (cmpLt    28 bool
         <var      29 int i>
         <const    30 int 100>)))
      (labeldSt 31 void
       (list 32
        <labelDef 33 _lab6>)
       (block    34 void
        (assign   35 int line 13
         <var      36 int sum>
         (add      37 int
          <var      38 int sum>
          (subs     39 int
           <var      40 <VECT 100 0 int> a>
           <var      41 int i>)))
        (labeldSt 42 void
         (list 43
          <labelDef 44 _lab3>)
        <null 0 void>)))
      (expStmt  45 int
      <null 0 void>)
      (block    46 void
       (assign   47 int
        <var      48 int i>
        (add      49 int
         <var      50 int i>
         <const    51 int 1>)))
      (labeldSt 52 void
       (list 53
        <labelDef 54 _lab4>)
      <null 0 void>))
     (expStmt  55 int line 15
      (call     56 int
       (addr     57 <PTR <SUBP < <PTR char> )> true false int>>
        <subp     58 <SUBP < <PTR char> )> true false int> printf>)
       (list 59
        (decay    60 <PTR char>
         <const    61 <VECT 10 0 char> "sum = %d\n">)
        <var      62 int sum>)))
     (return   63 int line 16
      <const    64 int 0>))))
 )

1.5.3. LIR

(MODULE "tpsum1.c"
  (SYMTAB
    ("main" STATIC UNKNOWN 4 ".text" XDEF &syminfo "main" "tpsum1.c" 8 0)
    ("i" STATIC I32 4 ".bss" XDEF &syminfo "i" "tpsum1.c" 6 0)
    ("sum" STATIC I32 4 ".bss" XDEF &syminfo "sum" "tpsum1.c" 6 0)
    ("a" STATIC A3200 4 ".bss" XDEF &syminfo "a" "tpsum1.c" 5 0)
    ("getData" STATIC UNKNOWN 4 ".text" XDEF &syminfo "getData" "tpsum1.c" 19 0)
    ("printf" STATIC UNKNOWN 4 ".text" XREF &syminfo "printf" "tpsum1.c" 3 0)
    ("string.5" STATIC UNKNOWN 1 ".text" LDEF &syminfo "const_13" "" 0 0))
  (DATA "string.5" (I8 115 117 109 32 61 32 37 100 10 0))
  (FUNCTION "main"
    (SYMTAB
      ("returnvalue.3" FRAME I32 4 0)
      ("functionvalue.4" FRAME I32 4 0))
    (PROLOGUE (0 0))
   (DEFLABEL "_lab1")
    (CALL (STATIC I32 "getData") ((STATIC I32 "a")) () &id ("getData" 0))
    (SET I32 (MEM I32 (STATIC I32 "sum") &id ("sum" 1)) (INTCONST I32 0))
    (SET I32 (MEM I32 (STATIC I32 "i") &id ("i" 2)) (INTCONST I32 0))
   (DEFLABEL "_lab5")
   (DEFLABEL "_lab6")
    (SET I32 (MEM I32 (STATIC I32 "sum") &id ("sum" 7))
     (ADD I32 (MEM I32 (STATIC I32 "sum") &id ("sum" 4)) 
              (MEM I32 (ADD I32 (STATIC I32 "a") 
               (MUL I32 (MEM I32 (STATIC I32 "i") &id ("i" 5)) (INTCONST I32 4))) &id 6)))
   (DEFLABEL "_lab3")
    (SET I32 (MEM I32 (STATIC I32 "i") &id ("i" 9))
     (ADD I32 (MEM I32 (STATIC I32 "i") &id ("i" 8)) (INTCONST I32 1)))
    (JUMP (LABEL I32 "_lab5"))
   (DEFLABEL "_lab4")
    (CALL (STATIC I32 "printf") 
     ((STATIC I32 "string.5") (MEM I32 (STATIC I32 "sum") &id ("sum" 10))) 
     ((MEM I32 (FRAME I32 "functionvalue.4") &id ("functionvalue.4" 11))) &id ("printf" 12))
    (SET I32 (MEM I32 (FRAME I32 "returnvalue.3") &id ("returnvalue.3" 13))
             (INTCONST I32 0))
    (JUMP (LABEL I32 "_epilogue"))
   (DEFLABEL "_epilogue")
    (EPILOGUE (0 0) (MEM I32 (FRAME I32 "returnvalue.3") 
      &id ("returnvalue.3" 14))))
  (DATA "sum" (SPACE 4))
  (DATA "i" (SPACE 4))
  (DATA "a" (SPACE 400)))

1.5.4. Symbol table

SymTable Root

  subp    printf <SUBP < <PTR char> )> true false int> File tpsum1.c line 3 (flags 6) extern
  subp    getData <SUBP < <PTR int> )> false false void> File tpsum1.c line 19 (flags 6) public
  type    <VECT 100 0 int> (flags 12) size 400 elemCount 100 lowerBound 0
  var     a 0 <VECT 100 0 int> File tpsum1.c line 5 (flags 6) public  static
  var     sum 0 int File tpsum1.c line 6 (flags 9) public  static
  var     i 0 int File tpsum1.c line 6 (flags 9) public  static
  subp    main <SUBP < )> false false int> File tpsum1.c line 8 callList list 0 public
  type    <VECT 10 0 char> (flags 12) size 10 elemCount 10 lowerBound 0
  type    <PTR 100 0 int> size 4

SymTable main parent SymTable Root subp main

  label   _lab1 kind 1 unq:main__lab1 (flags 2)
  label   _lab2 kind 15 unq:main__lab2 (flags 2)
  label   _lab3 kind 8 unq:main__lab3 in main (flags 2)
  label   _lab4 kind 21 unq:main__lab4 in main (flags 2)
  label   _lab5 kind 6 unq:main__lab5 (flags 2)
  label   _lab6 kind 7 unq:main__lab6 (flags 2)

SymTable Constant

  intC    4 int unq:const_1
  intC    1 int unq:const_2
  intC    0 int unq:const_5
  intC    100 int unq:const_11
  intC    400 int unq:const_12
  stringC "sum = %d\n" <VECT 10 0 char> unq:const_13 (flags 6) length 10
  intC    10 int unq:const_14