4. Outline of the Symbol Table

4.1. Overall structure

4.1.1. Components of the symbol table

The symbol table retains information about various symbols including those symbols which appear in the source program, such as variables, subprograms, types and constants, and those generated by the compiler, such as temporary variables and labels. Distinctions are made among symbols of the same name of different scopes, and thus a new symbol table for a subprogram which opens a new scope is created to retain the symbols which are defined in the new scope:
    Symbol table (SymTable) of scope A:
        The table which retains the symbols defined in scope A
    Symbol table (SymTable) of scope B:
        The table which retains the symbols defined in scope B
    ....
The symbol table as a whole takes the form of tree of symbol tables where the tree is composed of global symbol tables and local symbol tables. The global symbol table retains symbols that are common between subprograms in a compile unit. The local symbol table retains symbols that has local scope such as subprogram, block, etc. For each scope having symbols there is a symbol table corresponding to the scope. The local symbol table corresponding to inner scope is a child of the symbol table corresponding to the outer scope.
Thus a tree of symbol tables is constructed reflecting the nested structure of scopes. Constants are registered to a global symbol table named symTableConst so that constants can be shared within the compile unit. Other global symbols are registered to a global symbol table named symTableRoot.
[[
The tree structure of symbol tables in C language is depicted as follows:
symTableConst    Constants in the source program and those generated 
                 by a compiler. 
symTableRoot     Built-in symbols in HIR (such as basic types) and 
   |             symbols globally defined.
   |- symTable A    Symbols defined locally in subprogram A, and 
   |  |             symbols generated by the compiler in subprogram A 
   |  |             excluding constants.
   |  |- symTable A_b1    Symbols defined locally in block b1 in subprogram A.
   |     |- symTable A_b1_2   Symbols defined locally in block b1_2 in block b1
   |            ....
   |- symTable B    Symbols defined locally in subprogram B, and 
   |  |             symbols generated by the compiler in subprogram B 
   |  |             excluding constants.
   |     ....
]]
Symbols are registered in the symbol table by calling factory methods var(), subp(), label() and others in the interface Sym. Use of "new" is not totally prohibited, but it is discouraged, and at using "new" to create directly an entry to the symbol table, full understanding of conventions to do so is expected. Otherwise malfunctions in the interrelationships are highly likely to occur.
Information concerning types, scopes and others are much simplified in LIR, so a separate LIR symbol table is created which holds information needed in the back end only. This document is dedicated to describe the symbol table for HIR.

4.1.2. The relationship between symbol table and compiler components

In HIR, variables, subprograms, labels and others are represented, generally, by their references which indicate where in the symbol table they are registered. Access methods are provided for to retrieve the type, relationship with other symbols, and other information of various kinds about each entry through these references.
Symbols such as variables which appear in data flow equations and so forth are also given unique identification numbers. Methods to obtain the reference from the identification number and vice versa are prepared.

4.1.3 Hierarchy of symbols

A symbol (Sym) is, when it is actually created, an instance of one of the subclasses such as variable (Var), subprogram (Subp), type (Type), constant (Const), label (Label) and others. Some of these subclasses have their subclasses. The hierarchy of the classes is shown below:
   Sym // Symbol class (super class of all symbol classes).
     | //  Symbols in the same scope are contained in the same
     | //  symbol table (instance of SymTable).
     |
     |- OperandSym // Symbol that can be an operand
     |   |         //   of executable expressions.
     |   |
     |   |- Var     // Variable that can be assigned a value
     |   |   |      //  and variable with const attribute.
     |   |   |- Param // Formal parameter class.
     |   |   |- Elem  // Class for structure element and
     |   |            //  union element.
     |   |
     |   |- Const // Constant class.
     |   |   |- IntConst    // Integer constant.
     |   |   |- FloatConst  // Floating point constant.
     |   |   |- CharConst   // Character constant.
     |   |   |- StringConst // String constant.
     |   |   |- BoolConst   // Boolean constant.
     |   |   |- NamedConst  // Named constant including
     |   |                  //  enumeration constant (other than bool).
     |   |- Label  // Statement label class.
     |
     |- Subp  // Subprogram class (procedures, functions, etc.).
     |
     |- Type  // Type information class.
     |    |- BaseType    // Base types intrinsic
     |    |              //  to many programming languages.
     |    |- VectorType   // Vector (1 dimensional array) type.
     |    |- EnumType     // Enumeration type.
     |    |- PtrType      // Pointer type.
     |    |- StructType   // Structure type.
     |    |- UnionType    // Union type.
     |    |- SubpType     // Subprogram type.
     |    |- RegionType   // Region type to define storage area 
     |    |               //   shared between subprograms.
     |    |- DefinedType  // Types defined in a program (by typedef).
     |
     |- ExpId  // Expression identifier
               // (generated to be used in data flow analysis, etc.).
Note.
Sym is actually an interface corresponding to the class SymImpl implementing it. In the same way, Var is an interface corresponding to the class VarImpl implementing it, and so on. Users (compiler builders) are requires only to see interfaces and it is not necessary to see corresponding implementation classes. It is because to separate interface and implementation class so that changes of implementation do not affect interface. In this document, sometimes Sym, etc. are treated in the same way as classes because it is a little cumbersome to make such distinction in many sentences.

4.2. Major items in symbol tables

Major items to be registered in the symbol tables are listed up in this section and major fields for each item are also listed up in each subsection showing corresponding access methods.

4.2.1. Accessing symbol table items

All information stored in the symbol table is referred to or set only through access methods provided for Sym or its subclasses. But one access method is not necessarily prepared for one real field. To keep consistency between fields, one access method might set values to multiple fields, or an access method in a class might refer a field of another class to calculate a value. Therefore if users access actual fields directly, then it is likely to misuse the fields and destroys consistency. For details of the access methods, refer to Sym interface, Type interface and their lower level interfaces.

4.2.2. Fields common to symbols of all kinds (Sym)

  Name (getName()).
  The kind of the symbol ( variable, subprogram, etc.; get(SymKind()).
  Type (obtainable by using getSymtype() which returns null  
        for symbols without type. "void" is a type and not null).
  The place of definition (the line where the symbol is first declared in 
        the source program; getDefinedLine(), getDefinedFile()).
  The unique name  (the unique name within the whole of the compile unit.
        Where there are symbols with the same name, a compiler generates names 
        to avoid name duplication; getUniqueName() which comes to be in effect 
        after setUniqueNameToAllSym() of SymTable is called.).
  Phase specific information (links to information for specific processes; 
        getWork()).

4.2.3. Variable (Var)

  Initial value expression (getInitialValue()).
  Size of storage allocated to the variable (getSize()).

(3-1) Independent variable (variable which is neither formal parameter
      nor element of aggregate)
  Visibility (extern, public, private, etc. ) attribute (getVisibility()).
  Storage area (auto, static, register) attribute (getStorageClass()).

(3-2) Formal parameter (Param)
  The index number in a paremeter list (getParamIndex()).
  The distinction whether a call-by-value or a call-by-reference.

(3-3) Element (of structure or union) (Elem)
       --- created for each element of a structure or a union.
  Displacement (it can be expressions. evaluateDisp()).
  The structure or the union which include this element (getUpperType()).
  Whether or not a bit field, and if so the size in bits. 

4.2.4. Subprogram (Subp)

  Type of return value (getReturnValueType()).
  List of formal parameters (getParamList()).
  The existence of variable length argument list (getOptionalParam()).
  The symbol table (getSymTable()).
  The HIR subtree representing subprogram body (getSubpDefinition()).
  The list of defined labels (getLabelDeList()).
  The list of subprograms called (getCallList()).
  Detected error count (getErrorCount()).

4.2.5. Type (Type)

  Size in bytes (getSizeValue()).
  Expression to calculate the size (getSizeExp()).
  Whether or not this is qualified as volatile (is Volatile()).
  Alignment value (getAlignment()).
  Origin-type (getOrigin()).

(5-1) Basic type (BaseType)
  Basic type code (represents which type it is) (getTypeKind()).

(5-2) Vector (VectorType)
  Type of elements (getElemType()).
  Number of elements (getElemCount()).
  Expression to calculate the number of elements (getElemCountExp()).
  The lower bound of the index (getLowerBound()).
  Whether or not a rectangular array of fixed length (isRectangularArray()).
Note:
Array is represented as a vector or vector of arrays. Arrays are the major targets for parallelization and optimization. In general, an array is defined as a vector which is a one dimensional array, or a vector of vectors, and so on. They take several forms as listed below: It is advisable to make distinctions among these forms of arrays for easy recognition of the targets of a parallelization and an optimization.
(5-3) Enumeration (EnumType)
    Ordinal number of the enumerator (getEnumSeqNumber()).

(5-4) Pointer (PtrType)
  The type of the pointed object (getPointedType()).
  
(5-5) Structure (StructType)
  The list of the elements (getElemList()).
  
(5-6) Union (UnionType)
  The list of the elements (getElemList())
  
(5-7) Subprogram (SubpType)
  The list of the types of the formal parameters (getParamTypeList())
  The existence of variable length argument (hasOptionalParam())
  The type of the return value (getReturnType())

4.2.6. Constant

(6-1) Integer constant (IntConst)
The textual representation obtained by getName() has a suffix making distinctions among types (U for unsigned, L for long, and LL for long long). An integer constant has internal representation of long int in Java (shortValue(), intValue(), longValue()). If the value is outside of the range that the Java long int can represent, isOutOfValueRange() returns true, and the internal representation is faulty.

(6-2) Floating point constant (FloatConst)
A floating point constant obtained by getName() has a suffix making distinctions among types (F for float and D for double). A floating point constant has internal representation of a double in Java (floatValue(), doubleValue()). If the value is outside of the range that the Java double can represent, isOutOfValueRange() returns true, and the internal representation is faulty.

(6-3) Character constants (CharConst)
A character constant is denoted with enclosing quotation marks ('). The internal representation of a character constant is corresponding character code of type int.

(6-4) Character string constant (StringConst)
Depending on source languages, a character string is denoted with enclosing marks (such as quotation marks) or escape characters added. We call the string of characters without these extras as "pure sequence of characters" of the string. The internal representation of a character string in HIR takes form of a pure sequence of characters enclosed with double quotes ("). The method makeStringBody changes a character string constant in source language to the pure sequence of characters, and makeCstring and others to change the pure sequence of characters to a character string constant in the source language. These methods are prepared in SourceLanguage class. When printing the intermediate representation, the printing format of Java, the compiler description language, should be followed.
  Pure sequence of characters (getStringBody()).
  Number of characters (getLength()).

4.2.7. Label (Label)

  Link to the statement having this label (getHirPosition()).
  Link to a basic block whose first statement has this label (getBBlock())

4.2.8. RegionType

RegionType is an aggregate which defines storage area commonly used, like COMMON in Fortran, among compile units. In concept, it is like a union with its elements defined in each one of subprograms. Since not all subprograms resides within one compile unit, a region type stays incomplete, until link takes place where all elements are present and linked. At that time the region type comes to be complete. Because of this, a region type is represented not by listing up all its elements as in the case of unions, but by its region name (which is similar to the tags attached to a struct or a union).
A region type has, as its attribute, a list of pairs of a subprogram and a symbol table. Here, the symbol table consists of declaration of variables which form the common storage area in the corresponding subprogram. Within a subprogram, a region type is treated much the same way as a struct. This means elements of a region type is treated much the same way as elements of a struct. For details of the related methods, refer to regionType method in Sym interface and methods defined in RegionType interface.

4.3. Usage of the symbol table

4.3.1. How to build the symbol table

First, as a part of initialization, symTableRoot to store symbols global to the whole of a compiling unit, and symTableConst to store constants are created. These are refered as
  symRoot.symTableRoot and
  symRoot.symTableConst
respectively.
To create a symbol table to store symbols used locally in a subprogram, pushSymTable is to be used (a push operation). At the end of a subprogram, popSymTable is to be used (a pop operation). At the start of a new scope (such as a block including declarations) within a subprogram, pushSymTable is to be used to create a symbol table to store symbols used locally within the scope. At the place of departure from that scope, popSymTable is to be executed.
If a symbol table, say A1, is created by a push operation under a symbol table A, A1 becomes a child of A. Assume that a symbol table A11 is created by a push under A1, and another symbol table A12 is created by a push operation after A11 is popped. The symbol tables A11 and A12, created under the same symbol table A1, turn to be siblings. If A2 is pushed after A12 and then A1 are popped, A1 and A2 turn to be siblings. In this case, the structure of the symbol table is as follows:
     A --- A1 --- A11
       |       |- A12
       |- A2
By repeating the above processes for all subprograms in a compiling unit, the tree of symbol tables for the compile unit is built. The methods to be used to traverse the symbol table tree are as follows:
   getParent      get the parent symbol table.
   getFirstChild  get the first child symbol table.
   getBrother     get the next sibling symbol table created following "this"  
                  under the same parent.
A symbol table has its owner (getOwner()). The owner of a symbol table that stores local variables and others of a subprogram is the subprogram itself. Usually, more symbol tables will be created under it. The owner of symTableRoot and symTableConst is null.

4.3.2 Symbol table initialization

At first, the compiler driver creates an instance (symRoot) of the class SymRoot which stores information related to symbol tables, and then calls symRoot.initiate() in order to initialize symbol tables.
Two symbol tables are created at the initialization, one is symTableRoot to store symbols common to whole of the compile unit and the other is symTableConst to store constants. In the process of initialization, built-in symbols such as basic types of HIR are stored in symTableRoot. At the start of (the outer most) subprogram, a symbol table to hold the symbols defined locally in the subprogram is created under symTableRoot.
In symRoot, apart from symTableRoot and symConst, two more symbol table references below exist:
   symTableCurrent      the symbol table of the inner most scope 
                        currently in process.
   symTableCurrentSubp  the symbol table of the subprogram currently in process.
symTableCurrent and symTableCurrentSubp must be appropriately establish for use in flow analysis, optimization and parallelization. If they are not correctly set, methods related to symbol table handling may not work correctly. These symbol tables can be referred to using:
    symRoot.symTableRoot
    symRoot.symTableCurrent
    symRoot.symTableCurrentSubp
Note:
Example of procedure to prepare symbol table is shown in
    examples/SimpleMain.java
of the distribution file. This example will help to know how to build symbol table as well as how to write compilers using COINS compiler infrastructure.

4.3.3 Creating symbol tables

To create a symbol which is to be stored in a symbol table, it is strongly recommended to use factory methods provided for the purpose, such as
    symRoot.sym.defineVar(variable name, type)
    symRoot.sym.defineSubp(subprogram name, type)
    symRoot.sym.defineLabel(label name)
    .... 
For details, refer to Sym interface. It is not recommended to use "new" directly because such use will lead to programming errors.
When creating a temporary variable or an internal label for optimization etc., following methods are available:
    symRoot.symTableCurrentSubp.generateVar(type name)
    symRoot.symTableCurrentSubp.generateLabel()
For details, refer to SymTable interface.

4.3.4 Iterator

Symbols in symTableA can be referenced to by first
     SymIterator lSymIterator = symTableA.getSymIterator();
then repeat
     lSymIterator.next()
If
     lSymIterator.nextVar()
is used instead of next(), only variables (Var, Param, Elem) are retrieved one after another, all other symbols are skipped. If Iterator is not used, statements below can be made use of,
     Sym lSym1, lSym;
     lSym1 = symTableA.getFirstSym();
     lSym  = lSym1.getNextSym();
where the first symbol is obtained by getFirstSym(), and the following symbols can be referred consecutively by getNextSym().
If iterator is to be used,
     SymNestIterator lSymNestIterator = symTableA.getSymNestIterator();
followed by
     lSymNestIterator.next()
scans the specified symbol table (symTableA) and all symbol tables below it, and all symbols included are consecutively made possible for references.
If
     lSymNestIterator.nextVar()
is used in stead of next(), all variable in symTableA and in the symbol tables below are made available for references.