4. Outline of the Symbol Table
4.1. Overall structure
4.1.1. Components of the symbol table
The symbol table retains information about various symbols including those
symbols which appear in the source program, such as variables, subprograms,
types and constants, and those generated by the compiler, such as temporary
variables and labels. Distinctions are made among symbols of the same name of
different scopes, and thus a new symbol table for a subprogram which opens a
new scope is created to retain the symbols which are defined in the new scope:
Symbol table (SymTable) of scope A:
The table which retains the symbols defined in scope A
Symbol table (SymTable) of scope B:
The table which retains the symbols defined in scope B
....
The symbol table as a whole takes the form of tree of symbol tables where the
tree is composed of global symbol tables and local symbol tables. The global
symbol table retains symbols that are common between subprograms in a compile
unit. The local symbol table retains symbols that has
local scope such as subprogram, block, etc. For each scope having symbols there
is a symbol table corresponding to the scope. The local symbol table
corresponding to inner scope is a child of the symbol table corresponding to
the outer scope.
Thus a tree of symbol tables is constructed reflecting the
nested structure of scopes. Constants are registered to a global symbol table
named symTableConst so that constants can be shared within the compile unit.
Other global symbols are registered to a global symbol table named symTableRoot.
[[
The tree structure of symbol tables in C language is depicted as follows:
symTableConst Constants in the source program and those generated
by a compiler.
symTableRoot Built-in symbols in HIR (such as basic types) and
| symbols globally defined.
|- symTable A Symbols defined locally in subprogram A, and
| | symbols generated by the compiler in subprogram A
| | excluding constants.
| |- symTable A_b1 Symbols defined locally in block b1 in subprogram A.
| |- symTable A_b1_2 Symbols defined locally in block b1_2 in block b1
| ....
|- symTable B Symbols defined locally in subprogram B, and
| | symbols generated by the compiler in subprogram B
| | excluding constants.
| ....
]]
Symbols are registered in the symbol table by calling factory methods var(),
subp(), label() and others in the interface Sym. Use of "new" is not totally
prohibited, but it is discouraged, and at using "new" to create directly an
entry to the symbol table, full understanding of conventions to do so is
expected. Otherwise malfunctions in the interrelationships are highly likely
to occur.
Information concerning types, scopes and others are much simplified in LIR,
so a separate LIR symbol table is created which holds information needed in
the back end only. This document is dedicated to describe the symbol table for
HIR.
4.1.2. The relationship between symbol table and compiler components
In HIR, variables, subprograms, labels and others are represented, generally,
by their references which indicate where in the symbol table they are
registered. Access methods are provided for to retrieve the type, relationship
with other symbols, and other information of various kinds about each entry
through these references.
Symbols such as variables which appear in data flow equations and so forth
are also given unique identification numbers. Methods to obtain the reference
from the identification number and vice versa are prepared.
4.1.3 Hierarchy of symbols
A symbol (Sym) is, when it is actually created, an instance of one of the
subclasses such as variable (Var), subprogram (Subp), type (Type), constant
(Const), label (Label) and others. Some of these subclasses have their
subclasses. The hierarchy of the classes is shown below:
Sym // Symbol class (super class of all symbol classes).
| // Symbols in the same scope are contained in the same
| // symbol table (instance of SymTable).
|
|- OperandSym // Symbol that can be an operand
| | // of executable expressions.
| |
| |- Var // Variable that can be assigned a value
| | | // and variable with const attribute.
| | |- Param // Formal parameter class.
| | |- Elem // Class for structure element and
| | // union element.
| |
| |- Const // Constant class.
| | |- IntConst // Integer constant.
| | |- FloatConst // Floating point constant.
| | |- CharConst // Character constant.
| | |- StringConst // String constant.
| | |- BoolConst // Boolean constant.
| | |- NamedConst // Named constant including
| | // enumeration constant (other than bool).
| |- Label // Statement label class.
|
|- Subp // Subprogram class (procedures, functions, etc.).
|
|- Type // Type information class.
| |- BaseType // Base types intrinsic
| | // to many programming languages.
| |- VectorType // Vector (1 dimensional array) type.
| |- EnumType // Enumeration type.
| |- PtrType // Pointer type.
| |- StructType // Structure type.
| |- UnionType // Union type.
| |- SubpType // Subprogram type.
| |- RegionType // Region type to define storage area
| | // shared between subprograms.
| |- DefinedType // Types defined in a program (by typedef).
|
|- ExpId // Expression identifier
// (generated to be used in data flow analysis, etc.).
Note.
Sym is actually an interface corresponding to the class SymImpl implementing
it. In the same way, Var is an interface corresponding to the class VarImpl
implementing it, and so on. Users (compiler builders) are requires only to
see interfaces and it is not necessary to see corresponding implementation
classes. It is because to separate interface and implementation class so that
changes of implementation do not affect interface. In this document, sometimes
Sym, etc. are treated in the same way as classes because it is a little
cumbersome to make such distinction in many sentences.
4.2. Major items in symbol tables
Major items to be registered in the symbol tables are listed up in this
section and major fields for each item are also listed up in each subsection
showing corresponding access methods.
4.2.1. Accessing symbol table items
All information stored in the symbol table is referred to or set only
through access methods provided for Sym or its subclasses. But one access
method is not necessarily prepared for one real field. To keep consistency
between fields, one access method might set values to multiple fields, or
an access method in a class might refer a field of another class to calculate
a value. Therefore if users access actual fields directly, then it is likely
to misuse the fields and destroys consistency. For details of the access
methods, refer to Sym interface, Type interface and their lower level interfaces.
4.2.2. Fields common to symbols of all kinds (Sym)
Name (getName()).
The kind of the symbol ( variable, subprogram, etc.; get(SymKind()).
Type (obtainable by using getSymtype() which returns null
for symbols without type. "void" is a type and not null).
The place of definition (the line where the symbol is first declared in
the source program; getDefinedLine(), getDefinedFile()).
The unique name (the unique name within the whole of the compile unit.
Where there are symbols with the same name, a compiler generates names
to avoid name duplication; getUniqueName() which comes to be in effect
after setUniqueNameToAllSym() of SymTable is called.).
Phase specific information (links to information for specific processes;
getWork()).
4.2.3. Variable (Var)
Initial value expression (getInitialValue()).
Size of storage allocated to the variable (getSize()).
(3-1) Independent variable (variable which is neither formal parameter
nor element of aggregate)
Visibility (extern, public, private, etc. ) attribute (getVisibility()).
Storage area (auto, static, register) attribute (getStorageClass()).
(3-2) Formal parameter (Param)
The index number in a paremeter list (getParamIndex()).
The distinction whether a call-by-value or a call-by-reference.
(3-3) Element (of structure or union) (Elem)
--- created for each element of a structure or a union.
Displacement (it can be expressions. evaluateDisp()).
The structure or the union which include this element (getUpperType()).
Whether or not a bit field, and if so the size in bits.
4.2.4. Subprogram (Subp)
Type of return value (getReturnValueType()).
List of formal parameters (getParamList()).
The existence of variable length argument list (getOptionalParam()).
The symbol table (getSymTable()).
The HIR subtree representing subprogram body (getSubpDefinition()).
The list of defined labels (getLabelDeList()).
The list of subprograms called (getCallList()).
Detected error count (getErrorCount()).
4.2.5. Type (Type)
Size in bytes (getSizeValue()).
Expression to calculate the size (getSizeExp()).
Whether or not this is qualified as volatile (is Volatile()).
Alignment value (getAlignment()).
Origin-type (getOrigin()).
(5-1) Basic type (BaseType)
Basic type code (represents which type it is) (getTypeKind()).
(5-2) Vector (VectorType)
Type of elements (getElemType()).
Number of elements (getElemCount()).
Expression to calculate the number of elements (getElemCountExp()).
The lower bound of the index (getLowerBound()).
Whether or not a rectangular array of fixed length (isRectangularArray()).
Note:
Array is represented as a vector or vector of arrays.
Arrays are the major targets for parallelization and optimization. In
general, an array is defined as a vector which is a one dimensional array,
or a vector of vectors, and so on. They take several forms as listed below:
- A rectangular array of fixed size;
- A conformant array whose size is specified by arguments or some other ways;
- A scattered vector (a set of vectors not necessarily placed in contact
with each other, for example multi dimensional arrays in Jave);
- Other forms of arrays.
It is advisable to make distinctions among these forms of arrays for easy
recognition of the targets of a parallelization and an optimization.
(5-3) Enumeration (EnumType)
Ordinal number of the enumerator (getEnumSeqNumber()).
(5-4) Pointer (PtrType)
The type of the pointed object (getPointedType()).
(5-5) Structure (StructType)
The list of the elements (getElemList()).
(5-6) Union (UnionType)
The list of the elements (getElemList())
(5-7) Subprogram (SubpType)
The list of the types of the formal parameters (getParamTypeList())
The existence of variable length argument (hasOptionalParam())
The type of the return value (getReturnType())
4.2.6. Constant
(6-1) Integer constant (IntConst)
The textual representation obtained by getName() has a suffix making
distinctions among types (U for unsigned, L for long, and LL for long long).
An integer constant has internal representation of long int in Java
(shortValue(), intValue(), longValue()). If the value is outside of the range
that the Java long int can represent, isOutOfValueRange() returns true, and the
internal representation is faulty.
(6-2) Floating point constant (FloatConst)
A floating point constant obtained by getName() has a suffix making
distinctions among types (F for float and D for double). A floating point
constant has internal representation of a double in Java (floatValue(),
doubleValue()). If the value is outside of the range that the Java double
can represent, isOutOfValueRange() returns true, and the internal
representation is faulty.
(6-3) Character constants (CharConst)
A character constant is denoted with enclosing quotation marks (').
The internal representation of a character constant is corresponding character
code of type int.
(6-4) Character string constant (StringConst)
Depending on source languages, a character string is denoted with enclosing
marks (such as quotation marks) or escape characters added. We call the string
of characters without these extras as "pure sequence of characters" of the
string. The internal representation of a character string in HIR takes form of
a pure sequence of characters enclosed with double quotes ("). The method
makeStringBody changes a character string constant in source language to the
pure sequence of characters, and makeCstring and others to change the pure
sequence of characters to a character string constant in the source language.
These methods are prepared in SourceLanguage class. When printing the
intermediate representation, the printing format of Java, the compiler
description language, should be followed.
Pure sequence of characters (getStringBody()).
Number of characters (getLength()).
4.2.7. Label (Label)
Link to the statement having this label (getHirPosition()).
Link to a basic block whose first statement has this label (getBBlock())
4.2.8. RegionType
RegionType is an aggregate which defines storage area commonly used, like
COMMON in Fortran, among compile units. In concept, it is like a union with
its elements defined in each one of subprograms. Since not all subprograms
resides within one compile unit, a region type stays incomplete, until link
takes place where all elements are present and linked. At that time the
region type comes to be complete. Because of this, a region type is
represented not by listing up all its elements as in the case of unions, but
by its region name (which is similar to the tags attached to a struct or a
union).
A region type has, as its attribute, a list of pairs of a subprogram and
a symbol table. Here, the symbol table consists of declaration of variables
which form the common storage area in the corresponding subprogram. Within a
subprogram, a region type is treated much the same way as a struct. This means
elements of a region type is treated much the same way as elements of a
struct. For details of the related methods, refer to regionType method in
Sym interface and methods defined in RegionType interface.
4.3. Usage of the symbol table
4.3.1. How to build the symbol table
First, as a part of initialization, symTableRoot to store symbols global
to the whole of a compiling unit, and symTableConst to store constants are
created. These are refered as
symRoot.symTableRoot and
symRoot.symTableConst
respectively.
To create a symbol table to store symbols used locally in a subprogram,
pushSymTable is to be used (a push operation). At the end of a subprogram,
popSymTable is to be used (a pop operation). At the start of a new scope
(such as a block including declarations) within a subprogram, pushSymTable is
to be used to create a symbol table to store symbols used locally within the
scope. At the place of departure from that scope, popSymTable is to be executed.
If a symbol table, say A1, is created by a push operation under a symbol
table A, A1 becomes a child of A. Assume that a symbol table A11 is created
by a push under A1, and another symbol table A12 is created by a push operation
after A11 is popped. The symbol tables A11 and A12, created under the same
symbol table A1, turn to be siblings. If A2 is pushed after A12 and then A1
are popped, A1 and A2 turn to be siblings. In this case, the structure of
the symbol table is as follows:
A --- A1 --- A11
| |- A12
|- A2
By repeating the above processes for all subprograms in a compiling unit,
the tree of symbol tables for the compile unit is built. The methods to be
used to traverse the symbol table tree are as follows:
getParent get the parent symbol table.
getFirstChild get the first child symbol table.
getBrother get the next sibling symbol table created following "this"
under the same parent.
A symbol table has its owner (getOwner()). The owner of a symbol table that
stores local variables and others of a subprogram is the subprogram itself.
Usually, more symbol tables will be created under it. The owner of
symTableRoot and symTableConst is null.
4.3.2 Symbol table initialization
At first, the compiler driver creates an instance (symRoot) of the class
SymRoot which stores information related to symbol tables, and then calls
symRoot.initiate() in order to initialize symbol tables.
Two symbol tables are created at the initialization, one is symTableRoot to
store symbols common to whole of the compile unit and the other is
symTableConst to store constants. In the process of initialization, built-in
symbols such as basic types of HIR are stored in symTableRoot. At the start of
(the outer most) subprogram, a symbol table to hold the symbols defined
locally in the subprogram is created under symTableRoot.
In symRoot, apart from symTableRoot and symConst, two more symbol table
references below exist:
symTableCurrent the symbol table of the inner most scope
currently in process.
symTableCurrentSubp the symbol table of the subprogram currently in process.
symTableCurrent and symTableCurrentSubp must be appropriately establish for
use in flow analysis, optimization and parallelization. If they are not
correctly set, methods related to symbol table handling may not work correctly.
These symbol tables can be referred to using:
symRoot.symTableRoot
symRoot.symTableCurrent
symRoot.symTableCurrentSubp
Note:
Example of procedure to prepare symbol table is shown in
examples/SimpleMain.java
of the distribution file. This example will help to know how to build symbol
table as well as how to write compilers using COINS compiler infrastructure.
4.3.3 Creating symbol tables
To create a symbol which is to be stored in a symbol table, it is strongly
recommended to use factory methods provided for the purpose, such as
symRoot.sym.defineVar(variable name, type)
symRoot.sym.defineSubp(subprogram name, type)
symRoot.sym.defineLabel(label name)
....
For details, refer to Sym interface. It is not recommended to use "new"
directly because such use will lead to programming errors.
When creating a temporary variable or an internal label for optimization
etc., following methods are available:
symRoot.symTableCurrentSubp.generateVar(type name)
symRoot.symTableCurrentSubp.generateLabel()
For details, refer to SymTable interface.
4.3.4 Iterator
Symbols in symTableA can be referenced to by first
SymIterator lSymIterator = symTableA.getSymIterator();
then repeat
lSymIterator.next()
If
lSymIterator.nextVar()
is used instead of next(), only variables (Var, Param, Elem) are retrieved one
after another, all other symbols are skipped. If Iterator is not used,
statements below can be made use of,
Sym lSym1, lSym;
lSym1 = symTableA.getFirstSym();
lSym = lSym1.getNextSym();
where the first symbol is obtained by getFirstSym(), and the following symbols
can be referred consecutively by getNextSym().
If iterator is to be used,
SymNestIterator lSymNestIterator = symTableA.getSymNestIterator();
followed by
lSymNestIterator.next()
scans the specified symbol table (symTableA) and all symbol tables below it,
and all symbols included are consecutively made possible for references.
If
lSymNestIterator.nextVar()
is used in stead of next(), all variable in symTableA and in the symbol tables
below are made available for references.