6. Parallelization for HIR

COINS has features to aid do-all parallelization (Aho, et al.), SMP parallelization, and SIMD parallelization. The do-all parallelization and SMP parallelization are done for HIR. The SIMD parallelization (Suzuki, et al.) is done for LIR and is explained in another chapter (SIMD Parallelization for LIR).

6.1. Loop Parallelization

6.1.1. Overview of do-all parallelization

The package "coins.lparallel" is the do-all type loop parallelizer (Iwasawa and Watanabe, et al.). This package analyzes the program for parallelizable loops and generates either OpenMP program (Chandra et al.) written in C or machine code (assembly language code) to be executed in parallel. When OpenMP program is generated, COINS does not do the parallelization itself, but the driver calls an external OpenMP compiler to execute the program in parallel.

Parallelization heavily depends on the computer used to execute a program and also on the control program (which may be an operating system) that manages parallel execution. COINS does not do fully automatic parallelization. It inserts directives (pragmas) for parallelization in the OpenMP case, or call statements for routines that control parallel execution in the machine code generation case. Such control routines should be prepared by the user according to the computer used, as explained later. Candidate subprograms and loops to be parallelized should also be specified by pragmas, as explained in later paragraphs.

In do-all parallelization, a compiler option specifies whether to generate parallelized object code or an OpenMP program. The loop analyzer of COINS analyzes HIR and judges a for-loop to be parallelizable if the memory areas accessed by each iteration do not intersect with each other, except for the areas accessed through induction variables and reduction variables.
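For illustration, the following sketch (with example function and variable names) shows a loop that satisfies this criterion and one that does not:

  /* A minimal illustration of the do-all criterion (a sketch, not COINS output).
     In the first loop, iteration i accesses only a[i], b[i] and c[i], plus the
     induction variable i and the reduction variable s, so the iterations can
     run in parallel.  In the second loop, iteration i reads a[i-1], which is
     written by the previous iteration, so the memory areas of different
     iterations intersect and the loop is not a do-all loop.                   */
  double doall_example(double a[], const double b[], const double c[], int n)
  {
    int i;
    double s = 0.0;

    for (i = 0; i < n; i++) {        /* parallelizable (do-all) */
      a[i] = b[i] + c[i];
      s = s + a[i];                  /* s is a reduction variable */
    }

    for (i = 1; i < n; i++) {        /* not parallelizable */
      a[i] = a[i - 1] + b[i];
    }
    return s;
  }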

The analyzer also normalizes for-loops so that the loop index starts from 0 and increments by 1, rewriting all subscript expressions so that they are computed from the normalized loop index. All induction variables are likewise changed to be computed from the normalized loop index. This transformation is a simple affine transformation (Aho, et al.).
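For illustration, a loop whose index starts at 1 might be normalized roughly as follows; this is only a sketch, and the code actually produced by COINS may differ in detail.

  /* Loop normalization sketch: the normalized index i0 starts at 0 and steps
     by 1, and every subscript is rewritten as an affine expression of i0.
     The two loops below compute the same result; they are shown together
     only for comparison.                                                    */
  void normalize_example(double a[], const double b[], int n, double c)
  {
    int i, i0;

    /* original loop: index i runs from 1 to n */
    for (i = 1; i <= n; i = i + 1) {
      a[i] = b[i] + c;
    }

    /* normalized loop: i0 = i - 1, so the subscript i becomes i0 + 1 */
    for (i0 = 0; i0 < n; i0 = i0 + 1) {
      a[i0 + 1] = b[i0 + 1] + c;
    }
  }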

If machine code generation is specified, each parallelizable for-loop is transformed into a subprogram to be executed in parallel by slave threads, and the original loop is transformed into code that controls the execution as the master thread. The execution of these threads is controlled by runtime routines designed according to a parallelizing framework.

In do-all parallelization, the functions to be analyzed and the candidate loops to be parallelized are selected by pragmas in the source program, because it is difficult to judge automatically whether a loop can be sped up by parallelization. The loop analyzer determines whether the candidate loops can really be parallelized and extracts the information required for parallel execution.

There is also a feature for forcing parallelization of a loop even if the loop analyzer cannot determine that the loop is parallelizable; in this case, the information required for parallelization should be given by the programmer as a pragma.

6.1.2. Loop analysis for parallelization

If the memory areas accessed by each iteration of a for-loop do not intersect with each other, except for the areas accessed through induction variables and reduction variables, then the loop can be executed in parallel. The loop analyzer of COINS performs such analysis (Iwasawa) for nested loops, as described in this section.

A memory area is represented by the set of variables located in that area. For a given path of the control flow graph, four kinds of memory areas are associated with the path, namely
(a) USE  (use): may be referred to in the path
(b) EUSE (exposed use): may be referred to before
         its value is set in the path
(c) MOD  (modify): its value may be set in the path
(d) DDEF (definitely defined): its value is definitely
         set in the path
As the first step, consider paths in a basic block. If statement s1 precedes statement s2, then the following relations hold:
    USE(s1:s2)  = USE(s1) + USE(s2)  
    EUSE(s1:s2) = EUSE(s1) + (EUSE(s2) - DDEF(s1)) 
    MOD(s1:s2)  = MOD(s1) + MOD(s2)
    DDEF(s1:s2) = DDEF(s1) + DDEF(s2)
where s1:s2 represents the path from s1 to s2, + means union, and - means difference. The same relations hold between two adjacent basic blocks. At the join point of an if-then-else statement, the memory areas are computed as follows:
   USE  = USE1 + USE2 
   EUSE = EUSE1 + EUSE2 
   MOD  = MOD1 + MOD2 
   DDEF = DDEF1 * DDEF2 
where the memory areas suffixed by 1 and 2 represent the memory areas of the then-part and the else-part, respectively. The operator * represents intersection.
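As a small worked example at the variable level (arrays are ignored here), the comments in the following fragment show the sets obtained by applying the rules above to each statement, to the then-part as a path, and to the join point of the if-then-else statement:

  /* Worked example of the four memory areas (variables only; a sketch). */
  void memory_area_example(int a, int b, int p, int *result)
  {
    int t, c;
    if (p > 0) {
      t = a + b;   /* s1: USE={a,b}  EUSE={a,b}  MOD={t}  DDEF={t}         */
      c = t;       /* s2: USE={t}    EUSE={t}    MOD={c}  DDEF={c}         */
                   /* then-part s1:s2:                                     */
                   /*   USE  = {a,b,t}                                     */
                   /*   EUSE = {a,b} + ({t} - {t}) = {a,b}                 */
                   /*   MOD  = {t,c}   DDEF = {t,c}                        */
    } else {
      c = a;       /* else-part: USE = EUSE = {a},  MOD = DDEF = {c}       */
    }
    /* at the join point:
         USE  = {a,b,t} + {a} = {a,b,t}
         EUSE = {a,b}   + {a} = {a,b}
         MOD  = {t,c}   + {c} = {t,c}
         DDEF = {t,c}   * {c} = {c}   (t is definitely set on one path only) */
    *result = c;
  }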

A for-loop of the form
   for (i = 0; i < n; i = i + 1) { .... } 
is regarded as a sequence of iterations, so that at the exit point of the for-loop its memory areas are computed as follows:
   USE  = U(i) USEi
   EUSE = U(i) ( EUSEi - U(k<i) DDEFk )
   MOD  = U(i) MODi
   DDEF = U(i) DDEFi

where U(i) denotes the union (+) taken over all iterations i, U(k<i) denotes the union over the iterations k that precede the i-th iteration, and the memory areas suffixed by i and k represent those of the i-th and k-th iterations, respectively. The above formulas may be regarded as being obtained by expanding the loop. If the loop contains a subscripted variable whose subscript expression cannot be analyzed, then the entire array of the subscripted variable is assumed to be contained in USE, EUSE, and MOD, and nothing of the array is assumed to be contained in DDEF.
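For example, in the following sketch the subscript idx[i] cannot be analyzed, so the whole array a is conservatively assumed to be in USE, EUSE, and MOD, while nothing of a is assumed to be in DDEF:

  /* Un-analyzable subscript example (a sketch).  Because idx[i] cannot be
     analyzed, the analyzer assumes the entire array a may be used and
     modified, and that no element of a is definitely defined.             */
  void unanalyzable_subscript(double a[], const double b[],
                              const int idx[], int n)
  {
    int i;
    for (i = 0; i < n; i++) {
      a[idx[i]] = b[i];
    }
  }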

Loops are analyzed from the innermost loops to the outermost loops. As the memory areas of an inner loop, those at the exit point of the inner loop are used. Subscript expressions in the inner loop are classified as invariant, induction, or un-analyzable as follows:

  Inner loop view   Outer loop view         Example
  invariant         invariant               c[m]
                    induction               a[j]
                    un-analyzable           b[k]
  induction         invariant with range    aa[i]
                    induction with range    cc[m]
                    un-analyzable           bb[k]
  un-analyzable     un-analyzable
where the column Example shows examples of subscripted variables taken from Figure 6.1; the subscript expressions in Figure 6.1 are classified as shown by the comments.
/* Inner loop invariant subscripts */
for (j = 0; j < n; j++) {
  k = x[j];
  for (i = 0; i < n; i++) {
    a[j]  /* Outer loop induction subscript      */
          /* if it is an induction variable      */
          /* of the outer loop.                  */
    b[k]  /* Outer loop un-analyzable subscript  */
          /* if it is defined in the outer loop  */
          /* and is not an induction variable of */ 
          /* the outer loop.                     */
    c[m]  /* Outer loop invariant subscript if   */
          /* its operands are defined neither in */
          /* the outer loop nor in the inner loop*/
  }
} 
/* Inner loop induction subscripts */
for (j = 0; j < n; j++) {
  k = x[j];
  for (i = 0; i < n; i++) {
    aa[i]  /* Invariant subscript with range if  */
           /* its initial, final, increment      */
           /* values are all outer loop invariant*/
    bb[k]  /* Outer loop un-analyzable subscript */
           /* if some of its initial, final,     */
           /* increment values are defined in    */ 
           /* the outer loop and are not its     */
           /* induction variable.                */
  }
  for (m = j; m < n; m = m + 1) {
    cc[m]  /* Outer loop induction subscript    */
           /* with range if some of its initial */
           /* or final values are outer loop    */
           /* induction variable and the other  */
           /* term and increment value are outer*/
           /* loop invariant */
  }
}
Figure 6.1 Classification of subscript expressions

If the following memory areas are not empty for a loop, they are considered hazards for do-all parallelization:

Loop carried flow dependency:
   U(i) ( EUSEi * ( U(k<i) MODk ) )

Loop carried anti-dependency:

   U(i) ( MODi * ( U(k<i) USEk ) )

Loop carried output-dependency:

   U(i) ( MODi * ( U(k<i) MODk ) )

where the terms loop carried flow dependency, loop carried anti-dependency, and loop carried output dependency are used in the same way as by D. F. Bacon, et al.

A loop having loop carried output dependency and loop carried anti-dependency may still be parallelized by privatization, as explained later, if it does not contain loop carried flow dependency variables (other than induction variables).
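For illustration, in the following sketch the scalar t causes loop carried anti- and output-dependency but no flow dependency, because every iteration sets t before using it; giving each thread its own copy of t (expressed here with an OpenMP private clause, as a sketch rather than actual COINS output) removes the hazard:

  /* Privatization sketch: t is in DDEF of every iteration and is not in EUSE,
     so a private copy of t per thread makes the iterations independent.      */
  void privatization_example(const double a[], double b[], int n)
  {
    int i;
    double t;
    #pragma omp parallel for private(t)
    for (i = 0; i < n; i++) {
      t = a[i] * 2.0;     /* t is set in every iteration (output dependency)    */
      b[i] = t + 1.0;     /* and used only after being set (no flow dependency) */
    }
  }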

6.1.3. Parallelization of nested loops

Let us consider the example of a nested loop shown in Figure 6.2, where N is a constant defined by #define.
  for (i = 0; i < N; i++) {
    x[0] = 0;
    x[1] = 10;
    for (j = 2; j < N; j++) {
      x[j] = c[i] * (x[j-1] + x[j-2]) / 2;
      z[i][j] = x[j];
    }
  }
Figure 6.2 Nested loop

The inner loop of this example has the following characteristics:
   DDEFj = { z[i][j], x[j], j }
   MODj  = { z[i][j], x[j], j }
   USEj  = { i, j, c[i], x[j], x[j-1], x[j-2] }
   EUSEj = { i, j, c[i], x[j-1], x[j-2] }
The loop carried flow dependency set contains j, but j is an induction variable, which is not a hazard for parallelization. As for the array x, there are anti-dependency and output-dependency, and the values x[j-1] and x[j-2] used in the j-th iteration are set in earlier iterations, so the inner loop cannot be parallelized. In the outer loop analysis, the data flow information of the inner loop is expanded; for example,
   DDEF = DDEF2 + DDEF3 + ... + DDEFN-1
        = { z[i][2], x[2], j } + { z[i][3], x[3], j } + ... + { z[i][N-1], x[N-1], j }

Thus, the memory areas accessed by the entire inner loop are

   DDEF = { z[i][2:N-1:1], x[2:N-1:1], j }
   MOD  = { z[i][2:N-1:1], x[2:N-1:1], j }
   USE  = { i, j, c[i], x[0:N-1:1] }
   EUSE = { i, j, c[i], x[0], x[1] }
where the notation [a:b:c] represents the subscript range with start value a, final value b, and step value c. Only x[0] and x[1] remain in EUSE because they are the only elements of x that may be used before being set within the inner loop; every element x[j] with j >= 2 is set by an earlier iteration of the inner loop before it is used.

As for the outer loop, the memory areas accessed by the i-th iteration are
   DDEFi = { z[i][2:N-1:1], x[0:N-1:1], j, i }
   MODi  = { z[i][2:N-1:1], x[0:N-1:1], j, i }
   USEi  = { i, j, c[i], x[0:N-1:1] }
   EUSEi = { i, c[i] }
The loop carried flow dependency set contains i, but i is an induction variable. Both the loop carried anti-dependency set and the loop carried output dependency set contain x[0:N-1:1]. This means that the loop cannot be parallelized straightforwardly. The memory area x[0:N-1:1] is, however, included in DDEF and not in EUSE, meaning that for every i the area is first set and then used, so the outer loop can be parallelized by privatizing the array x (allocating a local copy for each thread).
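For illustration, the privatized outer loop could look roughly like the following OpenMP sketch, in which the array x and the inner index j are declared inside the parallel loop body and therefore become private to each thread; this is not the exact code generated by COINS, and the value of N below is only an assumed example.

  #define N 1000              /* assumed example value; Figure 6.2 only says
                                 that N is a constant defined by #define    */

  /* Outer loop of Figure 6.2 parallelized by privatizing x (a sketch). */
  void figure62_privatized(const double c[N], double z[N][N])
  {
    int i;
    #pragma omp parallel for
    for (i = 0; i < N; i++) {
      double x[N];            /* private copy of x for each thread/iteration */
      int j;
      x[0] = 0;
      x[1] = 10;
      for (j = 2; j < N; j++) {
        x[j] = c[i] * (x[j - 1] + x[j - 2]) / 2;
        z[i][j] = x[j];
      }
    }
  }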

6.1.4. Framework for parallel execution

As already mentioned, COINS generates parallelized machine code as well as an OpenMP program, based on the loop analysis explained in the sections above. When parallelized machine code is generated, a set of runtime routines to control parallel execution should be provided for COINS according to a simple framework. The kernel of the parallelizing framework is composed of functions that
  (1) initiate,
  (2) fork,
  (3) join,
  (4) get self Id,
  (5) terminate,
  (6) preprocess for do-all loop,
  (7) postprocess for do-all loop.
Figure 6.3 shows some of these functions.
  int  coinsThreadInit( .... );
  int  coinsThreadEnd();
  int  coinsThreadFork( .... );
  int  coinsThreadJoin( .... );
  int  coinsThreadSelfId();
  void coinsThreadPreprocessForDoAllLoop ( .... );
  void coinsThreadPostprocess( .... );
  int  coinsThreadForkForDoAll( .... ); 
  void coinsThreadPreprocessForDoAllThread();
  void coinsThreadPostprocessForDoAllThread();
    ......
Figure 6.3 Run time routines for parallelization

The framework may be thought of as a small subset of POSIX Threads, but it is kept as small as possible so that it can be implemented easily. It realizes parallel execution in various execution environments irrespective of whether an operating system is provided or not. Some of these functions may be implemented in a few machine instructions; in such cases, the inline assembly feature (the asm statement of the C language) helps to realize parallelization with small overhead.
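On a platform that already provides an operating system with POSIX Threads, some of these routines can be thin wrappers around pthread calls. The following is only a hypothetical sketch: the real parameter lists are elided ('....') in Figure 6.3, so the signatures, the MAX_THREADS limit, and the bookkeeping used here are assumptions, not the actual COINS framework.

  /* Hypothetical POSIX Threads implementation of part of the framework
     (assumed signatures; on a bare-metal target these routines would start
     and stop the cores directly, possibly using a few inline asm statements). */
  #include <pthread.h>

  #define MAX_THREADS 8                       /* assumed limit */
  static pthread_t coinsThreads[MAX_THREADS];

  int coinsThreadInit(int numberOfThreads)    /* parameter assumed */
  {
    (void)numberOfThreads;                    /* nothing to set up in this sketch */
    return 0;
  }

  int coinsThreadFork(int id, void *(*body)(void *), void *arg)   /* assumed */
  {
    return pthread_create(&coinsThreads[id], NULL, body, arg);
  }

  int coinsThreadJoin(int id)                 /* assumed */
  {
    return pthread_join(coinsThreads[id], NULL);
  }

  int coinsThreadEnd(void)
  {
    return 0;
  }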

6.1.5. Explanation using examples

The program in Figure 6.4 computes the length of a curve whose x, y coordinates are given by arrays x and y. The left-most numbers show line numbers.
 1) #pragma optControl 
        functionsWithoutSideEffect distance
 2) #pragma parallel doAllFunc main 
 3) #ifndef MAIN
 4) #define MAIN
 5) #endif
 6) #include "coinsParallelFramework.h"
 7) #include <math.h>
 8) double distance(double px[],double py[],int pi);
 9) double setData(double px[], double py[],int pn);
10) int printf(char*, ...);
11) double x[10000], y[10000];
12) int main() {
13)   int i, n = 720;
14)   double t, sum = 0.0;
15) #pragma parallel init
16)   setData(x, y, n);
17) #pragma parallel doAll
18)   for (i = 0; i < n; i++) {
19)     sum = sum + distance(x, y, i);
20)   } 
21)   printf("n ans = %f \n", sum);
22) #pragma parallel end
23)   return 0;
24) }
Figure 6.4 Program to be parallelized

Line 1 declares that the function distance has no side effects, so loops calling it can be parallelized. Line 2 indicates the function to be analyzed. Line 6 includes the declarations of the parallelizing framework. Line 15 initiates parallelization, line 17 indicates a candidate for-loop, and line 22 terminates parallelization.

COINS transforms the loop to be parallelized into a function and generates the following sequence of statements at the place of the loop (a rough sketch of such a sequence is shown after the list):
(1) Call coinsThreadPreprocessForDoAllLoop,
   which does preprocessing for parallel execution and
   distributes value ranges of the induction variable to
   the threads according to the execution environment.
(2) Prepare the parameters to be passed to each thread.
(3) Invoke the threads to be executed in parallel.
(4) Execute, as the master thread, a loop statement that
   does the operations for one value range of the
   induction variable.
(5) Call coinsThreadPostprocess,
   which does postprocessing and the join operation for
   the threads executed in parallel.
(6) Synthesize the reduction values from the values
   computed by the executed threads.
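The following fragment is only a rough sketch of such a master-side sequence for the loop of Figure 6.4, assuming the declarations of Figures 6.4 and 6.5 are in scope; the parameter lists of the framework routines are elided in Figure 6.3, so the signatures and the names from, to, partialSum, numThreads, and MAX_PAR used here are assumptions, not the code actually generated by COINS.

  /* master-side sketch of steps (1)-(6); all framework signatures assumed */
  enum { MAX_PAR = 4 };                /* assumed maximum number of threads   */
  long from[MAX_PAR], to[MAX_PAR];     /* index range assigned to each thread */
  double partialSum[MAX_PAR];          /* per-thread value of the reduction
                                          variable sum                        */
  int t, numThreads;

  /* (1) preprocess: distribute the index range 0..n-1 among the threads */
  coinsThreadPreprocessForDoAllLoop(n, &numThreads, from, to);

  /* (2), (3) prepare parameters and invoke the slave threads */
  for (t = 1; t < numThreads; t++) {
    coinsThreadForkForDoAll(t, main_loop_0, from[t], to[t],
                            &partialSum[t], (void *)0);
  }

  /* (4) the master thread itself executes its own index range */
  main_loop_0(0, from[0], to[0], &partialSum[0], (void *)0);

  /* (5) postprocess and join the slave threads */
  coinsThreadPostprocess(numThreads);

  /* (6) synthesize the reduction value from the per-thread results */
  for (t = 0; t < numThreads; t++) {
    sum = sum + partialSum[t];
  }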
The program in Figure 6.5 is the function corresponding to the above for-loop, transformed for parallelization. It was generated from the parallelized HIR by the HIR-to-C module of COINS and reformatted to improve readability, although COINS can also generate machine code from the parallelized HIR. Its parameters specify the value range of the loop index i (_indexFrom_0 and _indexTo_0) and up to two write-back places for reduction variables (sum_0 and _voidPtr_0). The last parameter is not used in this case because there is only one reduction variable, sum.
 1) void *main_loop_0( int _threadId_0, 
        long _indexFrom_0, long _indexTo_0, 
        double *sum_0, void *_voidPtr_0 ) {
 2)   int i, n;
 3)   double sum;
 4)   coinsThreadPreprocessForDoAllThread();
 5)   sum = 0.0;
 6)   { i = 0; }
 7)   for ( i = _indexFrom_0;i < _indexTo_0;
            i = i + 1) {
 8)     sum = sum + (& distance)(x, y, i);
 9) _lab29:;
10)   }
11)   _lab33:;
12)   *sum_0 = sum;
13)   coinsThreadPostprocessForDoAllThread();
14) }
Figure 6.5 Function to be executed by threads

In some cases, the loop analyzer may be unable to determine whether a given loop can be parallelized or not. Such a loop can still be parallelized if the information required for parallelization is given by a pragma. Figure 6.6 shows an example of such manual parallelization. In this case, the pragma in line 1 states that the loop performs reduction operations computing the maximum and minimum values, and that the corresponding reduction variables are max and min.
 1) #pragma parallel forceDoAll 
        (reduction ("max" max) ("min" min))
 2)  max = x[0];
 3)  min = x[0];
 4)  for (i = 0; i < n; i++) { 
 5)   if (x[i] > max)
 6)     max = x[i];
 7)   if (x[i] < min)
 8)     min = x[i];
 9)  }
Figure 6.6 Example of forced parallelization

COINS generates the function main_loop_1 shown in Figure 6.7 as the function to be executed by threads. The write-back places of the reduction values are min_0 and max_0. The initial values of the reduction variables min and max are passed through the global variables _min_global_0 and _max_global_0, as seen in line 5.
 1) void * main_loop_1( int _threadId_0,
       long _indexFrom_0, long _indexTo_0,
       int * min_0,int * max_0 ) {
 2)  int i, max, min;
 3)  coinsThreadPreprocessForDoAllThread();
 4)  min = 0; max = 0;
 5)  min = _min_global_0; max = _max_global_0;
 6)  { i = (int )0; }
 7)  for (i= _indexFrom_0;i < _indexTo_0;
          i= i+1) {
 8)    if (_x_global_1[i] > max) {
 9)     _lab77:; { max = _x_global_1[i]; } }
10)    if (_x_global_1[i] < min) {
11)     _lab80:; { min = _x_global_1[i]; } }
12)    _lab83:;
13)  } _lab87:;
14)  *min_0 = min; *max_0 = max;
15)  coinsThreadPostprocessForDoAllThread();
16) }
Figure 6.7 Function to be executed by threads
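As in step (6) of the sequence described in the previous section, the master thread combines the per-thread results written back through min_0 and max_0 after the join. A minimal sketch of such a combining step, using hypothetical names (threadMin, threadMax, numThreads) rather than the identifiers actually generated, is:

  /* Combine per-thread partial results of the "max"/"min" reductions
     into the final max and min (hypothetical names, for illustration only). */
  static void synthesizeMinMax(const int threadMin[], const int threadMax[],
                               int numThreads, int *min, int *max)
  {
    int t;
    for (t = 0; t < numThreads; t++) {
      if (threadMax[t] > *max) *max = threadMax[t];
      if (threadMin[t] < *min) *min = threadMin[t];
    }
  }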

6.1.6. An example of parallelization

To show the applicability of COINS to embedded systems, a low-priced FPGA board (Xilinx Virtex-4 FX) was used as an experimental environment. Multiple soft CPU cores of the Xilinx MicroBlaze architecture (MicroBlaze) were implemented on it.
A code generator for this CPU was implemented in two months by writing a TMD (target machine description) for MicroBlaze.

This system can achieve reasonable parallelization for applications whose memory load factor is not high. Laplacian filter programs, which are used frequently in image processing, are one kind of such application.
On the FPGA board with 3 CPU cores (clock: 75 MHz), a Laplacian filter program compiled with the parallelizing compiler option was executed in 2.7 seconds for image data of 160 * 120 pixels. When executed on 1 CPU, it took 7.5 seconds. Thus it achieved a 2.9 times speedup compared to the program compiled without the parallelizing compiler option in this case.

6.1.7. Usage

6.1.7.1. Designate candidates for parallelization

It is difficult to judge automatically whether a given loop can be sped up by parallelization. In the loop parallelization of COINS, the programmer designates candidates for parallelization by pragmas. The parallelizing module of COINS analyzes the designated loops, determines whether parallelization is possible for them, and, if possible, generates a parallelized C program or parallelized machine code.

A pragma of the form
    #pragma parallel doAllFunc func1 func2 ...
designates the functions func1, func2, ... as candidates for parallelization.
The pragma
    #pragma parallel doAllFuncAll
designates all functions in the given input program as candidates for parallelization.
For functions that are not candidates, parallelization processing is skipped.

Before a loop expected to be parallelized, the pragma
    #pragma parallel doAll
should be inserted. The loop encountered first after this pragma is treated as a candidate for parallelization.
Each pragma should be written on a single line.

6.1.7.2. Generation of parallelized C program

To generate an OpenMP program written in C, give a compile command of the form
  java coins.driver.Driver -coins:parallelDoAll=OpenMP foo.c
This generates a C program with OpenMP pragmas, synthesizing the output file name by adding "-loop" to the base of the input file name (foo-loop.c in this case). Users can execute the generated C program in parallel by giving it to some OpenMP compiler.
(There may be other configuration, such as environment variables, needed to execute the resulting code in parallel; see the manual of your OpenMP compiler.)

6.1.7.3. Generation of parallelized machine code

To generate assembly language code executable in parallel, give a compile command of the form
  java coins.driver.Driver -S -coins:parallelDoAll=n foo.c
where n is an integer indicating the maximum degree of parallelization. The assembly language program can be executed in parallel by linking it with runtime routines for parallelization. These runtime routines should be provided for the target execution environment according to the parallelizing framework described in the previous section (Framework for parallel execution).

A command such as
  java coins.driver.Driver -S -coins:parallelDoAll=n,hir2c foo.c
will produce both an assembly language program and a C program, each of which can be executed in parallel by linking with the runtime routines for parallelization.

6.1.7.4. Generation of parallelized execution module

To output executable code using the OpenMP compiler, give a compile command of the form
  java coins.lparallel.LoopPara foo.c
LoopPara is a subclass of the COINS driver coins.driver.Driver. It passes the generated C program with OpenMP pragmas to the OpenMP compiler Omni (OpenMP Compiler Omni) and generates an executable module.

6.2. Coarse Grain Parallelizing Module

6.2.1. Overview

6.2.1.1. Design Concept

The coarse-grain parallelizing module is constructed to realize a coarse-grain parallelizing compiler named CoCo in Java. CoCo is a research product and is still at an infant stage as a parallelizing compiler; therefore it has many constraints on practical usage, as mentioned later. By implementing CoCo as an automatic parallelizing compiler, we have found many important issues that should be solved for practical coarse-grain parallelizing compilers. The coarse-grain parallelizing module is a part of the COINS infrastructure, and its components are available as a set of parts for coarse-grain parallelization.

CoCo analyzes an input C program and transforms it into a macro (coarse-grain) task graph with data and control flow dependences. Then CoCo parallelizes the macro tasks by using OpenMP directives for SMP machines. This analysis and transformation are carried out on the COINS HIR (High-level Intermediate Representation). CoCo generates a parallel program in HIR containing OpenMP directives as comments. The HIR program is translated into a C program with OpenMP directives by the HIR-to-C translator. Finally, it is compiled by the Omni OpenMP compiler and executed in parallel on an SMP machine.

Macro tasks correspond to basic blocks, loops, and/or subroutines. After an input C program is divided into macro tasks, the execution starting condition of each macro task is analyzed. The execution starting condition represents whether the macro task can be executed at a given time. A runtime macro task scheduler evaluates the execution starting condition of each macro task at execution time and dynamically assigns executable ones to lightly loaded processors of an SMP machine.

The coarse-grain parallelizing module is a tool set that provides the following functions:

  1. Dividing an input C program into macro tasks based on basic blocks,
  2. Analyzing the execution starting condition of each macro task,
  3. Embedding OpenMP directives for parallel execution at the macro task level,
  4. Dynamically scheduling each macro task to a processor of an SMP machine.

6.2.1.2. Data Structures

The coarse-grain parallelizing module utilizes a macro flow graph model. Nodes of the graph correspond to macro tasks. Between nodes there are two types of edges, representing control flow dependences and data flow dependences.

An execution starting condition is represented as a boolean expression whose operators are 'logical AND' and 'logical OR'. Its operand conditions are as follows:

  1. Whether the macro tasks on which a macro task has data dependence have been executed or have been decided not to be executed,
  2. Whether the control flow dependence leading to the macro task has been decided.

For example, a macro task that has data dependence on macro task MT1 and is control dependent on MT2 becomes executable roughly when 'MT1 has finished (or is decided not to execute) AND the branch in MT2 that leads to it has been decided'.

6.2.1.3. Scheduler

The runtime macro task scheduler is attached separately to the main part of the output program. If the input program is named 'xxx.c', the scheduler, written in the C language, is placed in a file named 'xxx-sch.c' in the same directory.

6.2.2. Constraints in the current implementation

The current version of the coarse-grain parallelizing module, CoCo, has the following constraints:
  1. The coarse-grain parallelizing module parallelizes only the main function. When an input program has several functions, the module ignores the other functions.
  2. To execute a program in parallel efficiently, the module should adjust the granularity of tasks by techniques such as loop unrolling. The current module does not do this.
  3. A loop in a program is translated into a single macro task. The module recognizes as loops only the HIR constructs corresponding to the reserved words 'while' and 'for'; other types of loops are not translated into macro tasks.
  4. The module recognizes a macro task as an exit task only if the task has no successors or includes a return statement. Other macro tasks, for example those that call the 'exit()' function, are not recognized as exit tasks.
  5. When some macro tasks have no dependence on each other, their execution order may differ from the sequential execution order.

6.2.3. Usage

CoCo inserts OpenMP directives into an HIR program as comments for coarse-grain parallelization. CoCo utilizes the 'hir2c' module, a translator from HIR to an OpenMP program written in C (OpenMP/C program), since the back end of COINS does not support OpenMP directives yet. After the coarse-grain parallel OpenMP/C program is generated by hir2c, you must compile it with an OpenMP compiler in order to execute it in parallel on an SMP machine.

To obtain a coarse-grain parallel program, you should operate as follows:

  1. Compile 'xxx.c' with the COINS C compiler, specifying the option '-coins:coarseGrainParallel' or '-coins:cgParallel'.
     java -classpath ./classes coins.driver.Driver -coins:cgParallel xxx.c
    
    The generated OpenMP/C file will have the suffix "-cgparallel", e.g., xxx-cgparallel.c for the source file xxx.c.
  2. Compile the generated program together with the runtime scheduler using the Omni OpenMP compiler 'omcc'.
      omcc xxx-cgparallel.c xxx-sch.c
    

6.2.4. Options

There are several compile time options for the coarse-grain parallelizing module. For other options of the COINS Compiler Driver, see 2. How to use the Compiler Driver or 3. How to use C Compiler.
-coins:trace=MDF.xxxx
Outputs trace information of this module for debugging; the trace level is specified as follows:
        2000 :  Output general debug information of the module.
-coins:stopafterhir2c
Quits compilation of each compile unit just after a C program is generated by 'hir2c'.
-coins:coarseGrainParallel
-coins:cgParallel
Invokes this module.

6.3 References

  [1] A. V. Aho, M. S. Lam, R. Sethi, J. D. Ullman: Compilers - Principles, Techniques, & Tools, Second Edition, Addison Wesley (2006).
  [2] M. Suzuki, N. Fujinami, T. Fukuoka, T. Watanabe, I. Nakata: SIMD optimization in COINS compiler infrastructure, Proc. of IWIA 2005, pp. 131-140 (January 2005).
  [3] K. Iwasawa: Automatic parallelization method of loops by conditional region analysis, Proc. of the 16th IASTED International Conference Applied Informatics, pp. 310-312 (1998).
  [4] Tan Watanabe, Tetsuro Fujise, Koichiro Mori, Kyoko Iwasawa, Ikuo Nakata: Design assists for embedded systems in the COINS Compiler Infrastructure, Proc. of IWIA 2007, pp. 60-69 (Jan. 2007).
  [5] R. Chandra, L. Dagum, D. Kohr, D. Maydan, J. McDonald, R. Menon: Parallel Programming in OpenMP, Morgan Kaufmann Publishers (2001).
  [6] D. F. Bacon, S. L. Graham, O. J. Sharp: Compiler transformations for high-performance computing, ACM Computing Surveys, Vol. 26, No. 4, pp. 345-420 (Dec. 1994).
  [7] Xilinx, Inc.: MicroBlaze Architecture, http://www.xilinx.com/.
  [8] Omni OpenMP Compiler Project: OpenMP Compiler Omni, http://phase.hpcc.jp/Omni/.