The package "coins.lparallel" is the do-all type loop parallelizer (Iwasawa and Watanabe, et al.). It analyzes the program for parallelizable loops and generates either an OpenMP program (Chandra et al.) written in C or machine code (assembly language code) to be executed in parallel. When an OpenMP program is generated, COINS does not do the parallelization itself; instead, the driver calls an external OpenMP compiler so that the program can be executed in parallel.
Parallelization heavily depends on the computer used to execute a program and on the control program (typically an operating system) that governs parallel execution. COINS does not perform fully automatic parallelization. It may insert directives (pragmas) for parallelization in the OpenMP case, or call statements for routines that control parallel execution in the machine code generation case. Such control routines should be prepared by the user according to the computer used, as explained later. Candidate subprograms and loops to be parallelized should also be specified by pragmas, as explained in later paragraphs.
In do-all parallelization, a compiler option selects whether parallelized object code or an OpenMP program is generated. The loop analyzer of COINS analyzes HIR and identifies a for-loop as parallelizable if the memory access areas covered by its iterations do not intersect with each other, except for the areas covered by induction variables and reduction variables.
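For example (an illustrative sketch of our own, not taken from the COINS sources), the first loop below satisfies this condition while the second does not:

/* Parallelizable: different iterations write disjoint elements a[i];  */
/* i is an induction variable and s is a reduction variable, which are */
/* the permitted exceptions.                                           */
for (i = 0; i < n; i++) {
    a[i] = b[i] + c[i];
    s = s + a[i];
}

/* Not parallelizable: iteration i reads a[i-1], which is written by   */
/* the preceding iteration, so the accessed memory areas intersect.    */
for (i = 1; i < n; i++) {
    a[i] = a[i-1] + b[i];
}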
The analyzer also normalizes for-loops so that the loop index starts from 0 and increments by 1, changing all subscript expressions so that they are computed from the normalized loop index. All induction variables are likewise changed to be computed from the normalized loop index. This transformation is a simple affine transformation (Aho, et al.).

For each execution path, the following memory area sets are computed:
(a) USE (use): may be referenced in the path
(b) EUSE (exposed use): may be referenced before its value is set in the path
(c) MOD (modify): its value may be set in the path
(d) DDEF (definitely defined): its value is definitely set in the path

As the first step, consider paths in a basic block. If statement s1 precedes statement s2, the following relations hold:
USE(s1:s2)  = USE(s1) + USE(s2)
EUSE(s1:s2) = EUSE(s1) + (EUSE(s2) - DDEF(s1))
MOD(s1:s2)  = MOD(s1) + MOD(s2)
DDEF(s1:s2) = DDEF(s1) + DDEF(s2)

where s1:s2 represents the path from s1 to s2, + means union, and - means set difference. The same relations hold between two adjacent basic blocks. At the join point of an if-then-else statement, the memory areas are computed as follows:
USE  = USE1 + USE2
EUSE = EUSE1 + EUSE2
MOD  = MOD1 + MOD2
DDEF = DDEF1 * DDEF2

where the memory areas suffixed by 1 and 2 are those of the then-part and the else-part, respectively, and the operator * represents intersection.
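As a small worked example of these rules (constructed here, not taken from the COINS documentation; the use of the if-condition p is ignored for brevity):

t = a + b;     /* s1: USE={a,b}  EUSE={a,b}  MOD={t}    DDEF={t}       */
if (p) {
    u = t;     /* then-part: USE1={t}  EUSE1={t}  MOD1={u}    DDEF1={u}     */
} else {
    u = a;     /* else-part: USE2={a}  EUSE2={a}  MOD2={u,v}  DDEF2={u,v}   */
    v = 1;
}
/* Join of then-part and else-part:                                     */
/*   USE  = {t}+{a}   = {t,a}      EUSE = {t}+{a}   = {t,a}              */
/*   MOD  = {u}+{u,v} = {u,v}      DDEF = {u}*{u,v} = {u}                */
/* Sequence s1 followed by the if-then-else:                            */
/*   USE  = {a,b}+{t,a}       = {a,b,t}                                  */
/*   EUSE = {a,b}+({t,a}-{t}) = {a,b}                                    */
/*   MOD  = {t}+{u,v} = {t,u,v}    DDEF = {t}+{u} = {t,u}                */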
A for-loop

for (i = 0; i < n; i = i + 1) { .... }

is considered to be a sequence of iterations, so that at the exit point of the for-loop its memory areas are computed as follows:
USE  = the union of USE_i over all iterations i
EUSE = the union over all iterations i of ( EUSE_i - the union of DDEF_k over all k < i )
MOD  = the union of MOD_i over all iterations i
DDEF = the union of DDEF_i over all iterations i

where the memory areas suffixed by i and k represent those of the i-th iteration and the k-th iteration, respectively. The above formulas may be considered to be obtained by expanding the loop and applying the relations for s1:s2 to successive iterations. If the loop contains a subscripted variable with a subscript expression that cannot be analyzed, then the entire array of the subscripted variable is assumed to be contained in USE, EUSE, and MOD, and nothing of the array is assumed for DDEF.
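For instance (an illustrative sketch of our own), a subscript taken from another array cannot be analyzed, so the accessed area of the subscripted array is approximated by the whole array:

/* The subscript idx[i] cannot be analyzed as an expression of the     */
/* loop index, so the whole array a is assumed to be in USE, EUSE and  */
/* MOD for this loop, and no part of a is assumed to be in DDEF.       */
for (i = 0; i < n; i++) {
    a[idx[i]] = a[idx[i]] + b[i];
}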
Loops are analyzed in order from the innermost loop to the outermost loop. As the memory areas for an inner loop, the memory areas at the exit point of the inner loop are used. Subscript expressions in the inner loop are classified as invariant, induction, or un-analyzable as follows:
  Inner loop view    Outer loop view         Example
  ---------------    --------------------    -------
  invariant          invariant               c[m]
                     induction               a[j]
                     un-analyzable           b[k]
  induction          invariant with range    aa[i]
                     induction with range    cc[m]
                     un-analyzable           bb[k]
  un-analyzable      un-analyzable

where the column Example shows examples of subscripted variables in Figure 6.1. For example, the subscript expressions in Figure 6.1 are classified as shown by the comments.
/* Inner loop invariant subscripts */
for (j = 0; j < n; j++) {
  k = x[j];
  for (i = 0; i < n; i++) {
    a[j]   /* Outer loop induction subscript      */
           /* if it is an induction variable      */
           /* of the outer loop.                  */
    b[k]   /* Outer loop un-analyzable subscript  */
           /* if it is defined in the outer loop  */
           /* and is not an induction variable of */
           /* the outer loop.                     */
    c[m]   /* Outer loop invariant subscript if   */
           /* its operands are defined neither in */
           /* the outer loop nor in the inner loop*/
  }
}
/* Inner loop induction subscripts */
for (j = 0; j < n; j++) {
  k = x[j];
  for (i = 0; i < n; i++) {
    aa[i]  /* Invariant subscript with range if   */
           /* its initial, final, increment       */
           /* values are all outer loop invariant */
    bb[k]  /* Outer loop un-analyzable subscript  */
           /* if some of its initial, final,      */
           /* increment values are defined in     */
           /* the outer loop and are not its      */
           /* induction variable.                 */
  }
  for (m = j; m < n; m = m + 1) {
    cc[m]  /* Outer loop induction subscript      */
           /* with range if some of its initial   */
           /* or final values are outer loop      */
           /* induction variable and the other    */
           /* term and increment value are outer  */
           /* loop invariant                      */
  }
}

Figure 6.1 Classification of subscript expressions
From these memory areas, the loop carried flow dependency, loop carried anti-dependency, and loop carried output-dependency are computed, where these terms are used in the same way as by D. F. Bacon, et al. A loop having loop carried output-dependency and loop carried anti-dependency may still be parallelized by privatization, as explained later, if it contains no loop carried flow dependency variables.
for (i = 0; i < N; i++) {
  x[0] = 0;
  x[1] = 10;
  for (j = 2; j < N; j++) {
    x[j] = c[i] * (x[j-1] + x[j-2]) / 2;
    z[i][j] = x[j];
  }
}

Figure 6.2 Nested loop
For the inner loop of the nested loop shown in Figure 6.2, the memory areas accessed in one iteration (the j-th iteration) are

DDEFj = { z[i][j], x[j], j }
MODj  = { z[i][j], x[j], j }
USEj  = { i, j, c[i], x[j], x[j-1], x[j-2] }
EUSEj = { i, j, c[i], x[j-1], x[j-2] }

The loop carried flow-dependency contains j, but it is an induction variable, which is not a hazard for parallelization. As for the array x, there are anti-dependency and output-dependency. Thus the inner loop cannot be parallelized. In the analysis of the outer loop, the data flow information of the inner loop is expanded over the range of the inner loop index (for example, x[j] is expanded to x[2:N-1:1]).
Thus, the accessed memory areas for the entire inner loop are
DDEF = { z[i][2:N-1:1], x[2:N-1:1], j }
MOD  = { z[i][2:N-1:1], x[2:N-1:1], j }
USE  = { i, j, c[i], x[0:N-1:1] }
EUSE = { i, j, c[i], x[0], x[1] }

where the notation [a:b:c] represents the subscript range having start value a, final value b, and step value c.
The memory areas accessed in one iteration (the i-th iteration) of the outer loop are

DDEFi = { z[i][2:N-1:1], x[0:N-1:1], j, i }
MODi  = { z[i][2:N-1:1], x[0:N-1:1], j, i }
USEi  = { i, j, c[i], x[0:N-1:1] }
EUSEi = { i, c[i] }

The loop carried flow dependency set contains i, but it is an induction variable. Both the loop carried anti-dependency set and the loop carried output dependency set contain x[0:N-1:1], which means that the loop cannot be parallelized straightforwardly. The memory area x[0:N-1:1] is, however, included in DDEF and not included in EUSE, meaning that the area is always set before it is used in every iteration i. Therefore the outer loop can be parallelized by privatizing the array x, that is, by allocating a local copy for each thread.
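This is not the code COINS generates; as a hand-written OpenMP sketch only, and assuming x and j are declared as local variables of the enclosing function (x as a plain array, not a pointer), the privatized outer loop could be expressed as:

/* Each thread gets its own copy of x and j, so the loop carried       */
/* anti- and output-dependencies on x disappear; the parallel loop     */
/* index i is private by definition.                                   */
#pragma omp parallel for private(j, x)
for (i = 0; i < N; i++) {
    x[0] = 0;
    x[1] = 10;
    for (j = 2; j < N; j++) {
        x[j] = c[i] * (x[j-1] + x[j-2]) / 2;
        z[i][j] = x[j];
    }
}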
When machine code is generated, the parallelized code calls run-time routines that control parallel execution and that perform the following operations: (1) initiate, (2) fork, (3) join, (4) get self Id, (5) terminate, (6) preprocess for do-all loop, (7) postprocess for do-all loop. Figure 6.3 shows some of these functions.
int coinsThreadInit( .... );
int coinsThreadEnd();
int coinsThreadFork( .... );
int coinsThreadJoin( .... );
int coinsThreadSelfId();
void coinsThreadPreprocessForDoAllLoop ( .... );
void coinsThreadPostprocess( .... );
int coinsThreadForkForDoAll( .... );
void coinsThreadPreprocessForDoAllThread();
void coinsThreadPostprocessForDoAllThread();
......

Figure 6.3 Run time routines for parallelization
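The parameter lists abbreviated with "...." above depend on the execution environment and are left to the user. As a purely hypothetical sketch (the signatures below are assumptions, not the real COINS interface), a POSIX-threads environment might implement a fork/join pair roughly as follows:

#include <pthread.h>

/* Hypothetical thread table for at most 16 worker threads. */
static pthread_t coinsThreadTable[16];

/* Hypothetical fork: start function 'body' as worker 'id'. */
int coinsThreadFork(int id, void *(*body)(void *), void *arg)
{
    return pthread_create(&coinsThreadTable[id], NULL, body, arg);
}

/* Hypothetical join: wait until worker 'id' finishes. */
int coinsThreadJoin(int id)
{
    return pthread_join(coinsThreadTable[id], NULL);
}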
Figure 6.4 shows an example program to be parallelized, with pragmas specifying the candidates for parallelization:

#pragma optControl functionsWithoutSideEffect distance
#pragma parallel doAllFunc main
#ifndef MAIN
#define MAIN
#endif
#include "coinsParallelFramework.h"
#include <math.h>
double distance(double px[], double py[], int pi);
double setData(double px[], double py[], int pn);
int printf(char*, ...);
double x[10000], y[10000];
int main() {
  int i, n = 720;
  double t, sum = 0.0;
#pragma parallel init
  setData(x, y, n);
#pragma parallel doAll
  for (i = 0; i < n; i++) {
    sum = sum + distance(x, y, i);
  }
  printf("\n ans = %f \n", sum);
#pragma parallel end
  return 0;
}

Figure 6.4 Program to be parallelized
The for-loop marked for do-all parallelization is transformed into the following sequence:

(1) Call coinsThreadPreprocessForDoAllLoop, which does preprocessing for parallel execution and distributes value ranges of the induction variable to threads according to the execution environment.
(2) Prepare the parameters to be passed to each thread.
(3) Invoke the threads to be executed in parallel.
(4) Execute a loop statement that performs the operations for one value range of the induction variable, to be executed as the master thread.
(5) Call coinsThreadPostprocess, which does post processing and the join operation for the threads in parallel execution.
(6) Synthesize the reduction values from the values computed by the executed threads.

The program in Figure 6.5 is the function corresponding to the above for-loop, transformed for parallelization. It is generated from the parallelized HIR by the HIR-to-C module of COINS and reformatted here to improve readability, although COINS can also generate machine code from the parallelized HIR. Its parameters _indexFrom_0 and _indexTo_0 specify the value range of the loop index i, and sum_0 and _voidPtr_0 specify up to two write-back places for reduction variables. The last parameter is not used in this case because there is only one reduction variable, sum.
void *main_loop_0( int _threadId_0, long _indexFrom_0,
     long _indexTo_0, double *sum_0, void *_voidPtr_0 )
{
  int i, n;
  double sum;
  coinsThreadPreprocessForDoAllThread();
  sum = 0.0;
  { i = 0; }
  for (i = _indexFrom_0; i < _indexTo_0; i = i + 1) {
    sum = sum + (& distance)(x, y, i);
    _lab29:;
  }
  _lab33:;
  *sum_0 = sum;
  coinsThreadPostprocessForDoAllThread();
}

Figure 6.5 Function to be executed by threads
#pragma parallel forceDoAll (reduction ("max" max) ("min" min))
max = x[0];
min = x[0];
for (i = 0; i < n; i++) {
  if (x[i] > max)
    max = x[i];
  if (x[i] < min)
    min = x[i];
}

Figure 6.6 Example of forced parallelization
void * main_loop_1( int _threadId_0, long _indexFrom_0,
     long _indexTo_0, int * min_0, int * max_0 )
{
  int i, max, min;
  coinsThreadPreprocessForDoAllThread();
  min = 0; max = 0;
  min = _min_global_0; max = _max_global_0;
  { i = (int )0; }
  for (i = _indexFrom_0; i < _indexTo_0; i = i + 1) {
    if (_x_global_1[i] > max) {
      _lab77:; { max = _x_global_1[i]; }
    }
    if (_x_global_1[i] < min) {
      _lab80:; { min = _x_global_1[i]; }
    }
    _lab83:;
  }
  _lab87:;
  *min_0 = min; *max_0 = max;
  coinsThreadPostprocessForDoAllThread();
}

Figure 6.7 Function to be executed by threads
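Step (6) above, synthesizing the reduction values, is performed by the master after the join. As an illustration only (our own sketch; the names threadMin, threadMax, and numThreads are hypothetical), the per-thread results written back through min_0 and max_0 would be combined like this:

/* Hypothetical synthesis of reduction values in the master thread:    */
/* threadMin[t] and threadMax[t] hold the values each worker wrote     */
/* back through its min_0 and max_0 parameters.                        */
for (t = 0; t < numThreads; t++) {
    if (threadMax[t] > max) max = threadMax[t];
    if (threadMin[t] < min) min = threadMin[t];
}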
The pragma

#pragma parallel doAllFunc func1 func2 ...

designates the functions func1, func2, ... as candidates for parallelization.
The pragma

#pragma parallel doAllFuncAll

designates all functions in the given input program as candidates for parallelization.
To designate an individual loop, the pragma

#pragma parallel doAll

should be inserted. The loop encountered first after this pragma is treated as a candidate for parallelization.
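For example (an illustrative fragment of our own; the function name compute and its parameters are hypothetical), a candidate function and a loop inside it would be marked as follows, with doAllFunc placed near the top of the file as in Figure 6.4:

#pragma parallel doAllFunc compute   /* the function compute is a candidate */
void compute(double a[], double b[], int n)
{
    int i;
#pragma parallel doAll               /* the following loop is a candidate   */
    for (i = 0; i < n; i++) {
        a[i] = a[i] + b[i];
    }
}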
The command

java coins.driver.Driver -coins:parallelDoAll=OpenMP foo.c

generates a C program with pragmas for OpenMP, synthesizing the output module name by adding "-loop" to the input file name (foo-loop.c in this case). Users can execute the generated C program in parallel by giving it to some OpenMP compiler.
The command

java coins.driver.Driver -S -coins:parallelDoAll=n foo.c

generates an assembly language program, where n is an integer indicating the maximum degree of parallelization. The assembly language program can be executed in parallel by linking it with run-time routines for parallelization. These run-time routines should be provided for the corresponding execution environment according to the parallelizing framework described in the previous section (Framework for parallel execution).
Similarly, the command

java coins.driver.Driver -S -coins:parallelDoAll=n,hir2c foo.c

produces both an assembly language program and a C program that can be executed in parallel by linking with the run-time routines for parallelization.
The command

java coins.lparallel.LoopPara foo.c

invokes LoopPara, a subclass of the COINS driver coins.driver.Driver. It passes the generated C program with OpenMP pragmas to an OpenMP compiler named Omni (OpenMP Compiler Omni) and generates an executable module.
The CoCo analyzes an input C program and transforms it into a macro (coarse-grain) task graph with data and control flow dependences. Then the CoCo parallelizes the macro tasks by using OpenMP directives for SMP machines. This analysis and transformation are carried out on the COINS HIR (High-level Intermediate Representation). The CoCo generates a parallel program in HIR containing OpenMP directives as comments. The HIR program is translated into a C program with OpenMP directives by the HIR-to-C translator. Finally, it is compiled by the Omni-OpenMP compiler and then executed in parallel on an SMP machine.
Macro tasks correspond to basic blocks, loops, and/or subroutines. After an input C program is divided into macro tasks, the execution starting condition of each macro task is analyzed. The execution starting condition represents whether the macro task can be executed at a certain time. A runtime macro task scheduler evaluates the execution starting condition of each macro task at execution time and dynamically assigns an executable macro task to a lightly loaded processor of the SMP machine.
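A highly simplified sketch (our own, not the CoCo implementation; all names are hypothetical) of what such a runtime scheduler does: it repeatedly picks a macro task whose execution starting condition has become true and runs it.

/* Hypothetical data structures: each macro task carries a function    */
/* pointer for its body and a predicate for its execution starting     */
/* condition; 'done' marks tasks that have finished.                   */
typedef struct {
    int  (*startCondition)(void);   /* boolean expression over task states */
    void (*body)(void);             /* the macro task itself               */
    int  done;
} MacroTask;

void scheduleMacroTasks(MacroTask tasks[], int numTasks)
{
    int remaining = numTasks;
    while (remaining > 0) {
        for (int t = 0; t < numTasks; t++) {
            /* A task whose starting condition holds would be handed to an  */
            /* idle (lightly loaded) processor; here it simply runs in place. */
            if (!tasks[t].done && tasks[t].startCondition()) {
                tasks[t].body();
                tasks[t].done = 1;
                remaining--;
            }
        }
    }
}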
The coarse-grain parallelizing module is a tool set, which consists of the following functions:
An execution starting condition is represented as a boolean expression whose operators are logical AND and logical OR. The operand conditions are as follows:
To obtain a coarse-grain parallel program, proceed as follows:
java -classpath ./classes Driver -coins:cgParallel xxx.c

The generated OpenMP/C file will have the suffix "cgparallel", for example xxx-cgparallel.c for the source C file xxx.c.
omcc xxx-cgparallel.c xxx-sch.c
2000 : Output general debug information of the module.