11. Structure and Extensions of the Backend Process

11.1. How to Call Backend Processes

The backend processes are called from coins.driver.Driver as follows:
00    /* pass 7 -- Code generation */
01    Root root = new Root(spec, new PrintWriter(System.out, true));
02    String targetName
03      = coinsOptions.getArg(CommandLine.COINS_TARGET_NAME_OPTION);
04    String targetConvention
05      = coinsOptions.getArg(CommandLine.COINS_TARGET_CONVENTION_OPTION);
06    trace.trace(myName, 5000, "target name = " + targetName);
07    Module unit = Module.loadSLir(sexp, targetName, targetConvention, root);
08    makeCSourceFromLir("new", lirToCTimings, unit, io);
09
10    if(spec.getCoinsOptions().isSet("snapshot"))
11      snap.shot(unit,"Generated LIR");
12
13    /* SSA optimization */
14    if (spec.getCoinsOptions().isSet(coins.ssa.OptionName.SSA_OPT)) {
15      unit.apply(new coins.ssa.SsaDriver(unit, io, spec));
16      /* Simple/JumpOpt again */
17      unit.apply(coins.backend.opt.SimpleOpt.trig);
18      unit.apply(coins.backend.opt.JumpOpt.trig);
19    } else {
20      unit.basicOptimization();
21    }
22    if (spec.getCoinsOptions().isSet("simd")) {
23      unit.apply(new coins.simd.SimdDriver(unit, io, spec));
24    }
25    makeCSourceFromLir("opt", lirToCTimings, unit, io);
26
27    if(spec.getCoinsOptions().isSet("snapshot"))
28      snap.shot(unit,"Optimized LIR");
29
30    if(spec.getCoinsOptions().isSet("snapshot"))
31      snap.generateXml();
32
33    unit.generateCode(out);
34
35    if (trace.shouldTrace("Sym", 1)) {
36      trace.trace("Sym", 1, "\nSym after code generation ");
37      symRoot.symTable.printSymTableAllDetail(symRoot.symTableRoot);
38      symRoot.symTableConst.printSymTableDetail();
39    }
The instance root created at line 01 holds the root information of the backend process: the compiler options (spec) and the output assembly file (System.out). The backend processes proper are invoked at lines 07, 20, and 33 as:
Module unit = Module.loadSLir(sexp, targetName, targetConvention, root);
unit.basicOptimization();
unit.generateCode(out);
The instance unit holds the internal representation of the LIR built from the S-expression sexp.

The above three methods, loadSLir, basicOptimization, and generateCode, are composed of several steps, as follows:

loadSLir
Module unit = new Module(sexp, targetName, convention, root);
unit.apply(new String[] {
    "IntroVirReg",
    "EarlyRewriting",
    "+AfterEarlyRewriting",
    "If2Jumpc" });
return unit;
basicOptimization
unit.apply(new String[] {
    "+BeforeBasicOpt",
    "SimpleOpt",
    "JumpOpt",
    "PreHeaders",
    "LoopInversion",
    "+AfterBasicOpt" });
generateCode
unit.apply(new String[] {
    "+BeforeCodeGeneration",
    "LateRewriting",
    "+AfterLateRewriting",
    "ToMachineCode",
    "+AfterToMachineCode",
    "ConvToAsm"  });
For example, the first of these steps,
unit.apply("IntroVirReg");
executes the method doIt in the class coins.backend.opt.IntroVirReg, because the Module constructor has registered the transformer IntroVirReg.trig under the name "IntroVirReg" with the instance root by the following statement:
root.registerTransformer(IntroVirReg.trig);
The execution of doIt in coins.backend.opt.IntroVirReg names frame variables as virtual registers.
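This name-based registration and dispatch can be pictured with a minimal sketch. The names Transformer, registerTransformer, and apply follow the text above; everything else (MiniRoot, the method bodies, the logging) is invented for illustration and is not taken from the COINS sources:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal model of name-based transformer dispatch (illustrative only).
interface Transformer {
    String name();                 // the registration name, e.g. "IntroVirReg"
    void doIt(StringBuilder log);  // the transformation; here it only logs itself
}

class MiniRoot {
    private final Map<String, Transformer> transformers = new HashMap<>();

    // Corresponds to root.registerTransformer(IntroVirReg.trig):
    // the transformer becomes reachable under its own name.
    void registerTransformer(Transformer t) {
        transformers.put(t.name(), t);
    }

    // Corresponds to unit.apply("IntroVirReg"):
    // look the name up and run doIt if a transformer was registered.
    void apply(String name, StringBuilder log) {
        Transformer t = transformers.get(name);
        if (t != null)
            t.doIt(log);
    }
}

public class TransformerDemo {
    public static void main(String[] args) {
        MiniRoot root = new MiniRoot();
        StringBuilder log = new StringBuilder();
        root.registerTransformer(new Transformer() {
            public String name() { return "IntroVirReg"; }
            public void doIt(StringBuilder l) { l.append("IntroVirReg ran"); }
        });
        root.apply("IntroVirReg", log);
        System.out.println(log);  // IntroVirReg ran
    }
}
```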

Furthermore, the execution of

unit.apply("ToMachineCode");
in generateCode causes the execution of the instance method toMachineCode of the class Function, which executes the following steps:
    apply(LiveRange.trig);   // live range analysis
    apply(JumpOpt.trig);   // jump optimization
    apply(JumpCanon.trig);   // canonicalize JUMPC instructions
    apply("+BeforeFirstInstSel");   // call Hook point
    apply(module.targetMachine.instSelTrig);   // instruction selection
    apply("+AfterFirstInstSel");  // call Hook point
    int count = 0;
    for (;;) {
      if (count++ >= REGALOLIMIT)
        throw new CantHappenException("Looks like an infinite loop during Register allocation");
      if (apply(RegisterAllocation.trig))  // register allocation
        break;          //  if success
      apply("+BeforeSecondInstSel");  // call Hook point
      apply(module.targetMachine.instSelTrig);   // instruction selection again
      apply("+AfterSecondInstSel");  // call Hook point
    }
    postProcess(); // eliminate redundant jumps
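The register-allocation loop above follows a common pattern: if allocation fails (for example because spill handling changed the code), instruction selection is rerun and allocation is retried, up to a fixed limit. A minimal sketch of that control structure, with invented names rather than the COINS classes:

```java
import java.util.function.BooleanSupplier;

// Sketch of a bounded retry loop around register allocation
// (illustrative only; the lambdas stand in for the real passes).
public class RetryLoopDemo {
    static final int REGALLOC_LIMIT = 10;  // bound, like REGALOLIMIT above

    // Runs allocation; on failure reruns instruction selection and retries.
    static int allocate(BooleanSupplier registerAllocation, Runnable instSel) {
        int count = 0;
        for (;;) {
            if (count++ >= REGALLOC_LIMIT)
                throw new IllegalStateException("no convergence in register allocation");
            if (registerAllocation.getAsBoolean())
                return count;               // success: report the attempt count
            instSel.run();                  // instruction selection again
        }
    }

    public static void main(String[] args) {
        // Simulated allocator that succeeds on the third attempt.
        int[] tries = {0};
        int attempts = allocate(() -> ++tries[0] >= 3, () -> {});
        System.out.println(attempts);  // 3
    }
}
```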
An argument string prefixed with "+" in the above calls of the apply method is a hook name. Usually such a call does nothing. However, if the method
 Root#addHook(String, Transformer)
has been called with that hook name and a Transformer object, the doIt method of the Transformer object is executed whenever apply is called with the hook name.
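The hook behavior can be sketched as follows. This is an illustrative model, not the actual COINS code: HookRoot is invented, and Runnable stands in for a Transformer's doIt method. The essential point is that applying a hook name with no registered transformer is a no-op:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal model of the "+"-prefixed hook mechanism (illustrative only).
class HookRoot {
    private final Map<String, List<Runnable>> hooks = new HashMap<>();

    // Corresponds to Root#addHook(String, Transformer).
    void addHook(String hookName, Runnable action) {
        hooks.computeIfAbsent(hookName, k -> new ArrayList<>()).add(action);
    }

    // Corresponds to apply("+xxx"): run every action registered under the
    // hook name; if nothing was registered, do nothing at all.
    void apply(String name) {
        for (Runnable r : hooks.getOrDefault(name, List.of()))
            r.run();
    }
}

public class HookDemo {
    public static void main(String[] args) {
        HookRoot root = new HookRoot();
        List<String> trace = new ArrayList<>();
        root.addHook("+AfterToMachineCode", () -> trace.add("scheduler"));

        root.apply("+BeforeCodeGeneration");  // no hook registered: no-op
        root.apply("+AfterToMachineCode");    // runs the registered action
        System.out.println(trace);            // [scheduler]
    }
}
```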

11.2. How to Add a New Process to the Backend

By using the hooking feature described above, a new process can be added to the backend without modifying the backend itself. The class for a new process should have the following form:
class NewProcess {
  static class NewTrigger implements LocalTransformer { // or GlobalTransformer
    public void doIt(...){...}
    ...
  }
  static void attach(CompileSpecification spec, Root root){
//  if (spec.getCoinsOptions().isSet("newprocess")) {
      root.addHook("+xxx", new NewTrigger());
//  }
  }
}
where "+xxx" is a hook name. A LocalTransformer is applied to each Function instance, and a GlobalTransformer is applied to a Module instance.

When the compiler option

-coins:attach=NewProcess
is specified, the NewProcess.attach method is called at the entry of the backend process, and the doIt method of the NewTrigger object is executed at the hook point "+xxx".
If the commented-out if statement is enabled, this execution occurs only when the option
-coins:newprocess
is specified.

11.2.1. Example 1: Instruction Scheduler

The instruction scheduler is called when the following compiler options are specified:
-coins:schedule,attach=coins.backend.sched.Schedule
or
-coins:attach=coins.backend.sched.Schedule
The attach method of coins.backend.sched.Schedule is defined as follows:
public static void attach(CompileSpecification spec, Root root) {
  root.addHook("+AfterToMachineCode", after);
  if (spec.getCoinsOptions().isSet("schedule")) {
    root.addHook("+AfterFirstInstSel", before);
  }
}
Therefore, if the "schedule" option is specified, the instruction scheduler is called twice, by two triggers: the before trigger at the "+AfterFirstInstSel" hook point and the after trigger at the "+AfterToMachineCode" hook point. The former point is before register allocation, and the latter is after register allocation. If the "schedule" option is not specified, the instruction scheduler is called only once, after register allocation.

The two triggers are almost the same, except that the latter eliminates "nop" instructions if the target CPU has delayed branch instructions. Both are written in a machine-independent way and can be applied to any target machine.
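A scheduler of this kind is typically built on list scheduling over a dependence graph: an instruction becomes ready once all instructions it depends on have been emitted. The following is a minimal sketch under that assumption; the data structures and names are invented for illustration and are not the COINS classes:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal list scheduler: emit an instruction only after all of its
// dependence predecessors have been emitted (illustrative sketch only;
// assumes the dependence graph is acyclic).
public class ListScheduler {
    // deps maps each instruction name to the instructions it depends on;
    // the map's iteration order serves as the original instruction order.
    public static List<String> schedule(Map<String, List<String>> deps) {
        List<String> scheduled = new ArrayList<>();
        while (scheduled.size() < deps.size()) {
            for (Map.Entry<String, List<String>> e : deps.entrySet()) {
                if (!scheduled.contains(e.getKey())
                        && scheduled.containsAll(e.getValue())) {
                    scheduled.add(e.getKey());   // instruction is now ready
                }
            }
        }
        return scheduled;
    }

    public static void main(String[] args) {
        // As in the loop body above: fmuls needs both loads, fadds needs fmuls.
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("fadds", List.of("fmuls"));
        deps.put("fmuls", List.of("ld1", "ld2"));
        deps.put("ld1", List.of());
        deps.put("ld2", List.of());
        System.out.println(schedule(deps));  // [ld1, ld2, fmuls, fadds]
    }
}
```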

See Fig.12-2 in 12. Performance of COINS compilers for some results of this optimization.

11.2.2. Example 2: Software Pipelining

Software pipelining is an optimization that can improve loop performance by exploiting instruction-level parallelism. However, the conventional software pipelining method has several problems: it can only be applied to counter-controlled loops; in some cases the loop control instruction must be modified; and it cannot be applied if the number of iterations is smaller than the number of pipeline stages.

Our method solves all these problems. It first schedules the loop control instruction and its dependent instructions, and then schedules the remaining instructions for pipelining.

The software pipelining module is called when the following compiler options are specified:
-coins:schedule,pipelining,attach=coins.backend.sched.Schedule

For example, the following source program

	for (i = 0; i < N; i++)
		for (j = 0; j < M; j++) {
			s = 0;
			for (k = 0; k < L; k++) 
				s += a[i][k] * b[k][j];
			c[i][j] = s;
		}

is software pipelined as follows.

By specifying the SSA-optimizations as

ssa-opt=prun/divex/cse/cstp/hli/osr/hli/cstp/cpyp/preqp/cstp/rpe/dce/srd3
and by specifying the target machine as SPARC, the generated LIR code for the above program looks like the following (in assembly form):
.L4:
	ld	[%i5],%f1       ; %f1 = a[i][k]
	ld	[%i4],%f0	; %f0 = b[k][j]
	fmuls	%f1,%f0,%f0	; %f0 = %f1*%f0 = a[i][k]*b[k][j]
	fadds	%f2,%f0,%f2	; %f2 = s = %f2+%f0 = s + a[i][k]*b[k][j]
	add	%i4,2000,%i4	; %i4 = &b[k][j] + 2000 = &b[k+1][j]
	add	%i5,4,%i5	; %i5 = &a[i][k] + 4 = &a[i][k+1] 
	cmp	%i4,%o2      	; &b[k+1][j] : &b[L][j] 
	bl	.L4		; k+1 < L
	nop
The result of software pipelining for this LIR is as follows:
.L4:	ld	[%i4],%f0
	add	%i4,2000,%i4                               ; P
	cmp	%i4,%o2                                    ; R
	bge	.L7                                        ; O
	ld	[%i5],%f1                                  ; L
	fmuls   %f1,%f0,%f3                                ; O		
	add	%i5,4,%i5	                           ; G
				 ld    [%i4],%f0           ; U
				 add   %i4,2000,%i4        ; E
				 cmp   %i4,%o2
				 bge   .L6
				 ld    [%i5],%f1
----------------------------------------------
.L5:	fadds  %f2,%f3,%f2
				 fmuls  %f1,%f0,%f3                            ; K
				 add    %i5,4,%i5                              ; E
							   ld	 [%i4],%f0     ; R
							   add  %i4,2000,%i4   ; N
							   cmp  %i4,%o2        ; E
							   bl   .L5            ; L
							   ld   [%i5],%f1
----------------------------------------------
.L6:				 fadds %f2,%f3,%f2                             ; EPIL-
.L7:							   fmuls %f1,%f0,%f3   ; OGUE
							   add   %i5,4,%i5
							   fadds %f2,%f3,%f2
In this example, the loop control instruction and its dependent instructions are scheduled first:
	ld	[%i4],%f0     ; stage 0
	add	%i4,2000,%i4  ; stage 0
	cmp	%i4,%o2       ; stage 0
	bl	.L4           ; stage 0
	nop	              ; stage 0

Then, by scheduling the remaining instructions for pipelining, the kernel loop becomes:

	fadds	%f2,%f0,%f2     ; stage 2
	fmuls	%f1,%f0,%f0     ; stage 1
	add	%i5,4,%i5       ; stage 1
	ld	[%i5],%f1       ; stage 0
	ld	[%i4],%f0       ; stage 0
	add	%i4,2000,%i4    ; stage 0
	cmp	%i4,%o2         ; stage 0
	bl	.L4             ; stage 0
	nop	                ; stage 0

In our method, all "stage 0" instructions, including the branch instruction, are placed in the prologue part.

Finally, after the register allocation phase, the nop instruction is replaced with the "ld [%i5],%f1" instruction by instruction scheduling.
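The prologue/kernel/epilogue shape shown in the assembly above can be modeled with a small sketch of the conventional layout: for S stages, prologue copy p issues stages p down to 0, the kernel issues every stage once, and epilogue copy e drains stages S-1 down to e. The types and names below are invented for illustration (our method additionally keeps the stage-0 branch in the prologue, which this conventional sketch does not capture):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of prologue/kernel/epilogue layout of stage-tagged instructions
// in conventional software pipelining (illustrative only).
public class PipelineLayout {
    // stages.get(s) holds the instruction names assigned to stage s.
    public static List<String> prologue(List<List<String>> stages) {
        List<String> out = new ArrayList<>();
        for (int p = 0; p < stages.size() - 1; p++)   // ramp-up copies
            for (int s = p; s >= 0; s--)
                out.addAll(stages.get(s));
        return out;
    }

    public static List<String> kernel(List<List<String>> stages) {
        List<String> out = new ArrayList<>();
        for (int s = stages.size() - 1; s >= 0; s--)  // all stages overlapped
            out.addAll(stages.get(s));
        return out;
    }

    public static List<String> epilogue(List<List<String>> stages) {
        List<String> out = new ArrayList<>();
        for (int e = 1; e < stages.size(); e++)       // drain copies
            for (int s = stages.size() - 1; s >= e; s--)
                out.addAll(stages.get(s));
        return out;
    }

    public static void main(String[] args) {
        // Three stages, abbreviated from the kernel loop shown above.
        List<List<String>> stages = List.of(
            List.of("ld/add/cmp/bl"),   // stage 0
            List.of("fmuls/add"),       // stage 1
            List.of("fadds"));          // stage 2
        System.out.println(prologue(stages)); // [ld/add/cmp/bl, fmuls/add, ld/add/cmp/bl]
        System.out.println(kernel(stages));   // [fadds, fmuls/add, ld/add/cmp/bl]
        System.out.println(epilogue(stages)); // [fadds, fmuls/add, fadds]
    }
}
```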

11.2.3. Example 3: Register Promotion

The process of register promotion is called when the following compiler options are specified:
-coins:regpromote,attach=RegPromote
The class RegPromote is in the package coins.backend.contrib. The name of this special package can be omitted.

Register promotion is a process that promotes static variables to registers within certain loops.

The compiler option

-coins:regpromote-ex
invokes extended register promotion, which promotes more variables to registers based on pointer analysis.
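The effect of the transformation can be pictured at source level with a hand-written Java analogy (this is not COINS output; the field and method names are invented): a static variable accessed in a loop is copied into a local, which the JIT or compiler can keep in a register, and is stored back once after the loop.

```java
// Source-level analogy of register promotion (illustrative only).
public class RegPromoteDemo {
    static int nth;  // a static variable, normally accessed through memory

    // Before promotion: every read and write of nth goes to the static field.
    static int countUnpromoted(int count) {
        nth = 0;
        while (nth < count)
            nth = nth + 1;
        return nth;
    }

    // After promotion: nth lives in a local (i.e. a register) inside the
    // loop and is stored back to the static field only once, after the loop.
    static int countPromoted(int count) {
        int nthReg = 0;          // promoted copy of nth
        while (nthReg < count)
            nthReg = nthReg + 1;
        nth = nthReg;            // single write-back
        return nth;
    }

    public static void main(String[] args) {
        // The two versions compute the same result.
        System.out.println(countUnpromoted(10) == countPromoted(10));  // true
    }
}
```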

Register promotion is especially effective for a program such as:

int candidat, quotient, remaindr, index, nth, primenum, loopend ;
int primeNumbers[SIZE] ;

void getPrime(int primevec[SIZE], int count)
{
 primevec[0] = 2 ;
 nth = 1 ;
 candidat = 3 ;
 while (nth<count) {
   remaindr = 1 ;
   index = 0 ;
   loopend = 0 ;
   while(loopend==0) {
     primenum = primevec[index] ;
     quotient = candidat / primenum ;
     remaindr = candidat - quotient*primenum ;
     if (remaindr==0)
       loopend = 1 ;
     if (quotient*quotient<candidat)
       loopend = 1 ;
     index = index + 1 ;
     if (index>=nth)
       loopend = 1 ;
   }
   if (remaindr != 0) {
     primevec[nth] = candidat ;
     nth = nth + 1 ;
   }
   candidat = candidat + 2 ;
 }
 nth   = 0 ;
 while (nth<count) {
/*   print(primevec[nth]) ;
*/
   nth   = nth + 1 ;
 }
 return;
}
This program is the prime program of Fig.12-2 in chapter 12. Performance of COINS compilers. This register promotion produces almost all of the performance gain for the prime program.