DEC(HP) Motion Video Instruction Extensions for Alpha MIPS MIPS 3D SUN Visual Instruction Set Motorola Velocity Engine (AltiVec) Intel MMX,SSE,SSE2,SSE3Currently, utilization of SIMD instructions through compilers are very limited. In many cases, programmers must resort to use assembly languages or intrinsic routines when he is to exploit SIMD instruction sets.
We have developed program code analysis and transformation methods which enables the compiler to generate efficient code including SIMD instructions for those applications shown above, and have implemented it in our infrastructure. We call such transformation as SIMD parallelization or SIMD optimization.
To exploit features of SIMD instruction sets which have generally low parallelism but low execution cost, schedulings of instructions including conventional ones are needed. And the compiler must analyse dependencies between instructions precisely.
Then, we made precise and systematic description of functions of the instructions including conventional ones for following instruction set architecture. Really, they describe precise behaviour of instruction execution such as change of condition flags and behaviour on overflow.
For example, "addcc" instruction of SPARC v8 is described as Fig.9-1.
;; == Add and modify icc == ;; addcc reg, reg_or_imm, reg (DEFINST ("addcc ?1r, ??2R, ?3r" "addcc ?1r, ??2R, ?3r") (PARALLEL (SET (REG I32 (HOLE 3 IREG_D)) (ADD I32 (REG I32 (HOLE 1 IREG)) (HOLE 2))) (SET (REG I1 icc_n) (TSTLTS I1 (ADD I32 (REG I32 (HOLE 1 IREG)) (HOLE 2)) (INTCONST I32 0))) (SET (REG I1 icc_z) (TSTEQ I1 (ADD I32 (REG I32 (HOLE 1 IREG)) (HOLE 2)) (INTCONST I32 0))) (SET (REG I1 icc_v) (OVERFLOW I1 (REG I32 (HOLE 1 IREG)) (HOLE 2) (INTCONST I1 0))) (SET (REG I1 icc_c) (CARRY I1 (REG I32 (HOLE 1 IREG)) (HOLE 2) (INTCONST I1 0))) ) ) ;; in case of destination=g0 (DEFINST ("addcc ?1r, ??2R, %g0" "addcc ?1r, ??2R, %g0") (PARALLEL (SET (REG I1 icc_n) (TSTLTS I1 (ADD I32 (REG I32 (HOLE 1 IREG)) (HOLE 2)) (INTCONST I32 0))) (SET (REG I1 icc_z) (TSTEQ I1 (ADD I32 (REG I32 (HOLE 1 IREG)) (HOLE 2)) (INTCONST I32 0))) (SET (REG I1 icc_v) (OVERFLOW I1 (REG I32 (HOLE 1 IREG)) (HOLE 2) (INTCONST I1 0))) (SET (REG I1 icc_c) (CARRY I1 (REG I32 (HOLE 1 IREG)) (HOLE 2) (INTCONST I1 0))) ) )
Instruction Set | Lines | Bytes |
---|---|---|
SPARC Version 8 | 4044 | 100133 |
SPARC Version 9 (including VIS ) | 9045 | 245017 |
IA-32(including MMX/ SSE/ SSE2/ Enhanced 3DNow!) | 31090 | 1276309 |
PowerPC (including AltiVec) | 20796 | 827171 |
We assume that comparing operations in SIMD instruction set produce all 1 bit patterns for true or all 0 patterns for false, in other words, they produce bit-mask for the result as illustrated in second example of Fig.9-2. Most of SIMD instruction sets use this manner. Our if-conversion algorithm exploits this feature. This semantics for boolean value differs from that of original LIR, then we use it only in the SIMD optimization phase.
We decided in this project to show step-wise mile stones for SIMD compiler optimization, and following items for COINS SIMD optimization.
While vectorization and complex if-conversion are important items in SIMD optimization for real source code, most of them can be written in source code level. Such transformations and optimizations should be done in HIR level under the policy of COINS. Currently, they are separated and postponed to the future work.
SIMD optimizer decides do or not do of SIMD optimization for each sub-program (or function) automatically. This decision should be done in more fine grain such as loops, and it is also a future work.
For example, current COINS cannot emit SIMD instructions for case-A and -B of Fig.9-3, but can do for case-C, -D, and -E. Compilers which connects SIMD optimization to vector operations may not be able to generate SIMD instructions for case-E.
#define AVE(x,y) (((x)>>1)+((y)>>1)+(((x)|(y))&1)) struct { short r, g, b, a; } *u1, *u2, *u3; short *v1, *v2, *v3; /* Assume that all pointers are aligned, and distances of source and destination pointers are longer than the size of vector register. */ for (i = 0; i < M; i++) // case-A *v1++ = AVE(*v2++, *v3++); for (i = 0; i < M; i++) // case-B v1[i] = AVE(v2[i], v3[i]); for (i = 0; i < M; i += 4) { // case-C v1[i] = AVE(v2[i], v3[i]); v1[i+1] = AVE(v2[i+1], v3[i+1]); ... v1[i+3] = AVE(v2[i+3], v3[i+3]); } for (i = 0; i < M; i += 4) { // case-D v1[0] = AVE(v2[0], v3[0]); v1[1] = AVE(v2[1], v3[1]); ... v1[3] = AVE(v2[3], v3[3]); v1+=4; v2+=4; v3+=4; } for (i = 0; i < M; i++) { // case-E u1[i].r = AVE(u2[i].r, u3[i].r); u1[i].g = AVE(u2[i].g, u3[i].g); u1[i].b = AVE(u2[i].b, u3[i].b); u1[i].a = AVE(u2[i].a, u3[i].a); }
static unsigned char sa,sr,sg,sb; static short da,dr,dg,db; static short k; static void hLineRight(unsigned char *p,int n, unsigned char a,short ea, unsigned char r,short er, unsigned char g,short eg, unsigned char b,short eb) { while(n!=0) { *p++=b; *p++=g; *p++=r; *p++=a; a+=sa; r+=sr; g+=sg; b+=sb; if((ea+=da)>=0) { a++; ea-=k; } if((er+=dr)>=0) { r++; er-=k; } if((eg+=dg)>=0) { g++; eg-=k; } if((eb+=db)>=0) { b++; eb-=k; } --n; } }
Loop kernel of the example is transformed into the LIR in Fig.9-5.
(SET (REG I8 t0) (ADD I8 (REG I8 t0) (REG I8 t4))) ; a+=sa; (SET (REG I8 t1) (ADD I8 (REG I8 t1) (REG I8 t5))) ; r+=sr; (SET (REG I8 t2) (ADD I8 (REG I8 t2) (REG I8 t6))) ; g+=sg; (SET (REG I8 t3) (ADD I8 (REG I8 t3) (REG I8 t7))) ; b+=sb; (SET (REG I16 t8) (ADD I16 (REG I16 t8) (REG I16 t12))) ; ea+=da; (JUMP2 (TSTGE I1 (REG I16 t8) (INTCONST I16 0)) (LABEL L0) (LABEL L4)) ; if() (LABEL L0) (SET (REG I8 t0) (ADD I8 (REG I8 t0) (INTCONST I8 1))) ; a++; (SET (REG I16 t8) (SUB I16 (REG I16 t8) (REG I16 t28))) ; ea-=k; (LABEL 4) (SET (REG I16 t9) (ADD I16 (REG I16 t9) (REG I16 t13))) (JUMP2 (TSTGE I1 (REG I16 t9) (INTCONST I16 0)) (LABEL L1) (LABEL L5)) (LABEL L1) (SET (REG I8 t1) (ADD I8 (REG I8 t1) (INTCONST I8 1))) (SET (REG I16 t9) (SUB I16 (REG I16 t9) (REG I16 t28))) (LABEL 5) (SET (REG I16 t10) (ADD I16 (REG I16 t10) (REG I16 t14))) (JUMP2 (TSTGE I1 (REG I16 t10) (INTCONST I16 0)) (LABEL L2) (LABEL L6)) (LABEL L2) (SET (REG I8 t2) (ADD I8 (REG I8 t2) (INTCONST I8 1))) (SET (REG I16 t10) (SUB I16 (REG I16 t10) (REG I16 t28))) (LABEL 6) (SET (REG I16 t11) (ADD I16 (REG I16 t11) (REG I16 t15))) (JUMP2 (TSTGE I1 (REG I16 t11) (INTCONST I16 0)) (LABEL L3) (LABEL L7)) (LABEL L3) (SET (REG I8 t3) (ADD I8 (REG I8 t3) (INTCONST I8 1))) (SET (REG I16 t11) (SUB I16 (REG I16 t11) (REG I16 t28))) (LABEL 7)
SIMD parallelization consists of following steps:
(JUMP2 (TSTGE I1 (REG I16 t8) (INTCONST I16 0)) (LABEL L0) (LABEL L4)) (LABEL L0) (SET (REG I8 t0) (ADD I8 (REG I8 t0) (INTCONST I8 1))) (SET (REG I16 t8) (SUB I16 (REG I16 t8) (REG I16 t28))) (LABEL 4)is if-converted to
(SET (REG I16 t16) (TSTGE I16 (REG I16 t8) (INTCONST I16 0))) (SET (REG I8 t0) (ADD I8 (REG I8 t0) (NEG I8 (CONVIT I8 (REG I16 t16))))) ; exploits return value -1/0 of TSTGE (SET (REG I16 t8) (SUB I16 (REG I16 t8) (BAND I16 (REG I16 t28) (REG I16 t16))))
Our if-conversion exploits the result of comparison operator as not only masks but also values of -1/0. Then-part and else-part of the if-statements must be single basic-blocks to be the candidates for if-converted as shown in (a) of Fig.9-7. If statement of (b) can also be if-converted when if-statement surrounded by the box with dashed line can be if-converted. The optimizer has a constant (= 2) for levels to stop such conversion for nested if-statements.
(SET (REG I16 t16) (TSTGE I16 (REG I16 t8) (INTCONST I16 0))) (SET (REG I8 t32) (CONVIT I8 (REG I16 t16))) (SET (REG I8 t0) (ADD I8 (REG I8 t0) (NEG I8 (REG I8 t32)))) ;matched to SUB (SET (REG I16 t31) (BAND I16 (REG I16 t28) (REG I16 t16))) (SET (REG I16 t8) (SUB I16 (REG I16 t8) (REG I16 t31)))
; Get maximum number (BOR I8 (BAND I8 (HOLE 1 I8) (TSTGES I8 (HOLE 1 I8) (HOLE 2 I8))) (BAND I8 (HOLE 2 I8) (BNOT I8 (TSTGES I8 (HOLE 1 I8) (HOLE 2 I8))))) ; Get average (RSHS I8 (ADD I8 (ADD I8 (HOLE 1 I8) (HOLE 2 I8)) (INTCONST I8 1)) (INTCONST I8 1))
(PARALLEL (SET (REG I8 t0) (ADD I8 (REG I8 t0) (REG I8 t4))) (SET (REG I8 t1) (ADD I8 (REG I8 t1) (REG I8 t5))) (SET (REG I8 t2) (ADD I8 (REG I8 t2) (REG I8 t6))) (SET (REG I8 t3) (ADD I8 (REG I8 t3) (REG I8 t7)))) (PARALLEL (SET (REG I16 t8) (ADD I16 (REG I16 t8) (REG I16 t12))) (SET (REG I16 t9) (ADD I16 (REG I16 t9) (REG I16 t13))) (SET (REG I16 t10) (ADD I16 (REG I16 t10) (REG I16 t14))) (SET (REG I16 t11) (ADD I16 (REG I16 t11) (REG I16 t15)))) (PARALLEL (SET (REG I16 t16) (TSTGE I16 (REG I16 t8) (INTCONST I16 0))) (SET (REG I16 t17) (TSTGE I16 (REG I16 t9) (INTCONST I16 0))) (SET (REG I16 t18) (TSTGE I16 (REG I16 t10) (INTCONST I16 0))) (SET (REG I16 t19) (TSTGE I16 (REG I16 t11) (INTCONST I16 0)))) (PARALLEL (SET (REG I8 t32) (CONVIT I8 (REG I16 t16))) (SET (REG I8 t34) (CONVIT I8 (REG I16 t17))) (SET (REG I8 t36) (CONVIT I8 (REG I16 t18))) (SET (REG I8 t38) (CONVIT I8 (REG I16 t19)))) (PARALLEL (SET (REG I8 t0) (SUB I8 (REG I8 t0) (REG I8 t32))) (SET (REG I8 t1) (SUB I8 (REG I8 t1) (REG I8 t34))) (SET (REG I8 t2) (SUB I8 (REG I8 t2) (REG I8 t36))) (SET (REG I8 t3) (SUB I8 (REG I8 t3) (REG I8 t38)))) (PARALLEL (SET (REG I16 t31) (BAND I16 (REG I16 t28) (REG I16 t16))) (SET (REG I16 t33) (BAND I16 (REG I16 t28) (REG I16 t17))) (SET (REG I16 t35) (BAND I16 (REG I16 t28) (REG I16 t18))) (SET (REG I16 t37) (BAND I16 (REG I16 t28) (REG I16 t19)))) (PARALLEL (SET (REG I16 t8) (SUB I16 (REG I16 t8) (REG I16 t31))) (SET (REG I16 t9) (SUB I16 (REG I16 t9) (REG I16 t33))) (SET (REG I16 t10) (SUB I16 (REG I16 t10) (REG I16 t35))) (SET (REG I16 t11) (SUB I16 (REG I16 t11) (REG I16 t37))))
(PARALLEL (SET (SUBREG I8 (REG I32 m0) 0) (ADD I8 (SUBREG I8 (REG I32 m0) 0) (SUBREG I8 (REG I32 m1) 0))) (SET (SUBREG I8 (REG I32 m0) 1) (ADD I8 (SUBREG I8 (REG I32 m0) 1) (SUBREG I8 (REG I32 m1) 1))) (SET (SUBREG I8 (REG I32 m0) 2) (ADD I8 (SUBREG I8 (REG I32 m0) 2) (SUBREG I8 (REG I32 m1) 2))) (SET (SUBREG I8 (REG I32 m0) 3) (ADD I8 (SUBREG I8 (REG I32 m0) 3) (SUBREG I8 (REG I32 m1) 3))) ) (PARALLEL (SET (SUBREG I16 (REG I64 m2) 0) (ADD I16 (SUBREG I16 (REG I64 m2) 0) (SUBREG I16 (REG I64 m3) 0))) (SET (SUBREG I16 (REG I64 m2) 1) (ADD I16 (SUBREG I16 (REG I64 m2) 1) (SUBREG I16 (REG I64 m3) 1))) (SET (SUBREG I16 (REG I64 m2) 2) (ADD I16 (SUBREG I16 (REG I64 m2) 2) (SUBREG I16 (REG I64 m3) 2))) (SET (SUBREG I16 (REG I64 m2) 3) (ADD I16 (SUBREG I16 (REG I64 m2) 3) (SUBREG I16 (REG I64 m3) 3))) ) (PARALLEL (SET (SUBREG I16 (REG I64 m4) 0) (TSTGES I16 (SUBREG I16 (REG I64 m2) 0) (INTCONST I16 0))) (SET (SUBREG I16 (REG I64 m4) 1) (TSTGES I16 (SUBREG I16 (REG I64 m2) 1) (INTCONST I16 0))) (SET (SUBREG I16 (REG I64 m4) 2) (TSTGES I16 (SUBREG I16 (REG I64 m2) 2) (INTCONST I16 0))) (SET (SUBREG I16 (REG I64 m4) 3) (TSTGES I16 (SUBREG I16 (REG I64 m2) 3) (INTCONST I16 0))) ) (PARALLEL (SET (SUBREG I8 (REG I32 m5) 0) (CONVIT I8 (SUBREG I16 (REG I64 m4) 0))) (SET (SUBREG I8 (REG I32 m5) 1) (CONVIT I8 (SUBREG I16 (REG I64 m4) 1))) (SET (SUBREG I8 (REG I32 m5) 2) (CONVIT I8 (SUBREG I16 (REG I64 m4) 2))) (SET (SUBREG I8 (REG I32 m5) 3) (CONVIT I8 (SUBREG I16 (REG I64 m4) 3))) ) (PARALLEL (SET (SUBREG I8 (REG I32 m0) 0) (SUB I8 (SUBREG I8 (REG I32 m0) 0) (SUBREG I8 (REG I32 m5) 0))) (SET (SUBREG I8 (REG I32 m0) 1) (SUB I8 (SUBREG I8 (REG I32 m0) 1) (SUBREG I8 (REG I32 m5) 1))) (SET (SUBREG I8 (REG I32 m0) 2) (SUB I8 (SUBREG I8 (REG I32 m0) 2) (SUBREG I8 (REG I32 m5) 2))) (SET (SUBREG I8 (REG I32 m0) 3) (SUB I8 (SUBREG I8 (REG I32 m0) 3) (SUBREG I8 (REG I32 m5) 3))) ) (PARALLEL (SET (SUBREG I16 (REG I64 m4) 0) (BAND I16 (SUBREG I16 (REG I64 m4) 0) (SUBREG (REG I64 m7) 0))) (SET (SUBREG I16 (REG I64 m4) 1) (BAND I16 (SUBREG I16 (REG I64 m4) 1) (SUBREG (REG I64 m7) 1))) (SET (SUBREG I16 (REG I64 m4) 2) (BAND I16 (SUBREG I16 (REG I64 m4) 2) (SUBREG (REG I64 m7) 2))) (SET (SUBREG I16 (REG I64 m4) 3) (BAND I16 (SUBREG I16 (REG I64 m4) 3) (SUBREG (REG I64 m7) 3))) ) (PARALLEL (SET (SUBREG I16 (REG I64 m2) 0) (SUB I16 (SUBREG I16 (REG I64 m2) 0) (SUBREG I16 (REG I64 m4) 0))) (SET (SUBREG I16 (REG I64 m2) 1) (SUB I16 (SUBREG I16 (REG I64 m2) 1) (SUBREG I16 (REG I64 m4) 1))) (SET (SUBREG I16 (REG I64 m2) 2) (SUB I16 (SUBREG I16 (REG I64 m2) 2) (SUBREG I16 (REG I64 m4) 2))) (SET (SUBREG I16 (REG I64 m2) 3) (SUB I16 (SUBREG I16 (REG I64 m2) 3) (SUBREG I16 (REG I64 m4) 3))) )
A set of SET expressions which refers a continuous memory location is the first candidate for combining, and arrange SET expressions so that their source/destination memory references are aligned in a PARALLEL expression. This arrangement decides the order of SET expressions in other PARALLEL expression through define/use chains of registers. It uses BONE patterns for gathering other SET expressions. A BONE represents a SIMD instruction: it consists of operation pattern of target SIMD instruction which is mostly equivalent to the BOP, parallelism of the operation, commutativity between registers, and constraints for registers. We can describe whether the target machine employs 2-address or 3-address in the BONE. In this step, SET expressions combined with a PARALLEL expression must be so moved that the meaning of the program is not changed. with use of control flow information.
movq (%edi),%mm1 punpcklwd (%edi),%mm1 # get lower movq (%esi),%mm2 punpcklwd (%esi),%mm2 movq %mm2,%mm3 psrad $16,%mm1 por %mm1,%mm3 psrad $16,%mm2 pand %mm0,%mm3 # %mm0 == 1(constant) movq %mm2,%mm3 psraw $1,%mm1 por %mm1,%mm3 psraw $1,%mm2 pand %mm0,%mm3 # %mm0 == 1 paddw %mm1,%mm3 psrad $1,%mm1 paddw %mm2,%mm3 psrad $1,%mm2 movq %mm3,(%eax) paddd %mm1,%mm2 paddd %mm3,%mm2 # lower result (a) Programmer expects ... punpckhwd (%edi),%mm4 # get higher punpckhwd (%esi),%mm3 psrad $16,%mm4 psrad $16,%mm3 movq %mm3,%mm1 por %mm4,%mm1 pand %mm0,%mm1 psrad $1,%mm4 psrad $1,%mm3 paddd %mm4,%mm3 paddd %mm1,%mm3 # higher result packssdw %mm3,%mm2 movq %mm2,(%eax) (b) Integral promotion rule is observed ...
Actually, code (b) is about 2.7 times slower than (a) in our experimentation using Pentium-M. We remarked this problem and presented a systematic solution[Suzuki]. This method can determine meaningful bits in each data path, where the word meaningful means that intended bits affect the result of a computation. Operations with constants often yield meaningless bits in storage word. [Stephenson] presented a similar analysis, but it gives only necessary word size. [Fisher] points out the necessity for such analysis, but presented no solution.
type valueRange = record lo, up, sz: integer end; meaningfulBits = integer; {substitution for set of bits} procedure inferNode(top: expression) begin top.valueRange := bottomupAanalysis(top.rightValue); topdownAnalysis( meaningfulBits(top.leftValue), top.rightValue); top.meaningfulBits := meaningfulBits(top.leftValue); end; function bottomupAnalysis(n: expression): valueRange begin n.valueRange := bottomupSchema( n.kind, bottomupAnalysis(n.child_1), bottomupAnalysis(n.child_2), ... bottomupAnalysis(n.child_n);) end; procedure topdownAnalysis(w: meaningfulBits, n: expression) begin foreach (i in n.childs) begin topdownAnalysis( topdownSchema( w, n.child_1.valueRange, n.child_2.valueRange, ... n.child_n.valueRange), n.child_i); end n.lv = n.lv | (w & rangeToMask(n.valueRange)); end;
In default COINS settings, x86simd code generation is switched-off. You must make appropriate modification for src/coins/backend/gen/Makefile to switch-on it.
Furthermore, the benchmark is available for the following purposes.
/* Multiplier table for division */ const unsigned int multipliers[32] ={ 0,32767,16385,10923,8193,6554,5462,4682, 4097,3641,3277,2979,2731,2521,2341,2185, 2049,1928,1821,1725,1639,1561,1490,1425, 1366,1311,1261,1214,1171,1130,1093,1058 }; unsigned int quant5 ( int16_t * coeff, const int16_t * data, const unsigned int quant ) { const unsigned int mult = multipliers[quant]; const unsigned short quant_m_2 = quant << 1; const unsigned short quant_d_2 = quant >> 1; int sum = 0; unsigned int i; for (i = 0; i < M; i++) { int16_t acLevel = data[i]; if (acLevel < 0) { acLevel = (-acLevel) - quant_d_2; if (acLevel < quant_m_2) { coeff[i] = 0; continue; } acLevel = (acLevel * mult) >> 16; // division sum += acLevel; // sum += |acLevel| coeff[i] = -acLevel; } else { acLevel -= quant_d_2; if (acLevel < quant_m_2) { coeff[i] = 0; continue; } acLevel = (acLevel * mult) >> 16; sum += acLevel; coeff[i] = acLevel; } } return sum; }
This example was extracted from a license free implementation MPEG-4 encoder. While current working compilers cannot translate this function to optimized code in which the SIMD instructions are efficiently used, human can do so by hand. There are some gaps between this example and the goal.
This example should be processed in 16 bits (i.e. data size for short) except for the division part which does multiply and right shift 16 bits. In normal language specifications, we cannot instruct it to the compiler. With data size analysis or some strict pattern matching, the compiler can recognize appropriate processing data size. Otherwise, this computation must be done in 32 bits (size of integral type) of data size. It is very hard to generate appropriate SIMD code for the original source code, after all.
We can transform loop kernel of this example to easier form as shown in Fig.9-16. They may be considered as a result of some step in SIMD code generation process, and it is quite possible that there may be several versions derived from the same original coding. In this point of view, each of original coding patterns was transformed by hand yielding multiple versions. They cover wide range of patterns from easily SIMD optimizable level to hardly optimizable level.
/* Transformation 1 */ const unsigned short mult = multipliers[quant]; int16_t acLevel, acLevel2; acLevel = ((data[i] < 0) ? -data[i] : data[i]) - quant_d_2; acLevel2 = (acLevel * mult) >> SCALEBITS; sum += ((acLevel < quant_m_2) ? 0 : acLevel2); coeff[i] = ((acLevel < quant_m_2) ? 0 : ((data[i] < 0) ? -acLevel2 : acLevel2)); /* Transformation 2 */ const unsigned short mult = multipliers[quant]; int16_t acMsk1, acMsk2; int16_t acLevel; acMsk1 = (data[i] < 0) ? -1 : 0; acLevel = ((data[i] & acMsk1)|((-data[i]) & acMsk1)) - quant_d_2; acMsk2 = (acLevel < quant_m_2) ? -1 : 0; acLevel = (acLevel * mult) >> SCALEBITS; sum += acMsk2 & acLevel; coeff[i] = acMsk2 & (((-acLevel) & acMsk1) | (acLevel & (acMsk1)));