Name of SIMD extension instruction set | Processors implementing it |
---|---|
SUN: VIS(TM) Instruction Set | UltraSPARC-I and later |
Compaq (DEC) Motion: Video Instructions | Alpha 21264 |
Intel: MMX(R) Technology | Pentium Processor with MMX Technology, Pentium II and later |
Intel: Streaming SIMD Extensions | Pentium III and later |
Intel: Streaming SIMD Extensions 2 | Pentium 4 |
AMD: 3DNow!(TM) Technology | K6-2 and later |
AMD: Enhanced 3DNow!(TM) Technology | Athlon, Duron |
Motorola: AltiVec(TM) Technology | MPC7400, MPC7410 (PowerPC G4) |
Hitachi: floating-point graphics enhancement instructions | SH7750 |
MIPS: MIPS V ISA Extension (now part of MIPS64) | R5000 |
MIPS: MDMX (MIPS Digital Media Extension) | MIPS64 5Kc |
MIPS: MIPS-3D(TM) ASE (Application Specific Extension) | MIPS64 R20K, MIPS64 20Kc |
SONY: Emotion Engine core instruction set | PlayStation2 |
;; == Add and modify icc ==
;; addcc reg, reg_or_imm, reg
(DEFINST ("addcc ?1r, ??2R, ?3r" "addcc ?1r, ??2R, ?3r")
  (PARALLEL
    (SET (REG I32 (HOLE 3 IREG_D))
         (ADD I32 (REG I32 (HOLE 1 IREG)) (HOLE 2)))
    (SET (REG I1 icc_n)
         (TSTLTS I1 (ADD I32 (REG I32 (HOLE 1 IREG)) (HOLE 2)) (INTCONST I32 0)))
    (SET (REG I1 icc_z)
         (TSTEQ I1 (ADD I32 (REG I32 (HOLE 1 IREG)) (HOLE 2)) (INTCONST I32 0)))
    (SET (REG I1 icc_v)
         (OVERFLOW I1 (REG I32 (HOLE 1 IREG)) (HOLE 2) (INTCONST I1 0)))
    (SET (REG I1 icc_c)
         (CARRY I1 (REG I32 (HOLE 1 IREG)) (HOLE 2) (INTCONST I1 0)))
  )
)

;; When the destination is g0, only the flags are set
(DEFINST ("addcc ?1r, ??2R, %g0" "addcc ?1r, ??2R, %g0")
  (PARALLEL
    (SET (REG I1 icc_n)
         (TSTLTS I1 (ADD I32 (REG I32 (HOLE 1 IREG)) (HOLE 2)) (INTCONST I32 0)))
    (SET (REG I1 icc_z)
         (TSTEQ I1 (ADD I32 (REG I32 (HOLE 1 IREG)) (HOLE 2)) (INTCONST I32 0)))
    (SET (REG I1 icc_v)
         (OVERFLOW I1 (REG I32 (HOLE 1 IREG)) (HOLE 2) (INTCONST I1 0)))
    (SET (REG I1 icc_c)
         (CARRY I1 (REG I32 (HOLE 1 IREG)) (HOLE 2) (INTCONST I1 0)))
  )
)
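To make the effect of the PARALLEL expression concrete, the following C sketch models what the addcc description appears to specify: one 32-bit sum plus four one-bit condition codes, all derived from the same operands. The names `icc_flags` and `addcc_model` are hypothetical and do not come from COINS. The sketch also assumes that the third operand of OVERFLOW and CARRY, `(INTCONST I1 0)`, is a carry-in of zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the four integer condition codes (icc). */
typedef struct {
    unsigned n;  /* negative */
    unsigned z;  /* zero */
    unsigned v;  /* signed overflow */
    unsigned c;  /* carry out of bit 31 */
} icc_flags;

/* Models "addcc rs1, rs2, rd": returns the 32-bit sum and fills in the
 * flags that the PARALLEL expression assigns alongside the result. */
uint32_t addcc_model(uint32_t rs1, uint32_t rs2, icc_flags *icc)
{
    assert(icc != NULL);
    uint32_t sum = rs1 + rs2;              /* wraps modulo 2^32 */
    icc->n = (sum >> 31) & 1u;             /* TSTLTS: sum < 0 as a signed value */
    icc->z = (sum == 0u);                  /* TSTEQ: sum == 0 */
    /* OVERFLOW: both operands have the same sign and the sum's sign differs. */
    icc->v = ((~(rs1 ^ rs2) & (rs1 ^ sum)) >> 31) & 1u;
    icc->c = (sum < rs1);                  /* CARRY: unsigned wrap-around */
    return sum;
}
```

In this model the second DEFINST, with %g0 as the destination, corresponds to calling `addcc_model` and discarding the returned sum while keeping the flags.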
Instruction set | Lines | Bytes |
---|---|---|
SPARC Version 8 | 4044 | 100133 |
SPARC Version 9 (64-bit, incl. VIS) | 9045 | 245017 |
IA-32 unprivileged instructions (incl. MMX/SSE/SSE2/Enhanced 3DNow!) | 31090 | 1276309 |
PowerPC (incl. AltiVec) | 20796 | 827171 |
These descriptions are located under doc-ja/coins/instDesc in the COINS source archive, together with the test suite and its results. For details, see the comments in each description and README.TXT.
(BOR I8
  (BAND I8 (HOLE 1 I8) (TSTGES I8 (HOLE 1 I8) (HOLE 2 I8)))
  (BAND I8 (HOLE 2 I8) (BNOT I8 (TSTGES I8 (HOLE 1 I8) (HOLE 2 I8)))))

( ((16 8) 1 nil)                ; Bone information
  (SET I8 (HOLE 0 I8)           ; Bone pattern
    (BOR I8
      (BAND I8 (HOLE 1 I8) (TSTGES I8 (HOLE 1 I8) (HOLE 2 I8)))
      (BAND I8 (HOLE 2 I8) (BNOT I8 (TSTGES I8 (HOLE 1 I8) (HOLE 2 I8))))))
)
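The pattern above selects the larger of two signed 8-bit values without branching: the comparison result is used as a bit mask that keeps one operand and clears the other. The following C sketch illustrates this reading, assuming that TSTGES yields all ones when true and zero when false. The function names are hypothetical, and the sketch does not interpret the `((16 8) 1 nil)` Bone information.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Comparison in the style assumed by the pattern: 0xFF when x >= y
 * (signed), 0x00 otherwise. */
uint8_t tstges_mask(int8_t x, int8_t y)
{
    return (x >= y) ? 0xFFu : 0x00u;
}

/* (x AND m) OR (y AND NOT m): selects x where the mask is set, else y.
 * The final cast back to int8_t relies on the usual two's-complement
 * behavior of the implementation. */
int8_t select_max(int8_t x, int8_t y)
{
    uint8_t m = tstges_mask(x, y);
    uint8_t r = (uint8_t)(((uint8_t)x & m) | ((uint8_t)y & (uint8_t)~m));
    return (int8_t)r;
}

/* Applies the pattern to n independent lanes, as a SIMD instruction would. */
void select_max_lanes(const int8_t *x, const int8_t *y, int8_t *out, size_t n)
{
    assert(x != NULL && y != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = select_max(x[i], y[i]);
}
```

Because every lane runs the same straight-line sequence, an expression of this shape is a natural candidate for matching against a packed maximum instruction.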
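The following shows how the source program in Figure 8.2-4 is transformed by each of the steps above. Step 2 above is omitted here. The source program was taken from a computer graphics example.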
static unsigned char sa,sr,sg,sb;
static short da,dr,dg,db;
static short k;

static void hLineRight(unsigned char *p, int n,
                       unsigned char a, short ea,
                       unsigned char r, short er,
                       unsigned char g, short eg,
                       unsigned char b, short eb)
{
    while (n != 0) {
        *p++ = b; *p++ = g; *p++ = r; *p++ = a;
        a += sa; r += sr; g += sg; b += sb;
        if ((ea += da) >= 0) { a++; ea -= k; }
        if ((er += dr) >= 0) { r++; er -= k; }
        if ((eg += dg) >= 0) { g++; eg -= k; }
        if ((eb += db) >= 0) { b++; eb -= k; }
        --n;
    }
}
The colored portion of the program is the target of SIMD parallelization. The LIR corresponding to that portion is shown in Figure 8.2-5.
(SET (REG I8 t0) (ADD I8 (REG I8 t0) (REG I8 t4)))        ; a+=sa;
(SET (REG I8 t1) (ADD I8 (REG I8 t1) (REG I8 t5)))        ; r+=sr;
(SET (REG I8 t2) (ADD I8 (REG I8 t2) (REG I8 t6)))        ; g+=sg;
(SET (REG I8 t3) (ADD I8 (REG I8 t3) (REG I8 t7)))        ; b+=sb;
(SET (REG I16 t8) (ADD I16 (REG I16 t8) (REG I16 t12)))   ; ea+=da;
(JUMP2 (TSTGE I1 (REG I16 t8) (INTCONST I16 0)) (LABEL L0) (LABEL L4)) ; if()
(LABEL L0)
(SET (REG I8 t0) (ADD I8 (REG I8 t0) (INTCONST I8 1)))    ; a++;
(SET (REG I16 t8) (SUB I16 (REG I16 t8) (REG I16 t28)))   ; ea-=k;
(LABEL L4)
(SET (REG I16 t9) (ADD I16 (REG I16 t9) (REG I16 t13)))
(JUMP2 (TSTGE I1 (REG I16 t9) (INTCONST I16 0)) (LABEL L1) (LABEL L5))
(LABEL L1)
(SET (REG I8 t1) (ADD I8 (REG I8 t1) (INTCONST I8 1)))
(SET (REG I16 t9) (SUB I16 (REG I16 t9) (REG I16 t28)))
(LABEL L5)
(SET (REG I16 t10) (ADD I16 (REG I16 t10) (REG I16 t14)))
(JUMP2 (TSTGE I1 (REG I16 t10) (INTCONST I16 0)) (LABEL L2) (LABEL L6))
(LABEL L2)
(SET (REG I8 t2) (ADD I8 (REG I8 t2) (INTCONST I8 1)))
(SET (REG I16 t10) (SUB I16 (REG I16 t10) (REG I16 t28)))
(LABEL L6)
(SET (REG I16 t11) (ADD I16 (REG I16 t11) (REG I16 t15)))
(JUMP2 (TSTGE I1 (REG I16 t11) (INTCONST I16 0)) (LABEL L3) (LABEL L7))
(LABEL L3)
(SET (REG I8 t3) (ADD I8 (REG I8 t3) (INTCONST I8 1)))
(SET (REG I16 t11) (SUB I16 (REG I16 t11) (REG I16 t28)))
(LABEL L7)
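Applying if-conversion to the parts of the blue region in Figure 8.2-5 that correspond to if statements yields Figure 8.2-6. Here a comparison is assumed to return -1, that is, all bits set to 1, when it is true.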
The fragment

(JUMP2 (TSTGE I1 (REG I16 t8) (INTCONST I16 0)) (LABEL L0) (LABEL L4))
(LABEL L0)
(SET (REG I8 t0) (ADD I8 (REG I8 t0) (INTCONST I8 1)))
(SET (REG I16 t8) (SUB I16 (REG I16 t8) (REG I16 t28)))
(LABEL L4)

is if-converted as follows.
(SET (REG I16 t16) (TSTGE I16 (REG I16 t8) (INTCONST I16 0)))
(SET (REG I8 t0)
     (ADD I8 (REG I8 t0) (NEG I8 (CONVIT I8 (REG I16 t16))))) ; exploits return value -1/0 of TSTGE
(SET (REG I16 t8)
     (SUB I16 (REG I16 t8) (BAND I16 (REG I16 t28) (REG I16 t16))))
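The same transformation can be written in C to check that the branch-free form computes what the original if statement does. This is a minimal sketch for one channel, assuming the comparison yields -1 or 0 as stated above. The function names `step_branch` and `step_ifconv` are hypothetical, and both functions include the preceding `ea += da` so that they mirror one full source statement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Branching form, as in the source: if ((ea += da) >= 0) { a++; ea -= k; } */
void step_branch(uint8_t *a, int16_t *ea, int16_t da, int16_t k)
{
    assert(a != NULL && ea != NULL);
    *ea = (int16_t)(*ea + da);
    if (*ea >= 0) {
        *a = (uint8_t)(*a + 1);
        *ea = (int16_t)(*ea - k);
    }
}

/* If-converted form: the comparison yields a mask (-1 when true, 0 when
 * false) that feeds straight-line arithmetic instead of a jump. */
void step_ifconv(uint8_t *a, int16_t *ea, int16_t da, int16_t k)
{
    assert(a != NULL && ea != NULL);
    *ea = (int16_t)(*ea + da);
    int16_t mask = (int16_t)-(*ea >= 0);   /* TSTGE: -1 or 0 */
    int8_t  m8   = (int8_t)mask;           /* CONVIT I8: -1 or 0 */
    *a  = (uint8_t)(*a + (-m8));           /* ADD (NEG m8): adds 1 or 0 */
    *ea = (int16_t)(*ea - (k & mask));     /* SUB (BAND k mask): subtracts k or 0 */
}
```

Since the converted form contains no control flow, the four channels can later be executed side by side by a single packed instruction per operation.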
Next, decomposing the result of this if-conversion into basic operations, called BOPs (Basic Order Patterns), yields Figure 8.2-7.
(SET (REG I16 t16) (TSTGE I16 (REG I16 t8) (INTCONST I16 0)))
(SET (REG I8 t32) (CONVIT I8 (REG I16 t16)))
(SET (REG I8 t0) (ADD I8 (REG I8 t0) (NEG I8 (REG I8 t32)))) ;matched to SUB
(SET (REG I16 t31) (BAND I16 (REG I16 t28) (REG I16 t16)))
(SET (REG I16 t8) (SUB I16 (REG I16 t8) (REG I16 t31)))
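Next, instructions of the same form are gathered together and checked against the Bone table. Wrapping the ones that match in LIR PARALLEL expressions yields Figure 8.2-8.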
(PARALLEL
  (SET (REG I8 t0) (ADD I8 (REG I8 t0) (REG I8 t4)))
  (SET (REG I8 t1) (ADD I8 (REG I8 t1) (REG I8 t5)))
  (SET (REG I8 t2) (ADD I8 (REG I8 t2) (REG I8 t6)))
  (SET (REG I8 t3) (ADD I8 (REG I8 t3) (REG I8 t7))))
(PARALLEL
  (SET (REG I16 t8) (ADD I16 (REG I16 t8) (REG I16 t12)))
  (SET (REG I16 t9) (ADD I16 (REG I16 t9) (REG I16 t13)))
  (SET (REG I16 t10) (ADD I16 (REG I16 t10) (REG I16 t14)))
  (SET (REG I16 t11) (ADD I16 (REG I16 t11) (REG I16 t15))))
(PARALLEL
  (SET (REG I16 t16) (TSTGE I16 (REG I16 t8) (INTCONST I16 0)))
  (SET (REG I16 t17) (TSTGE I16 (REG I16 t9) (INTCONST I16 0)))
  (SET (REG I16 t18) (TSTGE I16 (REG I16 t10) (INTCONST I16 0)))
  (SET (REG I16 t19) (TSTGE I16 (REG I16 t11) (INTCONST I16 0))))
(PARALLEL
  (SET (REG I8 t32) (CONVIT I8 (REG I16 t16)))
  (SET (REG I8 t34) (CONVIT I8 (REG I16 t17)))
  (SET (REG I8 t36) (CONVIT I8 (REG I16 t18)))
  (SET (REG I8 t38) (CONVIT I8 (REG I16 t19))))
(PARALLEL
  (SET (REG I8 t0) (SUB I8 (REG I8 t0) (REG I8 t32)))
  (SET (REG I8 t1) (SUB I8 (REG I8 t1) (REG I8 t34)))
  (SET (REG I8 t2) (SUB I8 (REG I8 t2) (REG I8 t36)))
  (SET (REG I8 t3) (SUB I8 (REG I8 t3) (REG I8 t38))))
(PARALLEL
  (SET (REG I16 t31) (BAND I16 (REG I16 t28) (REG I16 t16)))
  (SET (REG I16 t33) (BAND I16 (REG I16 t28) (REG I16 t17)))
  (SET (REG I16 t35) (BAND I16 (REG I16 t28) (REG I16 t18)))
  (SET (REG I16 t37) (BAND I16 (REG I16 t28) (REG I16 t19))))
(PARALLEL
  (SET (REG I16 t8) (SUB I16 (REG I16 t8) (REG I16 t31)))
  (SET (REG I16 t9) (SUB I16 (REG I16 t9) (REG I16 t33)))
  (SET (REG I16 t10) (SUB I16 (REG I16 t10) (REG I16 t35)))
  (SET (REG I16 t11) (SUB I16 (REG I16 t11) (REG I16 t37))))
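As an illustration of the grouping, the seven PARALLEL expressions can be read as seven loops over four independent lanes, one lane per color channel. The C sketch below makes that reading explicit. The `lane_state` structure and `step_lanes` function are hypothetical, introduced only to give the temporaries t0 to t15 and t28 readable names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LANES = 4 };  /* one lane per channel: a, r, g, b */

/* Hypothetical grouping of the scalar variables into per-channel arrays. */
typedef struct {
    uint8_t c[LANES];   /* a, r, g, b       (t0..t3)   */
    uint8_t s[LANES];   /* sa, sr, sg, sb   (t4..t7)   */
    int16_t e[LANES];   /* ea, er, eg, eb   (t8..t11)  */
    int16_t d[LANES];   /* da, dr, dg, db   (t12..t15) */
    int16_t k;          /* shared by all lanes (t28)   */
} lane_state;

/* One loop iteration; each inner loop mirrors one PARALLEL group. */
void step_lanes(lane_state *st)
{
    assert(st != NULL);
    int16_t mask[LANES], sub[LANES];
    int8_t  m8[LANES];
    int i;
    for (i = 0; i < LANES; i++) st->c[i] = (uint8_t)(st->c[i] + st->s[i]);
    for (i = 0; i < LANES; i++) st->e[i] = (int16_t)(st->e[i] + st->d[i]);
    for (i = 0; i < LANES; i++) mask[i] = (int16_t)-(st->e[i] >= 0);
    for (i = 0; i < LANES; i++) m8[i] = (int8_t)mask[i];
    for (i = 0; i < LANES; i++) st->c[i] = (uint8_t)(st->c[i] - m8[i]);
    for (i = 0; i < LANES; i++) sub[i] = (int16_t)(st->k & mask[i]);
    for (i = 0; i < LANES; i++) st->e[i] = (int16_t)(st->e[i] - sub[i]);
}
```

Each inner loop has no dependence between its iterations, which is the property that allows it to be replaced by one packed instruction once the lanes are placed in a single wide register.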
When these expressions match the entries provided in the TMD instruction description, the result is as shown in Figure 8.2-9.
(PARALLEL
  (SET (SUBREG I8 (REG I32 m0) 0) (ADD I8 (SUBREG I8 (REG I32 m0) 0) (SUBREG I8 (REG I32 m1) 0)))
  (SET (SUBREG I8 (REG I32 m0) 1) (ADD I8 (SUBREG I8 (REG I32 m0) 1) (SUBREG I8 (REG I32 m1) 1)))
  (SET (SUBREG I8 (REG I32 m0) 2) (ADD I8 (SUBREG I8 (REG I32 m0) 2) (SUBREG I8 (REG I32 m1) 2)))
  (SET (SUBREG I8 (REG I32 m0) 3) (ADD I8 (SUBREG I8 (REG I32 m0) 3) (SUBREG I8 (REG I32 m1) 3)))
)
(PARALLEL
  (SET (SUBREG I16 (REG I64 m2) 0) (ADD I16 (SUBREG I16 (REG I64 m2) 0) (SUBREG I16 (REG I64 m3) 0)))
  (SET (SUBREG I16 (REG I64 m2) 1) (ADD I16 (SUBREG I16 (REG I64 m2) 1) (SUBREG I16 (REG I64 m3) 1)))
  (SET (SUBREG I16 (REG I64 m2) 2) (ADD I16 (SUBREG I16 (REG I64 m2) 2) (SUBREG I16 (REG I64 m3) 2)))
  (SET (SUBREG I16 (REG I64 m2) 3) (ADD I16 (SUBREG I16 (REG I64 m2) 3) (SUBREG I16 (REG I64 m3) 3)))
)
(PARALLEL
  (SET (SUBREG I16 (REG I64 m4) 0) (TSTGES I16 (SUBREG I16 (REG I64 m2) 0) (INTCONST I16 0)))
  (SET (SUBREG I16 (REG I64 m4) 1) (TSTGES I16 (SUBREG I16 (REG I64 m2) 1) (INTCONST I16 0)))
  (SET (SUBREG I16 (REG I64 m4) 2) (TSTGES I16 (SUBREG I16 (REG I64 m2) 2) (INTCONST I16 0)))
  (SET (SUBREG I16 (REG I64 m4) 3) (TSTGES I16 (SUBREG I16 (REG I64 m2) 3) (INTCONST I16 0)))
)
(PARALLEL
  (SET (SUBREG I8 (REG I32 m5) 0) (CONVIT I8 (SUBREG I16 (REG I64 m4) 0)))
  (SET (SUBREG I8 (REG I32 m5) 1) (CONVIT I8 (SUBREG I16 (REG I64 m4) 1)))
  (SET (SUBREG I8 (REG I32 m5) 2) (CONVIT I8 (SUBREG I16 (REG I64 m4) 2)))
  (SET (SUBREG I8 (REG I32 m5) 3) (CONVIT I8 (SUBREG I16 (REG I64 m4) 3)))
)
(PARALLEL
  (SET (SUBREG I8 (REG I32 m0) 0) (SUB I8 (SUBREG I8 (REG I32 m0) 0) (SUBREG I8 (REG I32 m5) 0)))
  (SET (SUBREG I8 (REG I32 m0) 1) (SUB I8 (SUBREG I8 (REG I32 m0) 1) (SUBREG I8 (REG I32 m5) 1)))
  (SET (SUBREG I8 (REG I32 m0) 2) (SUB I8 (SUBREG I8 (REG I32 m0) 2) (SUBREG I8 (REG I32 m5) 2)))
  (SET (SUBREG I8 (REG I32 m0) 3) (SUB I8 (SUBREG I8 (REG I32 m0) 3) (SUBREG I8 (REG I32 m5) 3)))
)
(PARALLEL
  (SET (SUBREG I16 (REG I64 m4) 0) (BAND I16 (SUBREG I16 (REG I64 m4) 0) (SUBREG I16 (REG I64 m7) 0)))
  (SET (SUBREG I16 (REG I64 m4) 1) (BAND I16 (SUBREG I16 (REG I64 m4) 1) (SUBREG I16 (REG I64 m7) 1)))
  (SET (SUBREG I16 (REG I64 m4) 2) (BAND I16 (SUBREG I16 (REG I64 m4) 2) (SUBREG I16 (REG I64 m7) 2)))
  (SET (SUBREG I16 (REG I64 m4) 3) (BAND I16 (SUBREG I16 (REG I64 m4) 3) (SUBREG I16 (REG I64 m7) 3)))
)
(PARALLEL
  (SET (SUBREG I16 (REG I64 m2) 0) (SUB I16 (SUBREG I16 (REG I64 m2) 0) (SUBREG I16 (REG I64 m4) 0)))
  (SET (SUBREG I16 (REG I64 m2) 1) (SUB I16 (SUBREG I16 (REG I64 m2) 1) (SUBREG I16 (REG I64 m4) 1)))
  (SET (SUBREG I16 (REG I64 m2) 2) (SUB I16 (SUBREG I16 (REG I64 m2) 2) (SUBREG I16 (REG I64 m4) 2)))
  (SET (SUBREG I16 (REG I64 m2) 3) (SUB I16 (SUBREG I16 (REG I64 m2) 3) (SUBREG I16 (REG I64 m4) 3)))
)
The LIR in Figure 8.2-9 corresponds to the following SIMD instructions. This example uses the IA-32/MMX instruction set.
paddb   %mm1,%mm0
paddw   %mm3,%mm2
pxor    %mm4,%mm4
pcmpgtw %mm2,%mm4
...
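For readers unfamiliar with packed arithmetic, the semantics of the first instruction can be imitated in portable C. The sketch below adds four 8-bit lanes packed into one 32-bit word so that no carry crosses a lane boundary, which is what the first PARALLEL group of SUBREG I8 additions expresses. It is only an illustration: the real paddb operates on 64-bit MMX registers, and the function names and lane numbering here are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Extracts byte lane i (0 = least significant) from a packed word.
 * The lane numbering is an assumption made for this sketch. */
uint8_t lane(uint32_t w, int i)
{
    assert(i >= 0 && i < 4);
    return (uint8_t)(w >> (8 * i));
}

/* Adds four packed 8-bit lanes independently (wrap-around per lane),
 * like the first PARALLEL group: no carry crosses a lane boundary. */
uint32_t packed_add8(uint32_t a, uint32_t b)
{
    const uint32_t high = 0x80808080u;            /* top bit of every lane */
    uint32_t low_sum = (a & ~high) + (b & ~high); /* carries stop at bit 7 */
    return low_sum ^ ((a ^ b) & high);            /* add the top bits without carry */
}
```

A hardware SIMD instruction performs the same lane-wise operation directly, without the masking that the portable version needs.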
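When the prototype was built (fiscal year 2002, Heisei 14), neither the instruction description system based on the TMD (Target Machine Description) nor the extensions to the specification of the low-level intermediate representation LIR existed in their current form. The prototype therefore converted data from LIR to SIR to connect with the infrastructure base, and code generation targeted SPARC V9/VIS.
(The main target is now IA-32/MMX.)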
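For the example below, when matching against the pdist instruction (one of the SIMD instructions) succeeded, we confirmed a speedup of 3.5 times over gcc -O2 (5.6 times when redundant code was removed by hand).
For details, see An example of SIMD Parallelization.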
#define ABSDIF(x, y) ((x <= y) ? y-x : x-y)

int error(char xx[16][16], char yy[16][16]){
    char *across;
    char *cacross;
    // int localDiff;
    int diff=0;
    int y;
    for (y=0;y<16;y++) {
        across = &xx[y][0];
        cacross = &yy[y][0];
        diff += ABSDIF(across[0], cacross[0]);
        diff += ABSDIF(across[1], cacross[1]);
        diff += ABSDIF(across[2], cacross[2]);
        diff += ABSDIF(across[3], cacross[3]);
        diff += ABSDIF(across[4], cacross[4]);
        diff += ABSDIF(across[5], cacross[5]);
        diff += ABSDIF(across[6], cacross[6]);
        diff += ABSDIF(across[7], cacross[7]);
        diff += ABSDIF(across[8], cacross[8]);
        diff += ABSDIF(across[9], cacross[9]);
        diff += ABSDIF(across[10], cacross[10]);
        diff += ABSDIF(across[11], cacross[11]);
        diff += ABSDIF(across[12], cacross[12]);
        diff += ABSDIF(across[13], cacross[13]);
        diff += ABSDIF(across[14], cacross[14]);
        diff += ABSDIF(across[15], cacross[15]);
    };
    return diff;
}

On the other hand, the following problems emerged.
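The complete set of SIMD parallelization modules is located under src/coins/simd in the source archive, together with the analyzer that performs the "data size inference" related to item c. For details, see the SIMD Parallelization External Specification.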
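-coins:simd
This option invokes the SIMD parallelizer. At present, the SIMD parallelization performed is specialized for IA32/SSE2.
-coins:target=x86simd
This option requests code generation for LIR that contains the SSE2 extension instruction set.
The two options are used together, as in -coins:target=x86simd,simd.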
The benchmark suite is located under Test2/TestSIMD/UEC/SIMD_bench in the source archive. A brief description is given in simd_bench_doc.txt.