_RAMBLING IN REAL TIME_
by Michael Abrash

Listing One

; use of fxch to allow addition of first two
; products to start while third multiplication finishes
    fld     [vec0+0]        ;starts & ends on cycle 0
    fmul    [vec1+0]        ;starts on cycle 1
    fld     [vec0+4]        ;starts & ends on cycle 2
    fmul    [vec1+4]        ;starts on cycle 3
    fld     [vec0+8]        ;starts & ends on cycle 4
    fmul    [vec1+8]        ;starts on cycle 5
    fxch    st(1)           ;no cost
    faddp   st(2),st(0)     ;starts on cycle 6

Listing Two

; unoptimized dot product; 17 cycles
    fld     [vec0+0]        ;starts & ends on cycle 0
    fmul    [vec1+0]        ;starts on cycle 1
    fld     [vec0+4]        ;starts & ends on cycle 2
    fmul    [vec1+4]        ;starts on cycle 3
    fld     [vec0+8]        ;starts & ends on cycle 4
    fmul    [vec1+8]        ;starts on cycle 5
                            ;stalls for cycles 6-7
    faddp   st(1),st(0)     ;starts on cycle 8
                            ;stalls for cycles 9-10
    faddp   st(1),st(0)     ;starts on cycle 11
                            ;stalls for cycles 12-14
    fstp    [dot]           ;starts on cycle 15,
                            ; ends on cycle 16

Listing Three

; optimized dot product; 15 cycles
    fld     [vec0+0]        ;starts & ends on cycle 0
    fmul    [vec1+0]        ;starts on cycle 1
    fld     [vec0+4]        ;starts & ends on cycle 2
    fmul    [vec1+4]        ;starts on cycle 3
    fld     [vec0+8]        ;starts & ends on cycle 4
    fmul    [vec1+8]        ;starts on cycle 5
    fxch    st(1)           ;no cost
    faddp   st(2),st(0)     ;starts on cycle 6
                            ;stalls for cycles 7-8
    faddp   st(1),st(0)     ;starts on cycle 9
                            ;stalls for cycles 10-12
    fstp    [dot]           ;starts on cycle 13,
                            ; ends on cycle 14

Listing Four

; unoptimized cross product; 36 cycles
    fld     [vec0+4]        ;starts & ends on cycle 0
    fmul    [vec1+8]        ;starts on cycle 1
    fld     [vec0+8]        ;starts & ends on cycle 2
    fmul    [vec1+4]        ;starts on cycle 3
                            ;stalls for cycles 4-5
    fsubrp  st(1),st(0)     ;starts on cycle 6
                            ;stalls for cycles 7-9
    fstp    [vec2+0]        ;starts on cycle 10,
                            ; ends on cycle 11
    fld     [vec0+8]        ;starts & ends on cycle 12
    fmul    [vec1+0]        ;starts on cycle 13
    fld     [vec0+0]        ;starts & ends on cycle 14
    fmul    [vec1+8]        ;starts on cycle 15
                            ;stalls for cycles 16-17
    fsubrp  st(1),st(0)     ;starts on cycle 18
                            ;stalls for cycles 19-21
    fstp    [vec2+4]        ;starts on cycle 22,
                            ; ends on cycle 23
    fld     [vec0+0]        ;starts & ends on cycle 24
    fmul    [vec1+4]        ;starts on cycle 25
    fld     [vec0+4]        ;starts & ends on cycle 26
    fmul    [vec1+0]        ;starts on cycle 27
                            ;stalls for cycles 28-29
    fsubrp  st(1),st(0)     ;starts on cycle 30
                            ;stalls for cycles 31-33
    fstp    [vec2+8]        ;starts on cycle 34,
                            ; ends on cycle 35

Listing Five

; optimized cross product; 22 cycles
    fld     [vec0+4]        ;starts & ends on cycle 0
    fmul    [vec1+8]        ;starts on cycle 1
    fld     [vec0+8]        ;starts & ends on cycle 2
    fmul    [vec1+0]        ;starts on cycle 3
    fld     [vec0+0]        ;starts & ends on cycle 4
    fmul    [vec1+4]        ;starts on cycle 5
    fld     [vec0+8]        ;starts & ends on cycle 6
    fmul    [vec1+4]        ;starts on cycle 7
    fld     [vec0+0]        ;starts & ends on cycle 8
    fmul    [vec1+8]        ;starts on cycle 9
    fld     [vec0+4]        ;starts & ends on cycle 10
    fmul    [vec1+0]        ;starts on cycle 11
    fxch    st(2)           ;no cost
    fsubrp  st(5),st(0)     ;starts on cycle 12
    fsubrp  st(3),st(0)     ;starts on cycle 13
    fsubrp  st(1),st(0)     ;starts on cycle 14
    fxch    st(2)           ;no cost
                            ;stalls for cycle 15
    fstp    [vec2+0]        ;starts on cycle 16,
                            ; ends on cycle 17
    fstp    [vec2+4]        ;starts on cycle 18,
                            ; ends on cycle 19
    fstp    [vec2+8]        ;starts on cycle 20,
                            ; ends on cycle 21

Listing Six

; optimized transformation: 34 cycles
    fld     [vec0+0]        ;starts & ends on cycle 0
    fmul    [matrix+0]      ;starts on cycle 1
    fld     [vec0+0]        ;starts & ends on cycle 2
    fmul    [matrix+16]     ;starts on cycle 3
    fld     [vec0+0]        ;starts & ends on cycle 4
    fmul    [matrix+32]     ;starts on cycle 5
    fld     [vec0+4]        ;starts & ends on cycle 6
    fmul    [matrix+4]      ;starts on cycle 7
    fld     [vec0+4]        ;starts & ends on cycle 8
    fmul    [matrix+20]     ;starts on cycle 9
    fld     [vec0+4]        ;starts & ends on cycle 10
    fmul    [matrix+36]     ;starts on cycle 11
    fxch    st(2)           ;no cost
    faddp   st(5),st(0)     ;starts on cycle 12
    faddp   st(3),st(0)     ;starts on cycle 13
    faddp   st(1),st(0)     ;starts on cycle 14
    fld     [vec0+8]        ;starts & ends on cycle 15
    fmul    [matrix+8]      ;starts on cycle 16
    fld     [vec0+8]        ;starts & ends on cycle 17
    fmul    [matrix+24]     ;starts on cycle 18
    fld     [vec0+8]        ;starts & ends on cycle 19
    fmul    [matrix+40]     ;starts on cycle 20
    fxch    st(2)           ;no cost
    faddp   st(5),st(0)     ;starts on cycle 21
    faddp   st(3),st(0)     ;starts on cycle 22
    faddp   st(1),st(0)     ;starts on cycle 23
    fxch    st(2)           ;no cost
    fadd    [matrix+12]     ;starts on cycle 24
    fxch    st(1)           ;starts on cycle 25
    fadd    [matrix+28]     ;starts on cycle 26
    fxch    st(2)           ;no cost
    fadd    [matrix+44]     ;starts on cycle 27
    fxch    st(1)           ;no cost
    fstp    [vec1+0]        ;starts on cycle 28,
                            ; ends on cycle 29
    fstp    [vec1+8]        ;starts on cycle 30,
                            ; ends on cycle 31
    fstp    [vec1+4]        ;starts on cycle 32,
                            ; ends on cycle 33

Example 1:

(a)
    FDIV    ST(0),ST(1)
    ADD     EAX,ECX
    INC     EDX

(b)
    FMUL    ST(0),ST(1)
    FST     [temp]

Example 2:

(a)
    FADD1
    FSUB
    FADD2
    FMUL

(b)
    FMUL    ST(1),ST(0)
    FMUL    ST(2),ST(0)

(c)
    FMUL    ST(1),ST(0)
    FLD     ST(4)
    FMUL    ST(0),ST(1)

(d)
    FADD    ST(1),ST(0)
    FLD     [temp]
    FSUB    ST(1),ST(0)

(e)
    FADD    ST(0),ST(2)
    FSUB    ST(0),ST(1)
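
For checking the results of the dot-product listings (Two and Three) and the transformation listing (Six), a plain C reference can be handy. The sketch below is not part of the original listings. The function names dot3 and transform3 are hypothetical, and the matrix layout (three rows of four 4-byte floats, with the fourth element of each row added as a translation term) is an assumption inferred from the byte offsets used in Listing Six. It makes no attempt to reproduce the cycle scheduling, only the arithmetic.

```c
/* Reference arithmetic only; names and layout are assumptions,
 * inferred from the offsets in the assembly listings. */

/* Dot product of two 3-element float vectors, summed in the same
 * left-to-right order as the dot-product listings. */
static float dot3(const float a[3], const float b[3])
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

/* Transform `in` by a matrix stored as 3 rows of 4 floats
 * (row stride 16 bytes, as the +0/+16/+32 offsets suggest).
 * Elements 0..2 of each row multiply x, y, z; element 3
 * (byte offsets +12, +28, +44) is added last. */
static void transform3(const float m[12], const float in[3], float out[3])
{
    for (int row = 0; row < 3; row++) {
        const float *r = m + 4 * row;
        out[row] = r[0] * in[0] + r[1] * in[1] + r[2] * in[2] + r[3];
    }
}
```

Because the x87 FPU holds intermediate values at extended precision on the register stack, the assembly results may differ from this C version in the low-order bits, so a comparison should probably allow a small tolerance.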