Burt is chief scientist of RSA Laboratories, a division of RSA Data Security, Inc. He received a PhD in computer science from MIT in 1988 and is interested in cryptography and fast-arithmetic techniques. You can contact him at burt@rsa.com.
The growing importance of data security, combined with the increased power of PCs, makes cryptography both a practical reality and a critical necessity in today's computer systems. But as cryptography becomes more widespread, it touches tools that aren't as powerful as, say, processors like the 80486 or Pentium. For instance, in one recent project, our challenge was to implement 512-bit RSA private-key operations in less than 10 seconds on Zilog's 8-bit Z80180 microcontroller running at 10 million cycles/second.
In this article, I'll share what we've learned about implementing mathematically-intensive systems like RSA on the Z80180.
Zilog's Z80180 (and its counterpart, Hitachi's 64180) is an 8-bit embedded controller that's commonly used in fax machines, modems, printers, and similar applications. It extends the old Z80 processor with a multiply instruction (critical to RSA performance), extra test and I/O instructions, and a "sleep" mode. An internal memory management unit (MMU) maps 16-bit logical addresses to 20-bit physical addresses.
The Z80180, which has a 1-Mbyte address space, has seven 8-bit registers A-E, H, and L, and two index registers, IX and IY. Most arithmetic and logical operations are performed with register A as an accumulator. The 16-bit register pair HL is an accumulator for several 16-bit arithmetic operations, with register pairs BC and DE as operands. The index registers also can accumulate a 16-bit addition.
A multiply instruction multiplies the two 8-bit halves of a register pair, overwriting the pair with the 16-bit product. Addressing modes include immediate, direct, register indirect, and indexed based on IX and IY, among others. The usual branch instructions are supported. A special instruction decrements register B, branching if, and only if, the result is nonzero.
Z80180 parts are available that run at 10 million cycles/second; a static version runs at up to 18 million cycles/second. In short, the Z80180 balances performance, cost, and power.
RSA cryptography is based on repeated multiplication, or exponentiation, modulo the product n of two large prime numbers p and q. Each user has two keys, a public key and a private key, and each user's modulus n is different. Users publish the public key, and keep the private key private.
The public key consists of the modulus and a public exponent e. The private key consists of the modulus and a private exponent d. The exponents are related by the equation ed ≡ 1 mod (p-1)(q-1).
Either key can encrypt or decrypt. For instance, Alice can encrypt a message m for Bob with Bob's public key (nB, eB) by computing c = m^eB mod nB. Bob recovers the message from the ciphertext c with his private key (nB, dB) by computing m = c^dB mod nB. Alice can also encrypt a message with her private key, so that anyone can recover the message with Alice's public key. Only Alice knows the private key, so if the recovered message makes any sense, it must be from Alice; the encrypted message is effectively her "digital signature."
Without the private key, decrypting ciphertext or "signing" a message is generally believed to be as hard as factoring the modulus n into p and q. Factoring, even for modest-sized RSA integers n (say, 512 to 1024 bits), is generally considered a difficult problem. For further information on public-key cryptography and RSA, see "Untangling Public-key Cryptography" by Bruce Schneier (DDJ, May 1992), and my article "Multiple-precision Arithmetic in C" (DDJ, August 1992).
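To make the key relationship concrete, here is a toy-sized sketch in C. The primes p = 61 and q = 53 and the keys e = 17, d = 2753 are my own illustrative choices, far too small for real use; note that ed = 46801 = 15·3120 + 1, so ed ≡ 1 mod (p-1)(q-1).

```c
#include <stdint.h>

/* Square-and-multiply exponentiation: returns b^e mod m. A small-number
   sketch only; real RSA operands are hundreds of bits long. Assumes
   m < 2^32 so the products below cannot overflow a uint64_t. */
static uint64_t modpow(uint64_t b, uint64_t e, uint64_t m)
{
    uint64_t result = 1;
    b %= m;
    while (e > 0) {
        if (e & 1)
            result = (result * b) % m;  /* multiply in this bit of e */
        b = (b * b) % m;                /* square for the next bit */
        e >>= 1;
    }
    return result;
}
```

With n = 61·53 = 3233, encrypting m = 65 gives c = modpow(65, 17, 3233), and modpow(c, 2753, 3233) recovers 65 again.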
How much work is involved in computing a modular exponentiation of the form a = b^c mod d?
The number of modular multiplications to compute b^c mod d is at most c-1, and it can be much less. Although no method is known for computing the optimal sequence of multiplications for an arbitrary exponent c--that's an NP-complete problem--heuristic methods get pretty good results.
The "binary" method takes λ(c)+ν(c) modular multiplications (see Knuth's The Art of Computer Programming, Volume 2), where λ(c) is one less than the length of the binary representation of c, and ν(c) is the number of 1-bits in the binary representation. The method squares for each bit of the exponent except the first, and multiplies for each 1-bit. Let <c_λ(c), ..., c_0> be the bits of c from most to least significant. This algorithm is shown in Figure 1.
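The binary method's control flow is easy to sketch with machine integers. This is an assumed illustration only; the article's version (Figure 1) applies the same scan to multiple-precision modular multiplications of big numbers.

```c
#include <stdint.h>

/* Left-to-right binary exponentiation: square for every bit of c after
   the first, multiply for every 1-bit. Assumes c >= 1 and d < 2^32 so
   the products below cannot overflow a uint64_t. */
static uint64_t binary_modexp(uint64_t b, uint64_t c, uint64_t d)
{
    int bit = 63;
    uint64_t a;
    b %= d;
    while (bit > 0 && !((c >> bit) & 1))
        bit--;                    /* find the most significant 1-bit */
    a = b;                        /* the leading 1-bit: a <- b */
    while (--bit >= 0) {
        a = (a * a) % d;          /* square for each remaining bit */
        if ((c >> bit) & 1)
            a = (a * b) % d;      /* multiply for each 1-bit */
    }
    return a;
}
```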
This is pretty good, since no method can do better than log2 c multiplications on average. The "quaternary" method, which considers the exponent c as radix-4 digits, averages about 1.375λ(c) modular multiplications; radix-m for larger m may do better. Exponent "recoding" may also help (refer to C.K. Koc's "High-radix and Bit Recoding Techniques for Modular Exponentiation," International Journal of Computer Mathematics, 1987).
The "Chinese Remainder Theorem" (CRT) approach speeds up RSA private-key exponentiation by almost a factor of four by taking advantage of simpler operations modulo the prime factors p and q. Instead of a 512-bit modular exponentiation, we have two 256-bit modular exponentiations, each of which is eight times faster with conventional methods than a 512-bit modular exponentiation.
With CRT and quaternary, a 512-bit RSA private-key operation takes on average about 2×1.375×256 = 704 256-bit modular multiplications, plus a small overhead due to the CRT. Let's call this 750 256-bit modular multiplications, just to be safe.
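The CRT recombination can be sketched in C with the same style of toy values (p = 61, q = 53, d = 2753, and qinv = 38, the inverse of q mod p; all my own illustrative choices, not values from the article).

```c
#include <stdint.h>

/* CRT speedup sketch: instead of one exponentiation mod n = pq, do one
   half-size exponentiation mod p and one mod q, then recombine (Garner's
   method). Assumes all moduli fit in 32 bits. */
static uint64_t modpow(uint64_t b, uint64_t e, uint64_t m)
{
    uint64_t r = 1;
    b %= m;
    while (e > 0) {
        if (e & 1) r = (r * b) % m;
        b = (b * b) % m;
        e >>= 1;
    }
    return r;
}

/* qinv is q^-1 mod p. */
static uint64_t crt_decrypt(uint64_t c, uint64_t p, uint64_t q,
                            uint64_t d, uint64_t qinv)
{
    uint64_t m1 = modpow(c % p, d % (p - 1), p);   /* exponent reduced mod p-1 */
    uint64_t m2 = modpow(c % q, d % (q - 1), q);   /* exponent reduced mod q-1 */
    uint64_t h = (qinv * ((m1 + p - m2 % p) % p)) % p;
    return m2 + h * q;                             /* 0 <= result < pq */
}
```

With d = 2753 the two reduced exponents, d mod 60 = 53 and d mod 52 = 49, are half the size of d, which is where the speedup comes from.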
Since our operands are several hundred bits long, we'll represent them as multiple-precision arrays. Let r be the digit radix of the machine, so r=256 for the 8-bit Z80180. (Our discussion carries over to any digit radix.) We represent an integer x as an n-digit array x[0], ..., x[n-1], according to the sum in Figure 2. Thus x[0] is the least significant digit and x[n-1] is the most significant.
We wish to compute a = bc, where a has 2n digits, and b and c have n digits. An easy algebraic exercise gives the expression in Figure 3(a). Such an expression leads to multiplication by "operand scanning," along the lines of the grade-school approach. We multiply by digits of the operand b from least to most significant, following the weights r^i.
A somewhat more difficult exercise gives a different expression, shown in Figure 3(b), which leads to multiplication by "product scanning," which is like convolution in signal processing. We compute digits of the product a from least to most significant, following the weights r^k.
If you're unfamiliar with product scanning, see Figure 4 in which the multiplication tableau consists of cross products between pairs of operand digits. Their columnwise sum, with carry propagation right to left, gives the product.
The operand-scanning method computes the product by accumulating the partial product b[i]cr^i for each i. As Figure 5 illustrates, there are n iterations. The variable x carries between iterations of the j loop. There is no carry between iterations of the i loop.
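Figure 5 gives the article's version; what follows is an illustrative C sketch of operand scanning with radix r = 256 (my own code, not the article's).

```c
#include <string.h>

/* Operand scanning: for each digit b[i], accumulate the partial product
   b[i]*c*r^i into a, grade-school style. Lengths: a[2n], b[n], c[n].
   x carries only between iterations of the j loop. */
static void operand_scan_mult(unsigned char *a, const unsigned char *b,
                              const unsigned char *c, unsigned int n)
{
    unsigned int i, j;
    memset(a, 0, 2 * n);
    for (i = 0; i < n; i++) {
        unsigned int x = 0;                    /* at most two digits wide */
        for (j = 0; j < n; j++) {
            x += a[i + j] + (unsigned int)b[i] * c[j];
            a[i + j] = (unsigned char)x;       /* a[i+j] <- x mod r */
            x >>= 8;                           /* x <- x / r */
        }
        a[i + n] = (unsigned char)x;           /* no carry into next i */
    }
}
```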
Most of the work in this method is in the j loop, so we can estimate the performance by implementing that loop; see Figure 6. Each iteration computes x ← x + a[i+j] + b[i]c[j], a[i+j] ← x mod r, and x ← x/r, where j varies. Register C contains the digit b[i], index register IX points to c[j], and index register IY points to a[i+j]. Register B counts down the number of iterations. Register E carries the low half of x from one iteration to the next, and register D contains the value 0. An iteration takes 105 cycles.
By hardcoding j as an index to the LD instructions, we can unroll the loop and avoid the last three instructions. The result takes 82 cycles.
It's easy to see that there are n^2 iterations of the j loop, so multiplication with this method takes about 82n^2 cycles, plus whatever overhead there is in the n iterations of the i loop.
Product scanning, on the other hand, computes the product by accumulating partial products for each k. (See Figure 7.) As Figure 8 shows, there are 2n iterations. The variable x accumulates within an iteration of the i loop, and between iterations of the k loop. Its value is always less than nr^2, so if n <= r, then x < r^3 and needs at most three digits.
The index k-i, like i, is always between 0 and n-1. On the last iteration of the k loop, i ranges from n to n-1, so the i loop has no iterations.
The implementation of this method's i loop (Figure 9) takes 79 cycles per iteration. Each iteration computes x ← x + b[i]c[k-i], where i varies. Register C contains the value 0, index register IX points to b[i], and index register IY points to c[k-i]. Register B counts down the number of iterations. Registers A, H, and L carry the value of x from one iteration to the next. (You can save a few more cycles by exchanging the roles of HL and IY. LD E,(HL) costs only six cycles and INC HL only four, while ADD IY,DE costs ten, for a net savings of eight cycles. But "shifting" IY after the i loop is a little harder. Also, the Z80180 doesn't have an indexed addressing mode for HL. If we're going to unroll the loop, we're best staying with IY as an index register.)
This method is simpler than the operand-scanning method on the Z80180, and it's also 25 percent faster per iteration!
Unrolling, as expected, saves 23 cycles; we hardcode i as an index to the LD instructions. The result takes just 56 cycles per iteration.
Multiplication with this method takes about 56n^2 cycles, plus whatever overhead there is in the 2n iterations of the k loop. Although there are more iterations of the k loop here than iterations of the i loop in the operand-scanning method, the additional overhead is not significant compared to the 26n^2-cycle savings.
We have our first estimate of RSA performance: 256-bit multiplication, with n=32 and 10 percent overhead, takes 56×32^2×1.1 cycles (about 63,000), or 6.3 milliseconds at 10 million cycles/second.
Why (and when) is product scanning faster than operand scanning? Product scanning stores each product digit once, not after every multiply. Of course, it fetches operand digits before every multiply. But it doesn't fetch product digits at all! In all, product scanning has about one-third fewer memory references than operand scanning. It also has many fewer "shifts."
Product scanning needs more register storage than operand scanning. The variable x in product scanning needs at least three digits, whereas in the operand-scanning method it needs only one digit. With two pointers, two multiplier inputs, and perhaps a counter already in registers, some processors may not have enough register storage left. The Z80180 does have enough left, a benefit of its 16-bit register pairs. (Other processors may have enough registers for an r=256 implementation, but not an r=2^16 implementation; operand scanning with r=2^16, if possible, would be preferable to product scanning with r=256.)
BigMult is a subroutine that computes the product of two multiple-precision integers. Listing One (page 90) gives a non-optimized C version, Listing Two (page 90) is its header file, and Listing Three (page 90) is an optimized assembly-language version in Z80180 assembler (written with Microtec Research's Z80180 Assembler 6.0; it interfaces to Microtec's C 6.0). In each version, global variables BIG_MULT_A, BIG_MULT_B, BIG_MULT_C, and BIG_MULT_N correspond to variables a, b, c, and n discussed earlier. BigMult implements product scanning, and consists of two parts--one for k from 0 to n-1, the other for k from n to 2n-1. Each part calls an "inner loop" that computes the sum in Figure 7. (This is the same as the i loop in Figure 9.) For performance, the inner loop is unrolled, and that's where efficient RSA methods give way to Z80180 programming techniques.
The unrolled inner loop presents a challenge because we have to jump to a different point in the loop each time--the number of iterations varies from 0 to n-1. We must therefore jump to an indirect address. The only way to do this seems to be to jump indirectly through one of the registers HL, IX, or IY. But each of them already has a role in the unrolled loop.
The solution is a familiar one in Z80180 programming: Don't jump to the loop, return to it. The sequence:
LD DE,(address)
PUSH DE
RET
loads an address into register pair DE, pushes it onto the stack, and returns, thereby jumping to the address.
Since the inner loop is called from more than one place, the caller pushes the return address on the stack before pushing the inner loop address. The inner loop then returns to the right caller.
Index registers IX and IY point to operands b and c, respectively. During the first half, IX moves ahead while IY stays fixed; during the second half, IY moves ahead while IX stays fixed. The jump address changes each time by the length of the inner iteration, so we just add or subtract that length before jumping. The pointer to the product a is on the stack.
The REPT directive unrolls the inner loop, with the variable iteration ranging from the number of iterations down to 1. The inner iteration hardcodes the counter as an index to the load instructions that fetch b[i] and c[j].
BigMult can handle up to 128-byte numbers, being limited by the range of offsets in the indexed load instruction. With a slight variation to the outer loop, BigMult could handle larger numbers.
Multiplication is half of modular multiplication; the other half is division. Division takes only a little longer than multiplication, though implementing it is generally more difficult.
An alternative is Montgomery multiplication, which has the advantage that it can be implemented by either product or operand scanning--and its implementation is almost identical to multiplication's. Division seems inherently to involve operand scanning, and we prefer product scanning for the Z80180. (See the accompanying textbox "Montgomery Multiplication.")
A disadvantage of Montgomery multiplication is that you need to convert in and out of the Montgomery "representation." But once you're in that representation, Montgomery multiplication keeps you there. You need only to convert at the start of modular exponentiation, and convert out at the end.
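The conversion and the multiplication itself can be illustrated with machine integers, following the definition in the accompanying textbox. Here r^n is fixed at 2^16 (r = 256, n = 2), and all parameter values in the test are my own toy choices, not values from the article.

```c
#include <stdint.h>

#define R 65536u   /* r^n with r = 256, n = 2 */

/* Least positive v such that vd + 1 is divisible by r^n (brute force
   for clarity; the extended Euclidean algorithm is the practical way). */
static uint64_t mont_v(uint64_t d)
{
    uint64_t v;
    for (v = 1; v < R; v++)
        if ((v * d + 1) % R == 0)
            return v;
    return 0;  /* unreachable when d is odd */
}

/* Md(b,c) = (bc + (vbc mod r^n)d)/r^n. The numerator is exactly
   divisible by r^n because vd = -1 (mod r^n). Requires b, c < r^n. */
static uint64_t mont_mult(uint64_t b, uint64_t c, uint64_t d, uint64_t v)
{
    uint64_t t = (v * b * c) % R;   /* vbc mod r^n */
    return (b * c + t * d) / R;
}
```

To convert b into the representation, compute (b·r^n) mod d; Md then keeps results in that form, and Md(x, 1) converts back out.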
Montgomery multiplication code takes just about twice as many cycles as ordinary multiplication, which brings us to around 12.6 milliseconds for 256-bit Montgomery multiplication.
A 512-bit RSA private-key operation takes 750 times as long, or approximately 9.5 seconds, comfortably within our 10-second goal.
Architectural features such as the Z80180's 16-bit HL register-pair open the door to efficient methods such as multiplication by product scanning. Programming techniques such as jumping with a return instruction also play an important role.
Further work we're considering includes special code for modular squaring, which is 25 percent faster than modular multiplication. Since most of the estimated 750 modular multiplications are squarings, we can expect to save perhaps another two seconds.
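The squaring savings come from the symmetry of the tableau: each cross product b[i]b[k-i] with i < k-i appears twice, so it can be computed once and doubled. Here is a hedged C sketch of the idea (my own illustration, not the article's planned Z80180 code).

```c
/* Squaring by product scanning: roughly halve the multiplies by
   computing each off-diagonal cross product once and doubling it.
   Lengths: a[2n], b[n]; digit radix r = 256. */
static void big_square(unsigned char *a, const unsigned char *b, unsigned int n)
{
    unsigned long x = 0;                   /* column sum plus carry */
    unsigned int i, k;
    for (k = 0; k < 2 * n; k++) {
        unsigned int lo = (k < n) ? 0 : k - n + 1;
        for (i = lo; 2 * i <= k; i++) {    /* only up to the diagonal */
            unsigned long t = (unsigned long)b[i] * b[k - i];
            x += (2 * i < k) ? 2 * t : t;  /* double off-diagonal terms */
        }
        a[k] = (unsigned char)x;           /* a[k] <- x mod r */
        x >>= 8;                           /* x <- x / r */
    }
}
```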
Many of the methods and techniques can be extended to other processors, and the area of cryptography implementation remains a critical and practical challenge.
Zilog's Mark van Zanten, Alan Chan, and Adam Tucholski provided computing support for the code described in this article. RSA's Jim Bidzos encouraged me to publish the article, and RSA's Matt Robshaw offered comments and suggestions.
Until Montgomery published a brief note in 1985 renewing earlier ideas of Hensel, most methods for modular multiplication involved division or an approximation to it (see "Modular Multiplication without Trial Division" by P.L. Montgomery, Mathematics of Computation, 1985).
Let d be an odd integer, n be its length in digits, and r be the digit radix. Let v be the least positive integer such that vd+1 is divisible by r^n. This value depends only on d, r, and n, and can be computed by the extended Euclidean algorithm (see Knuth).
Montgomery multiplication, which we write Md, is defined as Md(b,c) = (bc + (vbc mod r^n)d)/r^n. It has the key property Md(b,c) ≡ bcr^(-n) (mod d): since vd ≡ -1 (mod r^n), the numerator is divisible by r^n, and the (vbc mod r^n)d term vanishes modulo d.
The first k loop computes in a the equivalent of vbc mod r^n, and adds the (vbc mod r^n)d term and the bc term. Since the least significant digits of the sum are known to be 0, the loop does not store them, although it does carry between iterations in the variable x.
The second k loop continues to add the (vbc mod r^n)d and bc terms, storing the sum in a. By starting at a[0] rather than a[n], the loop divides by r^n in the process.
The interleaved product-scanning approach saves loop overhead, and it needs only n digits for a, not 2n digits as in multiplication followed by division, or in a noninterleaved approach.
--B.S.K.
Figure 6: The inner (j) loop of operand-scanning multiplication (105 cycles per iteration).

Cycles Instruction
LOOP:
; Fetch c[j] and multiply by b[i].
; HL contains the 16-bit product.
14 LD H,(IX)
4 LD L,C
17 MLT HL
; Add carry to HL.
7 ADD HL,DE
; Fetch a[i+j], and add to HL.
14 LD E,(IY)
7 ADD HL,DE
; Store new a[i+j] and copy new carry.
15 LD (IY),L
4 LD E,H
; Next iteration.
7 INC IX
7 INC IY
9 DJNZ LOOP
---
105
========================================================================
Figure 9: The inner (i) loop of product-scanning multiplication (79 cycles per iteration).

========================================================================
Cycles Instruction
========================================================================
LOOP:
; Fetch b[i] and c[k-i], and multiply.
; DE contains the 16-bit product.
14 LD D,(IX)
14 LD E,(IY)
17 MLT DE
; Add product to AHL.
7 ADD HL,DE
4 ADC A,C
; Next iteration.
7 INC IX
7 DEC IY
9 DJNZ LOOP
--
79
========================================================================
_THE Z80180 AND BIG-NUMBER ARITHMETIC_
by Burton Kaliski, Jr.
[LISTING ONE]
/* Copyright (C) RSA Data Security, Inc. created 1993. All rights reserved. */
#include "bigmult.h"
unsigned char BIG_MULT_A[256];
unsigned char BIG_MULT_B[128];
unsigned char BIG_MULT_C[128];
unsigned int BIG_MULT_N;
/* Computes a = b*c. Lengths: a[2*n], b[n], c[n]. */
void BigMult (void)
{
  unsigned long x;
  unsigned int i, k;

  x = 0;
  for (k = 0; k < BIG_MULT_N; k++) {
    for (i = 0; i <= k; i++)
      x += ((unsigned long)BIG_MULT_B[i])*BIG_MULT_C[k-i];
    BIG_MULT_A[k] = (unsigned char)x;
    x >>= 8;
  }
  for (; k < (unsigned int)2*BIG_MULT_N; k++) {
    for (i = k-BIG_MULT_N+1; i < BIG_MULT_N; i++)
      x += ((unsigned long)BIG_MULT_B[i])*BIG_MULT_C[k-i];
    BIG_MULT_A[k] = (unsigned char)x;
    x >>= 8;
  }
}
[LISTING TWO]
/* Copyright (C) RSA Data Security, Inc. created 1993. All rights reserved. */
extern unsigned char BIG_MULT_A[256];
extern unsigned char BIG_MULT_B[128];
extern unsigned char BIG_MULT_C[128];
extern unsigned int BIG_MULT_N;
void BigMult (void);

[LISTING THREE]
; Copyright (C) RSA Data Security, Inc. created 1993. All rights reserved.
        NAME    BIGMULT

        SEGMENT DSEG,BYTE,DATA

        PUBLIC  _BIG_MULT_A
_BIG_MULT_A     DEFS 256
        PUBLIC  _BIG_MULT_B
_BIG_MULT_B     DEFS 128
        PUBLIC  _BIG_MULT_C
_BIG_MULT_C     DEFS 128
        PUBLIC  _BIG_MULT_N
_BIG_MULT_N     DEFS 2
address         DEFS 2

        SEGMENT CSEG,BYTE,CODE

; BigMult computes a = bc.
; Assumes 1 <= n <= 128.
; Lengths: a[2*n], b[n], c[n].
        PUBLIC  _BigMult
_BigMult:
; Save registers.
        PUSH    IX
; Point address to end of inner loop.
        LD      HL,endInnerLoop
        LD      (address),HL
; Push pointer to a.
        LD      IX,_BIG_MULT_A
        PUSH    IX
; Set x <- 0, point IX to b[0], and point IY to c[0]. HL is x.
; C = 0 is a constant.
        LD      HL,0
        LD      IX,_BIG_MULT_B
        LD      IY,_BIG_MULT_C
        LD      C,0
; n iterations.
        LD      A,(_BIG_MULT_N)
        LD      B,A
firstLoop:
; Clear carry and call inner loop. The call is implemented with a "return"
; since there aren't enough registers to hold the address. The inner loop
; computes the inner iteration for (i,j) from (0,k) to (k-1,1).
        LD      DE,firstLoopReturn
        PUSH    DE
        LD      DE,(address)
        PUSH    DE
        XOR     A
        RET
firstLoopReturn:
; Compute x <- x + b[k]*c[0].
        LD      D,(IX)
        LD      E,(IY)
        MLT     DE
        ADD     HL,DE
        ADC     A,C
; Compute a[k] <- x mod r, and increment k.
        EX      (SP),IX
        LD      (IX),L
        INC     IX
        EX      (SP),IX
; Compute x <- x / r, where r is the digit radix.
        LD      L,H
        LD      H,A
; Move IX forward, move address back, decrement loop counter, and repeat.
        INC     IX
        PUSH    HL
        LD      HL,(address)
        LD      DE,-innerIterationLen
        ADD     HL,DE
        LD      (address),HL
        POP     HL
        DJNZ    firstLoop
; Move address forward. (The number of inner iterations should be the same
; the last time through the first loop as the first time through the second
; loop: n-1.)
        PUSH    HL
        LD      HL,(address)
        LD      DE,innerIterationLen
        ADD     HL,DE
        LD      (address),HL
        POP     HL
; n iterations.
        LD      A,(_BIG_MULT_N)
        LD      B,A
secondLoop:
; Clear carry and call inner loop. The call is implemented with a "return"
; since there aren't enough registers to hold the address. The inner loop
; computes the inner iteration for (i,j) from (k-n+1,n-1) to (n-1,k-n+1).
        LD      DE,secondLoopReturn
        PUSH    DE
        LD      DE,(address)
        PUSH    DE
        XOR     A
        RET
secondLoopReturn:
; Compute a[k] <- x mod r, and increment k.
        EX      (SP),IX
        LD      (IX),L
        INC     IX
        EX      (SP),IX
; Compute x <- x / r.
        LD      L,H
        LD      H,A
; Move IY forward, move address forward, decrement loop counter, repeat.
        INC     IY
        PUSH    HL
        LD      HL,(address)
        LD      DE,innerIterationLen
        ADD     HL,DE
        LD      (address),HL
        POP     HL
        DJNZ    secondLoop
; Pop pointer to a.
        POP     IX
; Restore registers and return.
        POP     IX
        RET

; Unrolled inner loop. Each iteration computes x <- x + b[i]c[j] for some
; (i,j). HL is x, A is carry, IX points to b[last i], IY points to c[last j].
InnerLoop:
iteration       DEFL 127
        REPT    127
        LD      D,(IX-iteration)
        LD      E,(IY+iteration)
        MLT     DE
        ADD     HL,DE
        ADC     A,C
iteration       DEFL iteration-1
        ENDR
endInnerLoop:
        RET

innerLoopLen    EQU endInnerLoop - InnerLoop
innerIterationLen EQU innerLoopLen / 127

        END
Copyright © 1993, Dr. Dobb's Journal