Chapter start Previous page Next page
2.6.4 Multipliers
Figure 2.27 shows a symmetric 6bit array multiplier (an n bit multiplier multiplies two n bit numbers; we shall use n bit by m bit multiplier if the lengths are different). Adders a0f0 may be eliminated, which then eliminates adders a1a6, leaving an asymmetric CSA array of 30 (5 ¥ 6) adders (including one half adder). An n bit array multiplier has a delay proportional to n plus the delay of the CPA (adders b6f6 in Figure 2.27). There are two items we can attack to improve the performance of a multiplier: the number of partial products and the addition of the partial products.
Suppose we wish to multiply 15 (the multiplicand ) by 19 (the multiplier ) mentally. It is easier to calculate 15 ¥ 20 and subtract 15. In effect we complete the multiplication as 15 ¥ (20 1) and we could write this as 15 ¥ , with the overbar representing a minus sign. Now suppose we wish to multiply an 8bit binary number, A, by B = 00010111 (decimal 16 + 4 + 2 + 1 = 23). It is easier to multiply A by the canonical signeddigit vector ( CSD vector ) D = (decimal 32 8 + 1 = 23) since this requires only three add or subtract operations (and a subtraction is as easy as an addition). We say B has a weight of 4 and D has a weight of 3. By using D instead of B we have reduced the number of partial products by 1 (= 4 3).
We can recode (or encode) any binary number, B, as a CSD vector, D, as follows (canonical means there is only one CSD vector for any number):
D_{ i} = B_{ i} + C_{ i} 2C_{ i} _{ + 1} ,(2.58)
where C_{ i} _{ + 1} is the carry from the sum of B_{ i } _{ + 1} + B_{ i} + C_{ i} (we start with C_{ 0} = 0).
As another example, if B = 011 (B_{ 2} = 0, B_{ 1} = 1, B_{ 0} = 1; decimal 3), then, using Eq. 2.58,
D_{ 0} = B_{ 0} + C_{ 0} 2C_{ 1} = 1 + 0 2 = ,
D_{ 1} = B_{ 1} + C_{ 1} 2C_{ 2} = 1 + 1 2 = 0,
D_{ 2} = B_{ 2} + C_{ 2} 2C_{ 3} = 0 + 1 0 = 1,(2.59)
so that D = (decimal 4 1 = 3). CSD vectors are useful to represent fixed coefficients in digital filters, for example.
We can recode using a radix other than 2. Suppose B is an ( n + 1)digit two's complement number,
B = B_{ 0} + B_{ 1} 2 + B_{ 2} 2^{ 2} + . . . + B i 2^{ i } + . . . + B_{ n} _{ 1} 2^{ n} ^{ 1} B_{ n} 2^{ n} .(2.60)
We can rewrite the expression for B using the following sleightofhand:
2B B = B = B_{ 0} + (B_{ 0} B_{ 1} )2 + . . . + (B_{ i} _{ 1} B_{ i} )2^{ i } + . . . + B n _{ 1} 2^{ n} ^{ 1} B_{ n} 2^{ n}
= (2B_{ 1} + B_{ 0} )2^{ 0} + (2B_{ 3} + B_{ 2} + B_{ 1} )2^{ 2} + . . .
+ (2B_{ i} + B_{ i} _{ 1} + B_{ i 2} )2^{ i} ^{ 1} + (2B_{ i } _{ + 2} + B_{ i } _{ + 1} + B_{ i} _{ } )2^{ i} ^{ + 1} + . . .
+ (2B_{ n} + B_{ i} _{ 1} + B_{ i} _{ 2} )2^{ n} ^{ 1} .(2.61)
This is very useful. Consider B = 101001 (decimal 9 32 = 23, n = 5),
B = 101001 = (2B_{ 1} + B_{ 0} )2^{ 0} + (2B_{ 3} + B_{ 2} + B_{ 1} )2^{ 2} + (2B_{ 5} + B_{ 4} + B_{ 3} )2^{ 4}
= ((2 ¥ 0) + 1)2^{ 0} + ((2 ¥ 1) + 0 + 0)2^{ 2} + ((2 ¥ 1) + 0 + 1)2^{ 4} .(2.62)
Equation 2.61 tells us how to encode B as a radix4 signed digit, E = (decimal 16 8 + 1 = 23). To multiply by B encoded as E we only have to perform a multiplication by 2 (a shift) and three add/subtract operations.
Using Eq. 2.61 we can encode any number by taking groups of three bits at a time and calculating
E j = 2B_{ i} + B_{ i} _{ 1} + B_{ i} _{ 2} , E_{ j} _{ + 1} = 2B_{ i} _{ + 2} + B_{ i} _{ + 1} + B_{ i} , . . . ,(2.63)
where each 3bit group overlaps by one bit. We pad B with a zero, B_{ n} . . . B_{ 1} B_{ 0} 0, to match the first term in Eq. 2.61. If B has an odd number of bits, then we extend the sign: B_{ n} B_{ n} . . . B_{ 1} B_{ 0} 0. For example, B = 01011 (eleven), encodes to E = (16 4 1); and B = 101 is E = . This is called Booth encoding and reduces the number of partial products by a factor of two and thus considerably reduces the area as well as increasing the speed of our multiplier [Booth, 1951].
Next we turn our attention to improving the speed of addition in the CSA array. Figure 2.28(a) shows a section of the 6bit array multiplier from Figure 2.27. We can collapse the chain of adders a0f5 (5 adder delays) to the Wallace tree consisting of adders 5.15.4 (4 adder delays) shown in Figure 2.28(b).
Figure 2.28(c) pictorially represents multiplication as a sort of golf course. Each link corresponds to an adder. The holes or dots are the outputs of one stage (and the inputs of the next). At each stage we have the following three choices: (1) sum three outputs using a full adder (denoted by a box enclosing three dots); (2) sum two outputs using a half adder (a box with two dots); (3) pass the outputs directly to the next stage. The two outputs of an adder are joined by a diagonal line (full adders use black dots, half adders white dots). The object of the game is to choose (1), (2), or (3) at each stage to maximize the performance of the multiplier. In treebased multipliers there are two ways to do thisworking forward and working backward.
In a Wallacetree multiplier we work forward from the multiplier inputs, compressing the number of signals to be added at each stage [Wallace, 1960]. We can view an FA as a 3:2 compressor or (3, 2) counter it counts the number of '1's on the inputs. Thus, for example, an input of '101' (two '1's) results in an output '10' (2). A half adder is a (2, 2) counter . To form P_{ 5} in Figure 2.29 we must add 6 summands (S_{ 05} , S_{ 14} , S_{ 23} , S_{ 32} , S_{ 41} , and S_{ 50} ) and 4 carries from the P_{ 4} column. We add these in stages 17, compressing from 6:3:2:2:3:1:1. Notice that we wait until stage 5 to add the last carry from column P_{ 4} , and this means we expand (rather than compress) the number of signals (from 2 to 3) between stages 3 and 5. The maximum delay through the CSA array of Figure 2.29 is 6 adder delays. To this we must add the delay of the 4bit (9 inputs) CPA (stage 7). There are 26 adders (6 half adders) plus the 4 adders in the CPA.
In a Dadda multiplier (Figure 2.30) we work backward from the final product [Dadda, 1965]. Each stage has a maximum of 2, 3, 4, 6, 9, 13, 19, . . . outputs (each successive stage is 3/2 times largerrounded down to an integer). Thus, for example, in Figure 2.28(d) we require 3 stages (with 3 adder delaysplus the delay of a 10bit output CPA) for a 6bit Dadda multiplier. There are 19 adders (4 half adders) in the CSA plus the 10 adders (2 half adders) in the CPA. A Dadda multiplier is usually faster and smaller than a Wallacetree multiplier.
In general, the number of stages and thus delay (in units of an FA delayexcluding the CPA) for an n bit treebased multiplier using (3, 2) counters is
log_{ 1.5} n = log_{ 10} n /log_{ 10} 1.5 = log_{ 10} n /0.176.(2.64)
Figure 2.31(a) shows how the partialproduct array is constructed in a conventional 4bit multiplier. The FerrariStefanelli multiplier (Figure 2.31b) "nests" multipliersthe 2bit submultipliers reduce the number of partial products [Ferrari and Stefanelli, 1969].
There are several issues in deciding between parallel multiplier architectures:
 Since it is easier to fold triangles rather than trapezoids into squares, a Wallacetree multiplier is more suited to fullcustom layout, but is slightly larger, than a Dadda multiplierboth are less regular than an array multiplier. For cellbased ASICs, a Dadda multiplier is smaller than a Wallacetree multiplier.
 The overall multiplier speed does depend on the size and architecture of the final CPA, but this may be optimized independently of the CSA array. This means a Dadda multiplier is always at least as fast as the Wallacetree version.
 The loworder bits of any parallel multiplier settle first and can be added in the CPA before the remaining bits settle. This allows multiplication and the final addition to be overlapped in time.
 Any of the parallel multiplier architectures may be pipelined. We may also use a variably pipelined approach that tailors the register locations to the size of the multiplier.
 Using (4, 2), (5, 3), (7, 3), or (15, 4) counters increases the stage compression and permits the size of the stages to be tuned. Some ASIC cell libraries contain a (7, 3) countera 2bit fulladder . A (15, 4) counter is a 3bit full adder. There is a tradeoff in using these counters between the speed and size of the logic cells and the delay as well as area of the interconnect.
 Power dissipation is reduced by the treebased structures. The simplified carrysave logic produces fewer signal transitions and the tree structures produce fewer glitches than a chain.
 None of the multiplier structures we have discussed take into account the possibility of staggered arrival times for different bits of the multiplicand or the multiplier. Optimization then requires a logicsynthesis tool.
Chapter start Previous page Next page
