Figure 2.27 shows a symmetric 6-bit array multiplier (an n -bit multiplier multiplies two n -bit numbers; we shall use n -bit by m -bit multiplier if the lengths are different). Adders a0f0 may be eliminated, which then eliminates adders a1a6, leaving an asymmetric CSA array of 30 (5 ¥ 6) adders (including one half adder). An n -bit array multiplier has a delay proportional to n plus the delay of the CPA (adders b6f6 in Figure 2.27). There are two items we can attack to improve the performance of a multiplier: the number of partial products and the addition of the partial products.
Suppose we wish to multiply 15 (the multiplicand ) by 19 (the multiplier ) mentally. It is easier to calculate 15 ¥ 20 and subtract 15. In effect we complete the multiplication as 15 ¥ (20 1) and we could write this as 15 ¥ , with the overbar representing a minus sign. Now suppose we wish to multiply an 8-bit binary number, A, by B = 00010111 (decimal 16 + 4 + 2 + 1 = 23). It is easier to multiply A by the canonical signed-digit vector ( CSD vector ) D = (decimal 32 8 + 1 = 23) since this requires only three add or subtract operations (and a subtraction is as easy as an addition). We say B has a weight of 4 and D has a weight of 3. By using D instead of B we have reduced the number of partial products by 1 (= 4 3).
Equation 2.61 tells us how to encode B as a radix-4 signed digit, E = (decimal 16 8 + 1 = 23). To multiply by B encoded as E we only have to perform a multiplication by 2 (a shift) and three add/subtract operations.
where each 3-bit group overlaps by one bit. We pad B with a zero, B n . . . B 1 B 0 0, to match the first term in Eq. 2.61. If B has an odd number of bits, then we extend the sign: B n B n . . . B 1 B 0 0. For example, B = 01011 (eleven), encodes to E = (16 4 1); and B = 101 is E = . This is called Booth encoding and reduces the number of partial products by a factor of two and thus considerably reduces the area as well as increasing the speed of our multiplier [Booth, 1951].
Next we turn our attention to improving the speed of addition in the CSA array. Figure 2.28(a) shows a section of the 6-bit array multiplier from Figure 2.27. We can collapse the chain of adders a0f5 (5 adder delays) to the Wallace tree consisting of adders 5.15.4 (4 adder delays) shown in Figure 2.28(b).
FIGURE 2.28 Tree-based multiplication. (a) The portion of Figure 2.27 that calculates the sum bit, P 5 , using a chain of adders (cells a0f5). (b) We can collapse this chain to a Wallace tree (cells 5.15.5). (c) The stages of multiplication.
Figure 2.28(c) pictorially represents multiplication as a sort of golf course. Each link corresponds to an adder. The holes or dots are the outputs of one stage (and the inputs of the next). At each stage we have the following three choices: (1) sum three outputs using a full adder (denoted by a box enclosing three dots); (2) sum two outputs using a half adder (a box with two dots); (3) pass the outputs directly to the next stage. The two outputs of an adder are joined by a diagonal line (full adders use black dots, half adders white dots). The object of the game is to choose (1), (2), or (3) at each stage to maximize the performance of the multiplier. In tree-based multipliers there are two ways to do thisworking forward and working backward.
In a Wallace-tree multiplier we work forward from the multiplier inputs, compressing the number of signals to be added at each stage [Wallace, 1960]. We can view an FA as a 3:2 compressor or (3, 2) counter it counts the number of '1's on the inputs. Thus, for example, an input of '101' (two '1's) results in an output '10' (2). A half adder is a (2, 2) counter . To form P 5 in Figure 2.29 we must add 6 summands (S 05 , S 14 , S 23 , S 32 , S 41 , and S 50 ) and 4 carries from the P 4 column. We add these in stages 17, compressing from 6:3:2:2:3:1:1. Notice that we wait until stage 5 to add the last carry from column P 4 , and this means we expand (rather than compress) the number of signals (from 2 to 3) between stages 3 and 5. The maximum delay through the CSA array of Figure 2.29 is 6 adder delays. To this we must add the delay of the 4-bit (9 inputs) CPA (stage 7). There are 26 adders (6 half adders) plus the 4 adders in the CPA.
In a Dadda multiplier (Figure 2.30) we work backward from the final product [Dadda, 1965]. Each stage has a maximum of 2, 3, 4, 6, 9, 13, 19, . . . outputs (each successive stage is 3/2 times largerrounded down to an integer). Thus, for example, in Figure 2.28(d) we require 3 stages (with 3 adder delaysplus the delay of a 10-bit output CPA) for a 6-bit Dadda multiplier. There are 19 adders (4 half adders) in the CSA plus the 10 adders (2 half adders) in the CPA. A Dadda multiplier is usually faster and smaller than a Wallace-tree multiplier.
Figure 2.31(a) shows how the partial-product array is constructed in a conventional 4-bit multiplier. The FerrariStefanelli multiplier (Figure 2.31b) "nests" multipliersthe 2-bit submultipliers reduce the number of partial products [Ferrari and Stefanelli, 1969].
- Since it is easier to fold triangles rather than trapezoids into squares, a Wallace-tree multiplier is more suited to full-custom layout, but is slightly larger, than a Dadda multiplierboth are less regular than an array multiplier. For cell-based ASICs, a Dadda multiplier is smaller than a Wallace-tree multiplier.
- The overall multiplier speed does depend on the size and architecture of the final CPA, but this may be optimized independently of the CSA array. This means a Dadda multiplier is always at least as fast as the Wallace-tree version.
- The low-order bits of any parallel multiplier settle first and can be added in the CPA before the remaining bits settle. This allows multiplication and the final addition to be overlapped in time.
- Any of the parallel multiplier architectures may be pipelined. We may also use a variably pipelined approach that tailors the register locations to the size of the multiplier.
- Using (4, 2), (5, 3), (7, 3), or (15, 4) counters increases the stage compression and permits the size of the stages to be tuned. Some ASIC cell libraries contain a (7, 3) countera 2-bit full-adder . A (15, 4) counter is a 3-bit full adder. There is a trade-off in using these counters between the speed and size of the logic cells and the delay as well as area of the interconnect.
- Power dissipation is reduced by the tree-based structures. The simplified carry-save logic produces fewer signal transitions and the tree structures produce fewer glitches than a chain.
- None of the multiplier structures we have discussed take into account the possibility of staggered arrival times for different bits of the multiplicand or the multiplier. Optimization then requires a logic-synthesis tool.