Jeremy Smith (a.k.a. JezmoSSL), my fellow blogger here on All Programmable Planet, has managed to boost the performance of my Mandelbrot Fractal Viewer to around 500MHz, whereas I was achieving only 320MHz. With my pride bruised, I wanted to find out how and why.
On my voyage of discovery, I learned a lot about multiplication.
One of the first things I discovered about the Zynq-7020 is that it uses 7 Series logic, including the DSP48E1 function (referred to as a "slice"). This slice boasts an 18x25 multiplier, as opposed to the 18x18 signed multipliers found in the DSP48A slices on the Spartan 6 FPGA that I have been using up to this point.
Now, as far as I'm concerned, multiplying 25-bit and 18-bit numbers together seems a slightly odd combination, apart from the obvious advantage of allowing you to multiply by a 25-bit value using only one slice. Not that I'm knocking it, you understand; a mixture of 25-bit and 18-bit numbers fits the bill for many signal processing functions.
Having said this, Mandelbrot Fractals require the use numbers with higher precision, and implementing the large multipliers that are required was -- up to now -- tricky for me, especially when using signed multipliers as primitives. In the case of the Spartan 6 FPGA, 35-bit numbers are the sweet spot, because these are the largest values you can multiply every cycle using only four DSP48A1 slices.
So, what (if any) is the optimal size for the DSP48E1?
Well, unlike the "Spartan-6 FPGA DSP48A1 Slice User Guide," the "7 Series DSP48E1 Slice User Guide" doesn't provide any insights. Fortunately, I found a paper titled Large Multipliers with Fewer DSP Blocks that does. The way this works is as follows: To create a large multiplier for NxM bits, we draw an NxM square, and then tile it with smaller multipliers that are the size of our primitive.
As long as none of the tiles overlap, then the outputs of these sub-units are shifted left by the sum of both products' least significant bit index, and then summed to give the result. If your tiling isn't perfect, you can fix it up by either subtracting off the terms formed when the multipliers overlap, or by adding on extra terms to fill holes that are not completely covered.
In some cases, it's possible to calculate smaller terms using lookup tables (LUTs), thereby saving on DSP blocks. This is not as bad for performance as it might sound, because the DSP block has a few cycles of latency that can be used to pipeline this logic.
As always, there are a few tricks one needs to learn. If you are using a signed multiplier, for example, the most significant bit cannot be used unless it's for the sign bit of one of the two products, so the 25x18 multiplier becomes a 24x17, 25x17, or 24x18 multiplier depending on which MSB bits it overlaps. Additionally, the DSP48E1 includes an optional 17-bit shift in some of its signal paths, and with careful planning, you can take advantage of this feature.
As a demonstration of this technique, the following diagram illustrates how a 9x9 bit signed multiplier can be implemented using, at most, five 6x4 multipliers:
Rather than presenting a huge chunk of unverified VHDL, here is a small C program that verifies that the technique works as designed (click here to access a text version of this code for you to play with):
Once you get the hang of this method, the design of efficient large multipliers becomes a piece of cake! If you know of any similar or related techniques, please share them with the rest of us here.