In earlier columns, it may have seemed that I was saying one thing at one time and something completely contradictory at another time, but there is method in my madness...
Just to set the scene and remind ourselves how we came to be here, in one column I said:
If a logical function -- say, a counter -- is implemented using the FPGA's programmable fabric, that function is said to be soft. By comparison, if a function is implemented directly in the silicon, it is said to be hard. As these functions become larger and more complex, we tend to refer to them as cores. The advantage of soft cores is that you can make them do whatever you want. The advantage of hard cores is that they occupy less silicon real estate, offer higher performance, and consume less power.
(See: Ask Max: More Sophisticated FPGA Architectures.)
But later, when talking about the dual ARM Cortex-A9 hard-core processor inside the Zynq 7000 EPP, I said:
One scenario is that the software developers capture their code, run it on the Zynq's Cortex-A9 processors, and profile it to identify any functions that are slugging performance and acting as bottlenecks. These functions can then be handed over to the hardware design engineers for implementation in programmable fabric, where they (the functions, not the design engineers) will provide dramatically higher performance using lower clock frequencies while consuming a fraction of the power.
(See: Ask Max: FPGAs With Processor Cores.)
So am I trying to have things both ways? On the one hand, I seem to be saying that hard-core implementations of functions (and the Zynq's ARM Cortex-A9 processor is a hard core) have higher performance and consume less power than their soft-core equivalents. On the other hand, I'm saying that if a software function running on a hard-core processor is a bottleneck, we can implement it in programmable fabric, where it will provide higher performance and consume less power. How can this be?
Actually, this really is a surprisingly easy concept to wrap one's brain around (it's a tad harder to implement, of course). The thing is that general-purpose microprocessors and microcontrollers are really horribly inefficient -- the only reason they seem so powerful is that we can ramp up the frequency of the system clock to make them perform more operations per second. Dynamic power consumption is, of course, roughly proportional to clock frequency, so doubling the frequency roughly doubles the dynamic power consumption (and more than doubles it if the supply voltage also has to be raised to reach the higher frequency).
And even if we do increase the clock frequency, this still leaves the processor "thrashing around" when it comes to performing large amounts of data processing and digital signal processing (DSP) functions. As a simple example, suppose we have three 10 x 10 matrices, called "a," "b," and "y," where each element in these matrices is a 32-bit integer. Suppose that we wish to add the contents of matrix "a" to the contents of matrix "b" and store the results in matrix "y." If we were to do this on a processor, the pseudo code might look something like the following:
Pseudo code example for adding two 10 x 10 matrices.
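As a rough sketch of what such code could look like in C (the function name `matrix_add` and the constant `N` are illustrative assumptions, not taken from the original figure):

```c
#include <stdint.h>

#define N 10  /* matrix dimension used in the example in the text */

/* Add matrices a and b element by element and store the result in y.
   Each pass through the inner loop costs the processor two loads,
   one add, and one store, plus the loop bookkeeping. */
void matrix_add(int32_t a[N][N], int32_t b[N][N], int32_t y[N][N])
{
    for (int row = 0; row < N; row++) {
        for (int col = 0; col < N; col++) {
            y[row][col] = a[row][col] + b[row][col];
        }
    }
}
```

The inner statement runs one hundred times, once per element, which is the repetition discussed next.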
Let's reflect on how the processor handles this. We start with a read instruction that loads the value of the first element from matrix "a" into the CPU. Next we read the corresponding value from matrix "b" and add it to the value currently stored in the CPU. Then we store the result from our calculation to the appropriate element in matrix "y" somewhere in the system's memory. And now we have to do the whole thing again... and again... and again... for each of the matrix elements.
By comparison, we could create a dedicated hardware accelerator using the FPGA's programmable fabric. This hardware accelerator could comprise one hundred 32-bit adders, which means that the entire matrix addition could be performed in a single clock cycle. In turn, this means that the clock controlling this hardware accelerator could be running at a much lower speed than the CPU clock, thereby consuming significantly less power.
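To put some numbers on this, here is a back-of-the-envelope sketch in C. It assumes an idealized cost of four processor cycles per element (load, load, add, store) and ignores caches, pipelining, loop overhead, and data transfer; the constants and function names are hypothetical and chosen only for illustration.

```c
#include <stdint.h>

/* Idealized processor cost per element: load a, load b, add, store y.
   This figure is an assumption; real processors will differ. */
#define CPU_CYCLES_PER_ELEMENT 4u
#define ELEMENTS_PER_MATRIX    100u  /* 10 x 10 */

/* Matrix additions per second the processor can sustain at cpu_hz. */
uint64_t cpu_matrices_per_second(uint64_t cpu_hz)
{
    return cpu_hz / (CPU_CYCLES_PER_ELEMENT * ELEMENTS_PER_MATRIX);
}

/* Clock frequency the accelerator needs to match that rate, given that
   its hundred adders complete one whole matrix addition per clock cycle. */
uint64_t matching_accelerator_hz(uint64_t cpu_hz)
{
    return cpu_matrices_per_second(cpu_hz);  /* one matrix per cycle */
}
```

Under these assumptions, a processor clocked at 800 MHz completes 2,000,000 matrix additions per second, a rate the accelerator would match with a clock of only 2 MHz.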
Of course, I'm being a little overly simplistic here, because the CPU will have to load the input values into the programmable fabric and then retrieve the results, but this could be achieved efficiently using a DMA-type process. Furthermore, as opposed to simply adding the two matrices, we might wish to perform a significant number of logical and mathematical operations on each element, in which case the programmable fabric option starts to look very, very attractive.
The important point to understand is that processors are wonderful when it comes to performing decision-making control tasks, while hardware accelerators are more suitable when it comes to performing large quantities of repetitive data-processing tasks. Thus, the ideal solution is to achieve the optimal balance between those functions that are implemented in the processor and their compatriots that are implemented in a hardware accelerator.
And one final interesting aspect of all of this is that, after the main processor has provided the hardware accelerator with the appropriate data and instructed it to execute its task, the processor can leave the accelerator to perform its magic while it (the processor) is free to go off and do something else. When the accelerator has completed its mission, it can signal the processor, which will retrieve the data when it is ready and able to do so.
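The start, carry-on, and collect pattern described above can be sketched in C using a small software model of the accelerator. Everything here is hypothetical: a real accelerator would be a block of logic in the programmable fabric accessed through memory-mapped registers and an interrupt or status flag, whereas this model is stepped by hand with `accel_tick` purely for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define N 10

/* Hypothetical software stand-in for a matrix-add accelerator. */
typedef struct {
    int32_t a[N][N];
    int32_t b[N][N];
    int32_t y[N][N];
    bool busy;  /* set while a task is in progress */
    bool done;  /* set when results are ready to collect */
} accel_model_t;

/* Processor side: tell the accelerator to begin its task. */
void accel_start(accel_model_t *acc)
{
    acc->busy = true;
    acc->done = false;
}

/* Model of one accelerator clock cycle: all hundred additions
   complete together, then the done flag is raised. */
void accel_tick(accel_model_t *acc)
{
    if (!acc->busy) {
        return;
    }
    for (int row = 0; row < N; row++) {
        for (int col = 0; col < N; col++) {
            acc->y[row][col] = acc->a[row][col] + acc->b[row][col];
        }
    }
    acc->busy = false;
    acc->done = true;
}

/* Processor side: check whether the results are ready. */
bool accel_done(const accel_model_t *acc)
{
    return acc->done;
}

/* Start the accelerator, do unrelated work until it signals completion,
   and return how many units of other work were done in the meantime. */
int processor_flow(accel_model_t *acc)
{
    int other_work = 0;
    accel_start(acc);
    while (!accel_done(acc)) {
        other_work++;     /* stand-in for unrelated processor work */
        accel_tick(acc);  /* real hardware would run on its own clock */
    }
    return other_work;
}
```

In a real system the polling loop would typically be replaced by an interrupt, so the processor would not need to check the flag at all until the accelerator signals it.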
Furthermore, the design team may decide to implement a large number of hardware accelerators in the programmable fabric, each tailored to perform a different task. In some cases, these accelerators will work in isolation, communicating only with the main processor. In other cases, one accelerator may hand its results over to another, and so on and so forth until the final accelerator in the chain hands its results back to the main processor.
I am afraid that this column provides only a very modest overview of what can be quite a complex topic, but I hope that it will provide food for thought and stimulate conversation. What do you think? Does this make sense, or does it raise more questions than it answers?