Thus far in our discussions about pipelines, we have concentrated on a single loop. We have assumed that every stage of a pipeline will take the same amount of time and we have made many other simplifications.
Life, of course, is never that simple, and algorithms that go into sophisticated products are much more complex than this. Multiple loops, nested loops, sequences, and many other things are the reality. Questions about pipeline stalls and back pressure have come up, and these are most certainly among those realities. We have also basically ignored notions of feedback until my previous blog, at which time we touched on dependencies within loop iteration.
Now, I want to once again tackle a general theme that has appeared in many of the comments on my blogs. People have said things such as "I would never trust transaction-level synthesis" or "I can do a better job" or "This does not make it possible for a software engineer to build hardware." In fact, I had a discussion with a few colleagues just a few days ago about how the same things were said of RTL synthesis when it first came out. One of those people said, "We used the optimization stuff, but we would never use the synthesis capabilities." Another said, "It is crazy to think that software can do a better job than engineers at turning Karnaugh maps into hardware."
It took engineers a long time to find out the right way to write RTL code to get good results from synthesis and for style guidelines to become commonplace. Exactly the same things will apply to transaction-level synthesis (TLS). The users of today are blazing the trail and finding out what does and does not work. The vendors are watching them closely and adjusting their tools accordingly. Perhaps in another 10 years' time, 99 percent of engineers will be using TLS. However, I don't think it will happen that quickly because we don't have the verification tools in place.
Another reason why the transition will take longer is that the gap between a transaction-level description and RTL is much larger than the gap was from RTL to gates. At the same time, we are finding out that without considering very low-level impacts of the fabrication processes, it is not always possible to know what the best synthesis results would be. Thus, RTL synthesis has had to become layout dependent.
Try and think of it this way. Transaction-level exploration allows you to effectively select which algorithm you may want to use and the best overall structure for that algorithm. While the implementation will have some impact on the final figures, we are considering one or more orders of magnitude larger changes when we select the algorithm compared to routing delays and their impact on the micro-architecture. Also, software is software. We are not synthesizing software to make hardware. We are describing hardware using a language similar to one used for software.
Let me make this clear by providing a real example. Consider the AES encryption standard. This is a four-stage transformation and one of those stages is shown below:
This involves a non-linear substitution of each of the elements forming a 4x4 array. The reference software implementation for this is as follows:
The synthesis of that would have to access memory for each element in the array and -- using a lookup table -- find the new value to write back into the array. The hardware implementation would basically run at the same speed as the software. But this was a reference implementation defined because the equivalent hardware function was too difficult and time-consuming for software. The hardware implementation is as follows:
There is no way to translate, synthesize, or whatever you want to call it from one description to the other, but they both perform the same function at the end of the day. The bottom line is that software implementation is not hardware specification. Please can we move on? I will not address this issue again -- I promise.
Re: software implementation is not hardware specification
@Karl "And I cannot wait to see jan's reply to this one."
Sorry to have kept you waiting, but I needed to do my homework first.
You can find my extensive analysis of @brian's example elsewhere in this thread. Although I haven't received feedback yet, my conclusion is that the example is misleading and actually leads to the opposite conclusion if analyzed properly.
In summary: what @brian calls a "hardware implementation" is just one possible hardware description. On its own it may synthesize slightly more efficiently than alternative descriptions (although I doubt that for state-of-the-art ASIC synthesis), but I demonstrated why it is likely to result in a less optimal solution overall than the "software" description. (The reason is that the complete design problem probably needs a look-up table anyway.)
The bottom line is that if I am correct, the "software description" is the most efficient input for both a "software" and a "hardware" implementation. This analysis invalidates @brian's conclusion that software != hardware, and his suggestion to "move on".
Personally, I find it most useful to think about HDL design as event-driven microthread programming. Some of those programs can be implemented very efficiently thanks to a special compiler called a synthesis tool.
jandecaluwe 2/16/2013 2:03:40 PM User Rank Blogger
In reality, your example tells the opposite story
@brian I have analyzed your example based on [1] and [2]. I believe I have discovered several flaws that basically lead to the opposite conclusion.
You start with a "reference software implementation" like this:
for i in range(4):
    for j in range(4):
        state[i][j] = lookup(state[i][j])
You claim the following about this:
"The synthesis of that would have to access memory for each element in the array and -- using a lookup table -- find the new value to write back into the array. The hardware implementation would basically run at the same speed of the software."
I don't see why. There are no loop dependencies and the lookup tables are fixed. Therefore, I could feed this to an RTL synthesis tool and it would give me a fully parallel update of the whole 4x4 array in one cycle.
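The absence of loop dependencies can be checked with a minimal Python sketch. The `lookup` function below is a stand-in (an arbitrary fixed byte-to-byte function, not the real AES table), and the names `update_sequential` and `update_any_order` are made up for this illustration. Because each element is read and written independently, any visiting order gives the same result, which is the property that would let a tool unroll all 16 updates in parallel.

```python
import random

def lookup(value):
    # Stand-in for a fixed table lookup (NOT the real AES S-box):
    # any pure byte -> byte function works for this argument.
    return (value * 7 + 3) % 256

def update_sequential(state):
    # The loop from the reference code, visiting elements in row-major order.
    result = [row[:] for row in state]
    for i in range(4):
        for j in range(4):
            result[i][j] = lookup(result[i][j])
    return result

def update_any_order(state, seed=0):
    # The same 16 updates, visited in a shuffled order.
    result = [row[:] for row in state]
    positions = [(i, j) for i in range(4) for j in range(4)]
    random.Random(seed).shuffle(positions)
    for i, j in positions:
        result[i][j] = lookup(result[i][j])
    return result

state = [[4 * i + j for j in range(4)] for i in range(4)]
assert update_sequential(state) == update_any_order(state)
```

If one update depended on the result of another, the two functions could disagree; here they cannot, since no iteration reads a value written by a different iteration.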
In my view, RTL synthesis is HLS synthesis with a clock cycle constraint of 1. I would expect that an HLS synthesis tool is able to generate a range of solutions by relaxing the clock cycle constraint. Without constraint, it would fall back to the minimal area, "software" implementation. This is a classical area/timing trade-off.
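As a rough illustration of that trade-off, here is a toy Python model. It assumes (my simplification, not something a real tool reports) that area scales with the number of lookup units and that each unit handles one array element per clock cycle; `cycles_needed` is a hypothetical helper name.

```python
import math

def cycles_needed(num_elements, lookup_units):
    # Assumed model: each lookup unit processes one element per clock cycle.
    return math.ceil(num_elements / lookup_units)

# 16 elements in the 4x4 array; sweep from minimal area to fully parallel.
for units in (1, 2, 4, 8, 16):
    print(units, "lookup unit(s):", cycles_needed(16, units), "cycle(s)")
```

With 16 units the model gives the one-cycle, fully parallel case; with a single unit it gives the sequential, "software"-like case.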
Further on you say: "The hardware implementation is as follows":
out[i] = in[i] ^ in[(i+4)%8] ^ in[(i+5)%8] ^ ...
This is misleading in several ways. First, this equation doesn't correspond to the code above. The 'i' here refers to a single bit in a value, not to a row in a 4x4 array. To get the equivalent of just the lookup() function, you have to put a for loop around the bit equation.
Secondly, this is not "the" hardware implementation. It is just one possible hardware description, possibly a very good one. However, the lookup() function above is also a hardware description that can potentially be implemented very efficiently.
Then you say:
"There is no way to translate, synthesize, or whatever you want to call it from one description to the other, but they both perform the same function at the end of the day."
Here I really don't follow. Surely in one direction (from the equations to the lookup) the translation is trivial: just evaluate the equations and that gives you the lookup table.
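A minimal Python sketch of that direction follows. The equation quoted above is truncated, so the full form used here is taken from the affine transformation as defined in FIPS 197 (constant 0x63), not from the quoted line; the function and table names are made up for this illustration.

```python
AFFINE_CONSTANT = 0x63  # constant c of the FIPS 197 affine transformation

def bit(value, i):
    # Extract bit i (0 = least significant) of a byte.
    return (value >> i) & 1

def affine_by_equations(byte):
    # The for loop around the bit equation: one output bit per iteration.
    out = 0
    for i in range(8):
        out_bit = (bit(byte, i)
                   ^ bit(byte, (i + 4) % 8)
                   ^ bit(byte, (i + 5) % 8)
                   ^ bit(byte, (i + 6) % 8)
                   ^ bit(byte, (i + 7) % 8)
                   ^ bit(AFFINE_CONSTANT, i))
        out |= out_bit << i
    return out

# Evaluating the equations for every possible input yields the lookup table.
AFFINE_TABLE = [affine_by_equations(b) for b in range(256)]

def affine_by_lookup(byte):
    return AFFINE_TABLE[byte]

assert all(affine_by_lookup(b) == affine_by_equations(b) for b in range(256))
```

Going the other way, from an arbitrary table back to compact equations, is a logic minimization problem and is not attempted here.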
More importantly, what matters is not whether one can be translated to the other, but how efficiently they synthesize. I have tried it by comparing the equations with the equivalent lookup. In Xilinx, the difference is barely noticeable (as I would also expect in ASIC synthesis), in Altera it is larger, with the equations indeed having the edge.
However, there is one more important issue: the equations only correspond to the second step in the SubBytes transformation, known as the affine transformation. In the SubBytes lookup() function, the first step (the multiplicative inverse) is included also. I'm not aware of elegant equations for that step. Moreover, I don't see why the lookup() for the SubBytes step as a whole would require more area than a lookup for just the affine transformation. Lumping the two steps together in a single lookup seems the way to follow.
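To illustrate the point about table size, here is a Python sketch that builds the combined SubBytes table from both steps, following the definitions in FIPS 197. The helper names are made up, the inverse is found by brute force for clarity rather than efficiency, and the affine step is written with byte rotations, which is equivalent to the per-bit equations.

```python
def gf_multiply(a, b):
    # Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1.
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return product

def gf_inverse(a):
    # Step 1: multiplicative inverse; by AES convention 0 maps to 0.
    if a == 0:
        return 0
    for candidate in range(1, 256):
        if gf_multiply(a, candidate) == 1:
            return candidate
    raise ValueError("no inverse found")

def rotate_left(byte, count):
    return ((byte << count) | (byte >> (8 - count))) & 0xFF

def affine(byte):
    # Step 2: affine transformation, expressed with rotations.
    result = byte
    for count in (1, 2, 3, 4):
        result ^= rotate_left(byte, count)
    return result ^ 0x63

# Both steps folded into a single 256-entry table of 8-bit values.
SBOX = [affine(gf_inverse(b)) for b in range(256)]
```

The combined table has 256 entries of 8 bits each, exactly the size a table for the affine step alone would have.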
"The bottom line is that software implementation is not hardware specification."
If I'm correct, the lookup() version for the problem as a whole will be the superior solution, and the "hardware implementation" equations are superfluous.
The conclusion is that we have a single "reference" implementation that we can feed to a "software" compiler, an HLS synthesis tool, or an RTL synthesis tool, and expect excellent results. Following the logic of your article, that would suggest that software and hardware are conceptually indistinguishable, and that the only difference is the way in which different compilers look at the input description.
And that is exactly how I happen to think about it.
Hamster, I agree with you. Both hardware and software have their own standalone identities, and when they are integrated in the right mix, the result is more powerful. Nowadays we can also implement hardware functionality in digital form with software tools and programming languages.
Yes. Hardware can run on its own (if you define hardware as a digital system - inputs, outputs, logic and so on).
I am not 'into' implementing soft CPUs in my FPGA projects, so nearly all of them are completely devoid of software, even though they might have a mouse, keyboard or monitor attached.
Max Maxfield 1/30/2013 10:23:39 AM User Rank Blogger
Re: Hardware & Software
@hash: I was the same way -- I started off as a young lad building hobby projects with transistors and 7400-series TTL chips -- so my brain was totally "tuned" to the way hardware works. I learned software programming -- FORTRAN and Assembly while at University, and Pascal and C after I left University -- and I was happy with that, but when I first ran across an HDL (before Verilog and VHDL existed) I immediately understood the concept and how it mapped onto the hardware.
hash.era 1/30/2013 10:16:10 AM User Rank Clever Clogs
Re: Hardware & Software
I understood it really well. I don't know why others can't. Normally I have to go through something 3 or 4 times to get it into my stupid brain, but this went in like a missile.