In my previous blog about transaction-level synthesis, I discussed the notions of dependency in both the control and data senses. Dependency provides a guide to the synthesis tool about where and how concurrency can be created.
We have also considered the notion of micro-architecture in terms of rough resource allocation and performance. Now it is time to start thinking about building some real circuitry. I am going to continue using the simple data flow from the last blog, which implements the expression A = B + (C*D) - (E+F) * G.
If we wanted the fastest implementation, we could allocate two adders, two multipliers, and a subtractor and then map each of the operators to these allocated resources.
Purely for the sake of these discussions, let us also assume that a multiplier takes longer than an adder, which in turn takes longer than a subtractor. Based on this, we can determine the shortest (i.e., fastest) and longest (i.e., slowest) paths through the design. The longest paths are exercised by changes in inputs C, D, E, and F, each of which has to pass through a multiplier, an adder, and the subtractor. The shortest path is associated with a change in input B, which passes through only an adder and the subtractor.
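To make the path analysis concrete, here is a minimal Python sketch that walks the data flow and adds up operator delays from each input to the output A. The delay numbers are purely hypothetical; only their ordering (multiplier > adder > subtractor) comes from the assumption above. The node names (m1, a1, a2, m2) are also just labels chosen for this illustration.

```python
# Hypothetical relative delays. Only the ordering (mul > add > sub) comes
# from the text; the actual numbers are made up for illustration.
DELAY = {"mul": 5, "add": 2, "sub": 1}

# Data flow for A = B + (C*D) - (E+F) * G.
# Each node maps to (operator kind, operands). Operands that are not
# themselves nodes are primary inputs.
GRAPH = {
    "m1": ("mul", ["C", "D"]),    # C * D
    "a1": ("add", ["E", "F"]),    # E + F
    "a2": ("add", ["B", "m1"]),   # B + (C*D)
    "m2": ("mul", ["a1", "G"]),   # (E+F) * G
    "A":  ("sub", ["a2", "m2"]),  # final subtraction
}


def delay_from(signal, output="A"):
    """Return the longest delay from `signal` to `output` (None if no path)."""
    if signal == output:
        return 0
    best = None
    for node, (kind, operands) in GRAPH.items():
        if signal in operands:
            downstream = delay_from(node, output)
            if downstream is not None:
                total = DELAY[kind] + downstream
                best = total if best is None else max(best, total)
    return best


# Delay seen by a change on each primary input.
paths = {name: delay_from(name) for name in "BCDEFG"}

if __name__ == "__main__":
    for name, delay in sorted(paths.items(), key=lambda item: item[1]):
        print(f"input {name}: delay {delay}")
```

With these made-up numbers, B comes out as the shortest path and C, D, E, and F tie for the longest, while G sits in between because it skips the first-level adder.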
Each critical path requires the use of a multiplier, and several alternative implementations could be chosen to match the desired performance. Keep in mind that this solution would also require a large area and consume a lot of power.
At this stage, you are probably thinking that this looks very much like RTL synthesis, and it is… until we take the next step, which is to look at multi-cycle solutions to this problem. This is generally referred to as scheduling. With RTL synthesis, all of this logic has to fit into one clock cycle. However, with transaction-level synthesis, we can consider a variety of other solutions that share resources over time. The process of looking at multiple architectures without having to modify the original source is referred to as architectural exploration.
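As a rough illustration of what scheduling means, the sketch below runs a simple greedy list scheduler over the same expression. It is not how any particular synthesis tool works. It assumes that every operation takes exactly one clock cycle, that results are registered between cycles, and that additions and subtractions can share a single hypothetical "addsub" unit. The function and node names are invented for this example.

```python
# Operations of A = B + (C*D) - (E+F) * G: name -> (unit type, operands).
# Assumption: adds and subtracts can run on the same "addsub" unit.
OPS = {
    "m1": ("mul", ["C", "D"]),       # C * D
    "a1": ("addsub", ["E", "F"]),    # E + F
    "a2": ("addsub", ["B", "m1"]),   # B + (C*D)
    "m2": ("mul", ["a1", "G"]),      # (E+F) * G
    "A":  ("addsub", ["a2", "m2"]),  # final subtraction
}


def schedule(ops, resources):
    """Greedy list scheduling; every op takes one cycle on one unit."""
    for unit, _ in ops.values():
        if resources.get(unit, 0) < 1:
            raise ValueError(f"no '{unit}' unit allocated")
    done = {}   # op name -> cycle in which it executes
    cycle = 0
    while len(done) < len(ops):
        cycle += 1
        free = dict(resources)  # units available in this cycle
        for name, (unit, operands) in ops.items():
            if name in done:
                continue
            # Ready if each operand is a primary input or was computed
            # in an earlier cycle (so its value is already registered).
            ready = all(
                o not in ops or done.get(o, cycle) < cycle for o in operands
            )
            if ready and free[unit] > 0:
                done[name] = cycle
                free[unit] -= 1
    return done


shared = schedule(OPS, {"mul": 1, "addsub": 1})

if __name__ == "__main__":
    for name, cycle in shared.items():
        print(f"cycle {cycle}: {name}")
```

Under these assumptions, one multiplier and one adder/subtractor are enough to finish in three cycles, which is also the depth of the dependency chain. Changing the `resources` dictionary and re-running is a toy version of architectural exploration: the description of the expression stays the same while the allocation varies.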
Brian Bailey 10/10/2012 4:06:24 PM User Rank Blogger
Re: Can be done vs should be done
Yes, I am still trying to get my head around how cheap Xilinx is making this technology, so as long as you can overcome the learning curve associated with the new tool there probably are benefits even for small designs. Thanks for pointing that out and now I feel better about using small examples...
jandecaluwe 10/10/2012 3:47:15 PM User Rank Blogger
Re: Can be done vs should be done
"Karl, you are right that it can be used for these examples, but there is the economic factor. If this is the type of complexity you are using, then it is unlikely that you would ever get a decent return on investment for buying the tool. However, with the price that Xilinx is offering it for, the type of complexity necessary is much less than it would have to be for an ASIC."
Surely, for that price, the return on investment should be a no-brainer, if the technology works well?
I am a big fan of small examples. That's how I learned RTL synthesis - by understanding very small examples well, not by trying to make sense of complex ones. Moreover, I think that the benefits of RTL synthesis can be convincingly demonstrated with small examples. (Of course, the benefit increases with complexity.) I would think it's the same with HLS.
Brian Bailey 10/9/2012 4:40:53 PM User Rank Blogger
Re: Can be done vs should be done
Karl, you are right that it can be used for these examples, but there is the economic factor. If this is the type of complexity you are using, then it is unlikely that you would ever get a decent return on investment for buying the tool. However, with the price that Xilinx is offering it for, the type of complexity necessary is much less than it would have to be for an ASIC.
@Brian: I thought about using a 2-port RAM for the variables and Reverse Polish for the algorithm, but then I re-read this and saw that you said no one would ever use HLS (TLS?) for something as simple as this... At what level of complexity does HLS kick in? Certainly the final design has to handle the simple cases.
There has to be input and output, and the even simpler case of a stream of data also has to be handled.
Pipelining is more complex, but since it is already being designed without TLS, that may not be complex enough either.
@Brian, I understand, but I do not have the blind faith to even consider committing a design task to a tool that is incomprehensible. There are cases with today's tools where pre- and post-synthesis results differ, so if that happens with TLS, it seems that no one would even know where to begin.
The scenario that comes to mind is:
> code and simulate in SystemC
> run TLS
> simulate RTL(HDL?)
> compare
First, SystemC is a superset of C. Now, I am not convinced that C synthesis has really been successful, but the C part of SystemC is somehow successfully synthesized by TLS.
It is not too far-fetched to visualize where something went wrong in HDL synthesis, but with TLS being incomprehensible, it is mind-boggling.
Brian Bailey 7/22/2012 9:32:55 PM User Rank Blogger
Re: Can be done vs should be done
Karl - you are right that this is a contrived example created purely to show some of the process that happens. No one would ever use HLS to do something as simple as this. I will get on to power and pipelining and so many other topics in time, but the problem will always be that a big example will not fit in a blog except in a superficial sense. I spent over 30 pages talking about one design, and still superficially, in one of my books.
This kind of example looks like marketing to managers rather than engineers by showing what CAN be done. Engineers should require some cost vs advantage comparison such as the impact of what tool chain has to be used, as well as the resource cost dedicated to such a calculation. I wonder if this example has ever occurred in a design.
The implication of TLS is probably SystemC to generate RTL/HDL, then integrate that into the rest of the chip design flow. Not cheap or easy.
If the example had been some well used DSP calculation done by a skilled designer it would be a different story.
The first case path time would cover the multiply path plus the adder plus the subtractor. The clocked second case could use two clocks, the first for the multiply which is probably the longest and the second covering the adder/subtractor so the overall cycle time increase would cover the reg setup times.
The first case did not register the A output, but the second does and I do not see why.
My guess is that adding the registers and clocking makes the second bigger and use more power because of driving the FPGA fabric with many more connections for muxes and regs.
Yes, Max assumed that the multiplier cost swamped everything else and came to the same conclusion after postulating that clock speed would have to triple.
The second case also illustrates a fallacy among some designers that registering everything is analogous to pipelining and therefore gets max speed.
Seems like a convincingly reasonable guess. Of late I've been wondering how some of these "optimize for power" synthesis settings work in the various tools.
Max Maxfield 7/18/2012 2:57:10 PM User Rank Blogger
Area / Power Guess
Let's say that the adder and subtractor are the same type of block and thus consume the same area and power as each other. Let's say that the multiplier occupies 20X the area and consumes 20X the power of an adder/subtractor. So the first solution has two multipliers and three adder/subtractors, which equals 2 x 20 + 3 x 1 = 40 + 3 = 43 in "dimensionless units".
Basically the adder and subtractor blocks are "lost in the noise".
In the case of the second solution we have one multiplier and one adder/subtractor plus the registers (R) and multiplexers (X). So we have 20 + 1 (plus the registers and multiplexers). Again, the adder/subtractor and registers and multiplexers are pretty insignificant.
So I would say that the second solution occupies 1/2 the area and consumes 1/2 the power, but this is at the same clock frequency which means that you get only 1/3 of the performance. If you multiply the clock frequency by 3 to achieve the original performance, you get 1/2 * 3 = 1.5 x the power consumption of the original solution.
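For anyone who wants to play with the numbers in this guess, the arithmetic can be sketched in a few lines of Python. The 20X multiplier cost is the assumption stated above; ignoring registers and multiplexers, and treating power as scaling linearly with clock frequency, are simplifications made for this sketch.

```python
# Dimensionless units from the guess above: adder/subtractor = 1,
# multiplier = 20. Registers and multiplexers are ignored.
COST = {"mul": 20, "addsub": 1}


def cost(units):
    """Total area (or power at a fixed clock) for a set of allocated units."""
    return sum(COST[kind] * count for kind, count in units.items())


parallel = cost({"mul": 2, "addsub": 3})  # first solution
shared = cost({"mul": 1, "addsub": 1})    # second solution

# Area and power ratio at the same clock frequency.
ratio = shared / parallel

# Assumption: power scales linearly with clock frequency, so tripling the
# clock to recover the original throughput triples the power.
power_at_3x = ratio * 3

if __name__ == "__main__":
    print(f"parallel = {parallel}, shared = {shared}")
    print(f"ratio at same clock = {ratio:.2f}")
    print(f"relative power at 3x clock = {power_at_3x:.2f}")
```

The exact ratio works out slightly under one half, so the power at three times the clock lands just under 1.5 times that of the original solution.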