Tom Burke

Does Synthesis Generate Top-Heavy Results?

devel@latke.net
3/5/2013 4:02:41 PM
User Rank
Guru
Re: It's never stupid to explore
Hamster: This was between two flip-flops still within the FPGA - fully synchronous.

That wasn't at all clear from either of your posts about this problem.

hamster
3/5/2013 3:43:00 PM
User Rank
Blogger
Re: It's never stupid to explore
This was between two flip-flops still within the FPGA - fully synchronous. 

devel@latke.net
3/5/2013 12:16:33 PM
User Rank
Guru
Re: It's never stupid to explore
Hamster,


"Until it clicked what was going on, it was annoying that no matter how much slack I gave it by adding extra cycles to the path, it didn't help. Once it had a flip-flop that it could place along the way to the IOB, everything was good again: no more timing errors!"

And there it is. The combinatorial path from a pin into the fabric is relatively "long" and wholly dependent on routing. And the gotcha, of course, is that a period constraint doesn't cover that path, because there's no starting register. An OFFSET IN constraint does cover that path. It won't fix any timing errors, but it will tell you when you've failed to meet timing.

So when you use the input flop, you get two benefits. First, the period constraint now covers the path from the input flop into the fabric. Second, you get a very fast path from the pin to that input flop. You should still use an OFFSET IN constraint to ensure that you meet input setup and hold, though.
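To make the coverage argument concrete, here is a toy Python model that classifies a timing path by its start and end points and reports which constraint type would analyze it. The `TimingPath` type and the "pad"/"reg" labels are hypothetical, chosen only for illustration; real tools work on their own timing graphs. It shows why adding the input flop moves most of the path under the period constraint.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TimingPath:
    name: str
    start: str  # "pad" or "reg" (hypothetical labels)
    end: str    # "pad" or "reg"

def covering_constraint(path: TimingPath) -> str:
    """Return the constraint type that would analyze this path (toy rule)."""
    if path.start == "reg" and path.end == "reg":
        return "PERIOD"      # needs both a launching and a capturing register
    if path.start == "pad" and path.end == "reg":
        return "OFFSET IN"   # no launching register, so PERIOD never sees it
    if path.start == "reg" and path.end == "pad":
        return "OFFSET OUT"
    return "none of PERIOD / OFFSET IN / OFFSET OUT"

# Without an input flop: one long path from the pin straight into the fabric.
direct = [TimingPath("pin -> fabric reg", "pad", "reg")]

# With an input flop in the IOB: a short pad-to-register hop, then a
# register-to-register path that the period constraint now covers.
with_iob_flop = [
    TimingPath("pin -> IOB flop", "pad", "reg"),
    TimingPath("IOB flop -> fabric reg", "reg", "reg"),
]

for p in direct + with_iob_flop:
    print(f"{p.name}: {covering_constraint(p)}")
```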

hamster
3/4/2013 11:44:24 PM
User Rank
Blogger
Re: It's never stupid to explore
@Devel: Can only agree with you on all points.

I was getting a 375 MHz signal towards the edge of a Spartan-6 LX45, but it would routinely fail timing because of a high percentage of routing delay (this part of the design was fully synchronous). I didn't care if the pixel took a cycle longer to get there; I just wanted it to pass timing.

Until it clicked what was going on, it was annoying that no matter how much slack I gave it by adding extra cycles to the path, it didn't help. Once it had a flip-flop that it could place along the way to the IOB, everything was good again: no more timing errors!
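A back-of-the-envelope Python model of why extra cycles alone didn't help: if the added stages end up packed in one spot (as a LUT shift register would be), the whole route still falls into a single register-to-register hop, whereas flip-flops spread along the route split it. The 375 MHz clock comes from the post; the 4.0 ns total route and 0.6 ns per-hop overhead (clock-to-out plus setup) are hypothetical numbers, and the model assumes the route divides evenly between stages.

```python
import math

def hop_route_delays(total_route_ns: float, stages: int, lumped: bool) -> list[float]:
    """Routing delay carried by each register-to-register hop.

    lumped=True: the stages are packed into one spot, so extra cycles add
    latency but the whole route still lands in a single hop.
    lumped=False: one flip-flop per stage, spread evenly along the route.
    """
    if stages < 1:
        raise ValueError("need at least one stage")
    if lumped:
        return [total_route_ns] + [0.0] * (stages - 1)
    return [total_route_ns / stages] * stages

def worst_slack_ns(period_ns: float, total_route_ns: float, stages: int,
                   overhead_ns: float, lumped: bool) -> float:
    """Worst hop slack; overhead_ns lumps clock-to-out plus setup per hop."""
    return min(period_ns - (d + overhead_ns)
               for d in hop_route_delays(total_route_ns, stages, lumped))

def min_distributed_stages(period_ns: float, total_route_ns: float,
                           overhead_ns: float) -> int:
    """Fewest evenly spread stages that give every hop non-negative slack."""
    budget = period_ns - overhead_ns
    if budget <= 0:
        raise ValueError("clock period too short even for a zero-length route")
    return max(1, math.ceil(total_route_ns / budget))

PERIOD_NS = 1000 / 375   # 375 MHz -> about 2.667 ns
ROUTE_NS = 4.0           # hypothetical total routing delay across the die
OVERHEAD_NS = 0.6        # hypothetical clock-to-out + setup per hop

for stages in (1, 2, 4):
    lumped = worst_slack_ns(PERIOD_NS, ROUTE_NS, stages, OVERHEAD_NS, lumped=True)
    spread = worst_slack_ns(PERIOD_NS, ROUTE_NS, stages, OVERHEAD_NS, lumped=False)
    print(f"{stages} stage(s): lumped {lumped:+.3f} ns, distributed {spread:+.3f} ns")

print("minimum distributed stages:",
      min_distributed_stages(PERIOD_NS, ROUTE_NS, OVERHEAD_NS))
```

With these numbers the lumped case fails by the same margin no matter how many stages are added, while two distributed stages are already enough.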

devel@latke.net
3/4/2013 11:29:29 PM
User Rank
Guru
Re: It's never stupid to explore
Hamster: It has to decide between flip-flops and the more compact implementation using shift registers. The heuristic is that shift registers are the better way to go.

From a functional point of view, what difference does it make that the logic was implemented in sixteen flip-flops versus a single SRL16 element? The functionality is identical. My guess is that the SRL16 implementation is the better way to go inasmuch as it uses (a lot!) fewer resources.
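The functional-equivalence claim can be checked with a cycle-level Python sketch: one model is a chain of discrete D flip-flops, the other is a behavioral stand-in for an SRL16 (16 bits of shift state plus a 4-bit address selecting the output tap, for an effective length of address + 1). `SRL16Model` is a hypothetical name, not a vendor model, and neither class captures timing or metastability, which are exactly the caveats in the two parenthetical notes.

```python
import random

class FlipFlopChain:
    """`depth` discrete D flip-flops in series."""
    def __init__(self, depth: int = 16):
        self.regs = [0] * depth

    def clock(self, d: int) -> int:
        """Apply one rising edge with input d; return the last stage's Q."""
        self.regs = [d & 1] + self.regs[:-1]
        return self.regs[-1]

class SRL16Model:
    """Behavioral sketch of a 16-bit LUT shift register with an address tap."""
    def __init__(self, address: int = 15):
        if not 0 <= address <= 15:
            raise ValueError("SRL16 address must be 0..15")
        self.bits = 0
        self.address = address

    def clock(self, d: int) -> int:
        """Shift d in on a rising edge; return the bit selected by the address."""
        self.bits = ((self.bits << 1) | (d & 1)) & 0xFFFF
        return (self.bits >> self.address) & 1

def outputs(model, stimulus):
    return [model.clock(d) for d in stimulus]

rng = random.Random(2013)
stimulus = [rng.randint(0, 1) for _ in range(1000)]
ff_out = outputs(FlipFlopChain(16), stimulus)
srl_out = outputs(SRL16Model(15), stimulus)
print("identical:", ff_out == srl_out)   # identical: True
```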

(OK, the one place where it might be preferable to have individual flip-flops would be at clock-domain boundaries. I recall reading somewhere -- perhaps Austin knows? -- an argument that said that in terms of metastability, the flip-flop in the fabric or at the I/O is better than the SRL16 because the latter is implemented in the LUT.)

(A second argument might be raw performance. I haven't looked at data sheets for this specific detail in a while, but I recall that flip-flops won for pure speed over SRL16s. I forget which family was discussed, and if I weren't trying to do my taxes I'd look through a couple of data sheets.)

But my point here is this.
  • We care, first and foremost, about functionality. If the logic is functionally correct, then the nitty-gritty of how it's implemented doesn't really matter.
  • Next, we care that the design meets the timing constraints. It may be functionally correct, but if it needs to run at 100 MHz and the timing analysis says it can only do 80 MHz, we lose, so we need to look at how the functionality was implemented and maybe something can be simplified or pipelined or whatever. If we do meet timing, then again, the details of the implementation aren't all that important.
  • Finally, the design has to fit in the target device! This is sometimes at cross-purposes with meeting timing, because to speed things up we may pipeline or replicate logic or do other things that grow the design. If the design doesn't fit, then one must look at how the logic was implemented (maybe the SRL16 is a better use of resources than 16 flip-flops!), or one can simply punt and use the next larger device in the family (if the board hasn't been built yet and that's an option). And if the design does fit, then whether it takes up 80% or 65% of the XC3S200AN doesn't matter. The only optimization worth chasing would be fitting into the smaller XC3S50AN, and that takes a much larger size reduction (and if you need more than 3 BRAMs, you're outta luck anyway).
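The priority order in the list above can be written as a short decision procedure. This is a hypothetical Python sketch: `BuildReport`, `Device`, `next_problem` and the device capacities are invented for illustration and do not correspond to any real part or tool report format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BuildReport:
    functional_ok: bool   # e.g. from a 0-delay RTL simulation
    fmax_mhz: float       # from static timing analysis
    luts_used: int
    brams_used: int

@dataclass
class Device:
    name: str
    luts: int
    brams: int

def next_problem(report: BuildReport, required_mhz: float, device: Device) -> Optional[str]:
    """Return the first unmet requirement, checked in priority order, or None."""
    if not report.functional_ok:
        return "functionality"
    if report.fmax_mhz < required_mhz:
        return "timing"
    if report.luts_used > device.luts or report.brams_used > device.brams:
        return "fit"
    return None   # done: how full the device is no longer matters

part = Device("hypothetical-part", luts=1000, brams=3)
print(next_problem(BuildReport(True, 80.0, 500, 2), 100.0, part))   # timing
print(next_problem(BuildReport(True, 120.0, 800, 2), 100.0, part))  # None
print(next_problem(BuildReport(True, 120.0, 500, 4), 100.0, part))  # fit
```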


Hamster: I am only moaning about it because in this case it was not what I wanted - the tools were not aware that I was trying to distribute the routing delay. Using primitives fixed that :-)

My guess is that you were trying to do something outside of the usual synchronous design paradigm for which FPGAs are suited. In that case, all bets are off.

aj1s
3/4/2013 9:00:57 PM
User Rank
Guru
Re: It's never stupid to explore
@hamster: "How in VHDL should I tell the synthesis tools to generate a chain of flip-flops and not a shift register? It is starting to look as if I can't, without resorting to primitives or "tricks" that disable the heuristic (like adding a reset)."

I'm not sure of the exact constraint/directive that XST uses, but it should have one similar to "syn_preserve". This can prevent the synthesis tool from absorbing a signal inside a primitive where it is not accessible via JTAG, gate-level simulation, etc. It may work here to prevent the intermediate signals between the erstwhile registers from being absorbed into a single SRL16.
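A toy Python model of why preserving intermediate signals can break up shift-register extraction: an SRL16 only exposes its selected output, so any register whose output must stay a visible net can end a shift register but cannot be buried inside one. The `preserved` set and the `min_srl_len` threshold are hypothetical; the real extraction rules are tool-specific and documented by each vendor.

```python
def group_chain(length: int, preserved: set[int], min_srl_len: int = 2):
    """Toy model of shift-register extraction on a register chain.

    Register i drives register i + 1. Putting i in `preserved` means register
    i's output must remain a visible net, so a shift register may end at that
    stage but cannot hide it internally. Runs of at least `min_srl_len`
    stages become "SRL"; shorter runs stay as "FF".
    """
    groups, start = [], 0
    for i in range(length):
        if i in preserved or i == length - 1:
            run = i - start + 1
            groups.append(("SRL" if run >= min_srl_len else "FF", start, i))
            start = i + 1
    return groups

print(group_chain(16, set()))           # [('SRL', 0, 15)]
print(group_chain(16, {7}))             # [('SRL', 0, 7), ('SRL', 8, 15)]
print(len(group_chain(16, set(range(15)))), "separate flip-flops")
```

Preserving every intermediate net leaves sixteen individual flip-flops, which is what Hamster wanted; preserving a single midpoint splits the chain in two.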

In short, you need to become familiar with synthesis constraints/directives for your synthesis tool. They can do lots of things to formally communicate your "intent" on top of the coded behavior. But the result, not counting non-clock-cycle delays, will always match the behavior of the design.

Most synthesis constraints can be specified in the VHDL code by means of custom attributes applied to specific objects. Or they can be specified in separate constraints files to be invoked by the synthesis tools. There are advantages and disadvantages to both methods. Again, refer to your tool's documentation to determine the specifics of either method.

Also, there seems to be some confusion about what happens during synthesis and what happens during placement and routing. The synthesis tool produces a netlist of a circuit that matches your RTL in behavior. It also produces a constraint file to relay the effects of any synthesis constraints/directives to the placement and routing tool. 

Most P&R tools will "optimize" the synthesis netlist a little, but not much. Then they place and route the design to meet the specified timing constraints. Static Timing Analysis (STA) is used to verify that the timing constraints are met in the final results. 
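For readers new to static timing analysis, here is a minimal single-clock sketch in Python: it propagates arrival times through a small timing graph in topological order and reports setup slack at each capturing register. The "Q:"/"D:" node naming, the lumped per-edge delays and the numbers are assumptions for illustration; real STA also handles clock skew, hold checks, multiple clocks and much more.

```python
from collections import defaultdict
from graphlib import TopologicalSorter  # Python 3.9+

def endpoint_slacks(edges, period_ns, clk_to_q_ns, setup_ns):
    """Single-clock setup analysis on a toy timing graph.

    edges: (from_node, to_node, delay_ns) triples, where delay_ns lumps cell
    and routing delay. "Q:<reg>" nodes are launching register outputs and
    "D:<reg>" nodes are capturing register inputs. Clock skew is ignored.
    """
    preds = defaultdict(list)
    nodes = set()
    for src, dst, delay in edges:
        preds[dst].append((src, delay))
        nodes.update((src, dst))

    # TopologicalSorter takes {node: predecessors} and yields drivers first.
    graph = {n: [src for src, _ in preds[n]] for n in nodes}
    arrival = {}
    for n in TopologicalSorter(graph).static_order():
        if n.startswith("Q:"):
            arrival[n] = clk_to_q_ns
        elif preds[n]:
            arrival[n] = max(arrival[src] + d for src, d in preds[n])
        else:
            # No launching register (e.g. a pad input): PERIOD doesn't cover it.
            raise ValueError(f"{n} has no register driver")

    required = period_ns - setup_ns
    return {n: required - arrival[n] for n in nodes if n.startswith("D:")}

edges = [
    ("Q:a", "lut1", 0.4),
    ("lut1", "lut2", 0.6),
    ("lut2", "D:b", 0.5),
    ("Q:a", "D:b", 0.2),
    ("Q:b", "D:a", 0.3),
]
slacks = endpoint_slacks(edges, period_ns=2.5, clk_to_q_ns=0.5, setup_ns=0.2)
for name, slack in sorted(slacks.items()):
    print(f"{name}: {slack:+.2f} ns")
```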

Some toolsets provide a "physical synthesis" option. This blurs the line between traditional synthesis and P&R tools, by bringing some of the placement awareness into the synthesis tool to allow better, more optimal solutions. Sometimes this is also associated with "design planning" tools.

I know this must feel like stumbling around in the dark, finding out about things as you bump into them. Be patient, and read the tool documentation. Or ask us!

Andy

jandecaluwe
3/4/2013 2:26:19 PM
User Rank
Blogger
Re: It's never stupid to explore
@hamster "...the implemented behavior would not match the behavior of the input code". That is exactly what I got - my input code included the timing constraints, and the way it decided to implement it didn't work."

I'll try to be precise. When I talk about "match the behavior of the input code" I mean functional correctness only (as you would verify in a 0-delay simulation).

Of course, you are completely right that a design that doesn't meet area, timing, or power constraints also "does not work".

However, the point I'm trying to convey is that a "design intent" driven tool wouldn't even guarantee functional correctness, and that such a tool would be a methodological nightmare. In spite of this, there are still people who think synthesis works or should work like this, which is why I keep pointing out that it's a bad idea.

The important point to understand is that, for synchronous design, one can rely on synthesis for functional correctness. In particular, any heuristic that it uses to select implementation primitives will (should) never compromise this hard constraint. That is a very good start :-) From there, of course, we often have to work hard to meet the constraints - otherwise there is no solution.

devel@latke.net
3/4/2013 12:59:31 PM
User Rank
Guru
Re: It's never stupid to explore
Hamster,

When you include the different optimization, mapping, placement and routing options there are literally billions and billions of synthesis implementations that would match the RTL functionality.

So how do the tools select which of these billions of options to actually present to me as the final implemented design? Do they generate all the possible solutions and give me the best? No, they have to employ heuristics of some sort.

I suggest that it's a lot simpler. The results you get are based on two things: target architecture features and design constraints.

For the former: the synthesizer understands the target architecture. It understands, for example, that a Spartan-3AN FPGA has dedicated multipliers, so when it sees code that infers a multiplier, it uses that primitive rather than building one (which it would do from its own library of primitives that implement various functions). The synthesizer also "knows" about fast carry chains and can therefore implement fast adders that use them.

Similarly with combinatorial logic. The synthesizer will do smart logic optimization (basically Karnaugh maps) based on best fits to the architecture: 6-input LUTs vs 4-input LUTs, the various extra muxes in the slice.
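One reason LUT width matters so much: each k-input LUT takes at most k signals and produces one, so it can reduce the number of live signals by at most k - 1. That gives a simple lower bound on LUT count for a single output that depends on all n inputs, and a wide AND built as a tree reaches it. This Python sketch is a simplified model that ignores the extra slice muxes mentioned above; the function names are illustrative.

```python
import math

def min_luts(n_inputs: int, k: int) -> int:
    """Lower bound on k-input LUTs for one output that depends on all n inputs."""
    if n_inputs < 1 or k < 2:
        raise ValueError("need n_inputs >= 1 and k >= 2")
    return max(1, math.ceil((n_inputs - 1) / (k - 1)))

def and_tree_luts(n_inputs: int, k: int) -> int:
    """LUT count for a wide AND built greedily as a tree of k-input LUTs."""
    signals, luts = n_inputs, 0
    while signals > 1:
        take = min(k, signals)
        signals -= take - 1   # `take` signals in, one signal out
        luts += 1
    return max(luts, 1)

for k in (4, 6):
    print(f"16-input AND, {k}-input LUTs: bound {min_luts(16, k)}, tree {and_tree_luts(16, k)}")
# 16-input AND, 4-input LUTs: bound 5, tree 5
# 16-input AND, 6-input LUTs: bound 3, tree 3
```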

And then we have timing constraints. Without a period constraint, the tools are free to implement and route the logic in any way they see fit, so they probably choose a lazy default and make no attempt to optimize routing. With a period constraint, they attempt to pack related logic as close together as possible (probably using some standard geometric techniques). The tools know the loading on each net, so some registers may be replicated to reduce fanout (which also eases routing). The tools also know the amount of routing resources available.
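Register replication for fanout can be sketched in a few lines of Python: split a register's loads across copies so that no copy drives more than a chosen maximum, grouping nearby loads together. The load coordinates, the `max_fanout` limit and the `ctrl_reg_rep*` names are hypothetical, and sorting by position is only a crude stand-in for the placement awareness real tools use.

```python
import math

def replicate_for_fanout(loads, max_fanout):
    """Split a register's loads across copies so no copy exceeds max_fanout.

    loads: (name, x, y) tuples. Sorting by position before chunking keeps
    each copy's loads near one another.
    """
    if max_fanout < 1:
        raise ValueError("max_fanout must be at least 1")
    ordered = sorted(loads, key=lambda load: (load[1], load[2]))
    n_copies = math.ceil(len(ordered) / max_fanout)
    return {
        f"ctrl_reg_rep{i}": [name for name, _, _ in ordered[i * max_fanout:(i + 1) * max_fanout]]
        for i in range(n_copies)
    }

loads = [(f"ff{i}", i % 5, i // 5) for i in range(10)]
copies = replicate_for_fanout(loads, max_fanout=4)
for reg, driven in copies.items():
    print(reg, driven)
```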

And remember that the tools don't search for the "best" solution or the "optimal" solution. They do what is necessary to meet the given timing constraint. If you tell the tools that some logic must run at 100 MHz, they won't struggle to meet 200 MHz. If the tools come up with a design that uses 60% of your target device, there's no point in trying to get them to fit in 50% of the target device.

OK, not simple. But the tools really do know everything about the target device and that information is used to advantage.

hamster
3/4/2013 12:34:32 PM
User Rank
Blogger
Re: It's never stupid to explore
@Jan "Well no, it didn't ignore the behavior at all. The synthesis implementation will match the RTL functionally in all cases - simulate it and you'll see."

When you include the different optimization, mapping, placement and routing options there are literally billions and billions of synthesis implementations that would match the RTL functionality.

So how do the tools select which of these billions of options to actually present to me as the final implemented design? Do they generate all the possible solutions and give me the best? No, they have to employ heuristics of some sort.

"...a heuristic is a technique designed for solving a problem more quickly when classic methods are too slow, or for finding an approximate solution when classic methods fail to find any exact solution. This is achieved by trading optimality, completeness, accuracy, and/or precision for speed."

The heuristic that goes something like "if you have a chain of registers, without a reset, and with only one input and one output, then replace it with a LUT-based shift register" is generally a good one - it uses fewer registers, less routing, and so on. It is the obvious thing to do. In my little case it was not what I wanted, as it lumped all the routing delays into the input and output paths, and the design will not meet timing unless the routing delay is spread out...
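Here is the quoted rule written out as a toy Python predicate, which makes it easy to see both why adding a reset defeats it (the SRL16 primitive has no reset input) and what it never considers: where the chain's source and destination sit on the die. `Reg` and `looks_like_shift_register` are hypothetical names; real synthesis tools apply more detailed, tool-specific rules.

```python
from dataclasses import dataclass

@dataclass
class Reg:
    name: str
    has_reset: bool = False
    fanout: int = 1   # number of loads on this register's output

def looks_like_shift_register(chain):
    """Toy version of the quoted rule; chain[i] feeds chain[i + 1].

    Note what the rule does not look at: placement or routing distance.
    """
    if len(chain) < 2:
        return False
    if any(r.has_reset for r in chain):
        return False          # the SRL16 primitive has no reset input
    if any(r.fanout != 1 for r in chain[:-1]):
        return False          # an intermediate stage is tapped elsewhere
    return True

plain = [Reg(f"r{i}") for i in range(16)]
with_reset = plain[:8] + [Reg("r8", has_reset=True)] + plain[9:]
tapped = [Reg(f"t{i}", fanout=2 if i == 5 else 1) for i in range(16)]

print(looks_like_shift_register(plain))       # True
print(looks_like_shift_register(with_reset))  # False
print(looks_like_shift_register(tapped))      # False
```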

How in VHDL should I tell the synthesis tools to generate a chain of flip-flops and not a shift register? It is starting to look as if I can't, without resorting to primitives or "tricks" that disable the heuristic (like adding a reset).

"...the implemented behavior would not match the behavior of the input code". That is exactly what I got - my input code included the timing constraints, and the way it decided to implement it didn't work. I had to give it a hand to guide it to implement it the way it should have done it.

The tools could be programmed with a rule like "oh, and should the timing of a path to/from a shift register fail, then split the shift register in half and try again".

jandecaluwe
3/4/2013 9:51:31 AM
User Rank
Blogger
Re: It's never stupid to explore
@neilla "After you left I had to look at this problem, and found out it was all due to an asynchronous reset going into a state machine. It all worked fine after synchronising the reset."

Thanks for pointing that out. Newbies should be confident that synchronous design and synthesis are reliable. They should understand that anyone trying to tell them otherwise is a false prophet that should be ignored.
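The thread doesn't say exactly which circuit neilla used, but one common form of reset synchronization is "assert asynchronously, release synchronously" through a short chain of flip-flops. Here is a cycle-level Python sketch of that pattern; the active-high polarity, the two-stage depth and the `ResetSynchronizer` name are assumptions for illustration.

```python
class ResetSynchronizer:
    """Cycle-level sketch of a two-flop reset synchronizer (active-high reset).

    Assertion is asynchronous: rst_out goes high as soon as rst_in does.
    Release is synchronous: once rst_in drops, zeros must be clocked through
    every stage before rst_out falls, so downstream logic such as a state
    machine leaves reset aligned to a clock edge.
    """
    def __init__(self, stages: int = 2):
        self.stages = [1] * stages
        self.rst_in = 1

    def set_reset(self, level: int) -> None:
        self.rst_in = level & 1
        if self.rst_in:
            self.stages = [1] * len(self.stages)   # asynchronous preset

    def clock(self) -> None:
        if self.rst_in:
            self.stages = [1] * len(self.stages)
        else:
            self.stages = [0] + self.stages[:-1]   # shift the release through

    @property
    def rst_out(self) -> int:
        return self.stages[-1]

sync = ResetSynchronizer()
sync.set_reset(0)
print("released, before any edge:", sync.rst_out)   # 1
sync.clock()
print("after edge 1:", sync.rst_out)                # 1
sync.clock()
print("after edge 2:", sync.rst_out)                # 0
sync.set_reset(1)
print("re-asserted, no edge needed:", sync.rst_out) # 1
```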
