In my previous column, I mentioned differences in implementation. Today we'll take a quick look at how the compiler (OK, synthesizer, but I'm probably still gonna call it a compiler) appears to translate some things.
In my brief time here on All Programmable Planet, I've noticed that a lot of people out there are hammering out code to develop not just FPGAs, but also ASICs, PLDs, and whatever other stuff they use HDLs for. I respect those guys a lot -- they're doing lots of work and getting some nifty things done. Hopefully, some of them (and you) respect me for my background of hardware nuts and bolts, even though I've not done all that much.
Or maybe not. Thinking about it, you probably respect me as much as my kids do. To them, I'm just a wallet with wheels.
Speaking of kids, they seem to think you can regularly get something for nothing. That's the whole point of this installment. Let's take a look at the technology schematic of the ring counter we built last time. In the Design Hierarchy window, under the Synthesize line, there is a View Technology Schematic option, as illustrated below.
Double-click that option to access the following dialogue window.
Click OK on the wizard, and you'll be presented with the following schematic/symbol.
No big deal, right? Now double-click on the big symbol, and the integrated software environment will show you what's in the guts, as illustrated below.
This looks pretty much like what we built in the schematic, right? Well, now open up Duane Benson's project from a few months back, and do the same thing. Make sure it's all compiled and working, and then take a look at the technology schematic. What do you see? This is what I got.
Quite a bit different, huh? Never mind the clock buffer thingy (that's a nifty piece of kit), but the implementation of his design looks to have used a lot more hardware. I see lookup tables, registers, XORs -- maybe twice as much hardware. It looks to me like his single line "led_count <= led_count+1" has created a full adder -- 26 bits wide. Here's that portion of the code.
OK, that's neat, and maybe with the way logic cells work, it doesn't make a big difference. But from my point of view, there's a big warning flag waving here. Blindly hammering out code could lead to some unforeseen issues. These issues could be related to timing closure, the availability of resources on your device, or who knows what.
I also have to wonder about optimization. Is this already optimized, or is there another step that cleans some stuff up? Is there a way to see the after-optimization schematic? Is the way Benson incremented his register one reason so much hardware was used? Would it have been different had he added only a single bit?
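On the "added only a single bit" question, one way to reason about it is that even a plain +1 needs a carry that can ripple through every bit of the register, so each of the 26 bits still needs a little logic. Here is a minimal Python sketch of that idea. It is a behavioral model only, not what the synthesizer actually emits, and the name `increment_ripple` is made up for illustration.

```python
WIDTH = 26  # width of the led_count register mentioned in the column


def increment_ripple(value, width=WIDTH):
    """Add 1 to value using one XOR (sum) and one AND (carry) per bit."""
    carry = 1  # the constant "+1" enters as the initial carry
    result = 0
    for bit in range(width):
        a = (value >> bit) & 1
        result |= (a ^ carry) << bit  # sum bit for this position
        carry = a & carry             # carry passed to the next position
    return result  # the final carry is dropped, so the counter wraps


if __name__ == "__main__":
    print(bin(increment_ripple(0b0111)))
```

If this model is roughly right, the XORs in the technology schematic may simply be the per-bit sum logic of the incrementer, and the hardware grows with the width of the register rather than with the size of the number being added.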
What do the experts say? Is this something we should worry about? In general, do we use tons more real estate coding in Verilog (or VHDL) than we do just wiring things up by hand? Could these things create timing problems? What are the pitfalls? Was this a stupid exercise and a waste of my time? Are my intuitions like bird dogs barking up the wrong tree?
Until it clicked what was going on, it was getting annoying that no matter how much slack I gave it by adding extra cycles to the path, it wouldn't help. Once it had a flip-flop that it could place along the way to the IOB, everything was good again - no more timing errors!
And there it is. The combinatorial path from a pin into the fabric is relatively "long" and wholly dependent on routing. And the gotcha, of course, is that a period constraint doesn't cover that path as there's no starting register. An OFFSET IN constraint does cover that path. It won't fix any timing errors but it'll tell you that you don't win.
So when you use the input flop, you get two benefits. One is that the period constraint covers the path from the input flop into the fabric. The second is that you've got a very fast path from the pin to that input flop. You still should use the OFFSET IN constraint to ensure that you meet input set-up and hold, though.
I was getting a 375MHz signal towards the edge of a Spartan 6 LX45, but it would routinely fail timing due to a high % of routing delays (this part of the design was fully synchronous). I didn't care if the pixel took a cycle longer to get there, I just wanted it to pass timing.
Until it clicked what was going on, it was getting annoying that no matter how much slack I gave it by adding extra cycles to the path, it wouldn't help. Once it had a flip-flop that it could place along the way to the IOB, everything was good again - no more timing errors!
Hamster: It has to decide between flip-flops and the more compact implementation using shift registers. The heuristic is that shift registers are the better way to go.
From a functional point-of-view, what difference does it make that the logic was implemented in sixteen flip-flops vs a single SRL16 element? The functionality is identical. My guess is that the SRL16 implementation is the better way to go inasmuch as it uses (a lot!) fewer resources.
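To make the "functionality is identical" claim concrete, here is a small Python sketch that models both implementations as 16-cycle delay lines and feeds them the same bit stream. The class names are hypothetical and the models ignore all timing and placement effects; they only capture cycle-by-cycle behavior.

```python
from collections import deque


class FlipFlopChain:
    """Sixteen discrete registers; every stage shifts on each clock."""

    def __init__(self, depth=16):
        self.regs = [0] * depth

    def clock(self, d):
        out = self.regs[-1]               # last stage, before the clock edge
        self.regs = [d] + self.regs[:-1]  # every register takes its neighbor
        return out


class Srl16Model:
    """LUT-style shift register: one 16-entry memory read at the last tap."""

    def __init__(self, depth=16):
        self.mem = deque([0] * depth, maxlen=depth)

    def clock(self, d):
        out = self.mem[-1]       # oldest entry, before the clock edge
        self.mem.appendleft(d)   # a full deque drops its oldest entry
        return out


def same_behavior(stimulus):
    """True if both models produce the same output on every clock."""
    chain, srl = FlipFlopChain(), Srl16Model()
    return all(chain.clock(bit) == srl.clock(bit) for bit in stimulus)
```

A 0-delay comparison like this says nothing about where the registers end up on the die, which is exactly the part that matters once timing enters the picture.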
(OK, the one place where it might be preferable to have individual flip-flops would be at clock-domain boundaries. I recall reading somewhere -- perhaps Austin knows? -- an argument that said that in terms of metastability, the flip-flop in the fabric or at the I/O is better than the SRL16 because the latter is implemented in the LUT.)
(A second argument might be in terms of raw performance. I haven't looked at data sheets for this specific detail in a while, but I recall that the flip-flops won for pure speed over SRL16s. I forget which family was discussed, and if I wasn't trying to do my taxes I'd look through a couple of data sheets.)
But my point here is this.
We care, first and foremost, about functionality. If the logic is functionally correct, then the nitty-gritty of how it's implemented doesn't really matter.
Next, we care that the design meets the timing constraints. It may be functionally correct, but if it needs to run at 100 MHz and the timing analysis says it can only do 80 MHz, we lose, so we need to look at how the functionality was implemented and maybe something can be simplified or pipelined or whatever. If we do meet timing, then again, the details of the implementation aren't all that important.
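The 100 MHz versus 80 MHz example can be turned into slack with a couple of lines of arithmetic. This is a minimal sketch; the function names are made up, and real timing reports include clock uncertainty and other terms that are ignored here.

```python
def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def slack_ns(required_mhz, achieved_mhz):
    """Positive slack means the constraint is met; negative means it fails."""
    return period_ns(required_mhz) - period_ns(achieved_mhz)


if __name__ == "__main__":
    # Required period is 10 ns, the worst path needs 12.5 ns.
    print(slack_ns(100, 80))
```

For the numbers in the text, the worst path is 2.5 ns too long, and that is the amount that simplifying or pipelining the logic would have to recover.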
Finally, the design has to fit in the target device! This is sometimes at cross-purposes with meeting timing, because to speed things up we may pipeline or replicate logic or do other things that grow the design. If the design doesn't fit, then one must look at how the logic was implemented (maybe the SRL16 is a better use of resources than 16 flip-flops!), or one can simply punt and use the next larger device in the family (if the board wasn't already built and that's an option). And if the design does fit, then whether it takes up 80% of the XC3S200AN or 65% of that device doesn't matter, because the best optimization would be to fit into the XC3S50AN, and you need more size reduction to make it fit (and if you need more than 3 BRAMs you're outta luck anyway).
Hamster: I am only moaning about it as in this case it was not what I wanted - the tools were not aware that I was trying to distribute the routing delay. Using primitives fixed that :-)
My guess is that you were trying to do something outside of the usual synchronous design paradigm for which FPGAs are suited. In that case, all bets are off.
@hamster: "How in VHDL should I tell the synthesis tools to generate a chain of flip-flops and not a shift register? It is starting to look as if I can't, without resorting to primitives or "tricks" that disable the heuristic (like adding a reset)."
I'm not sure of the exact constraint/directive that XST uses, but it should have one similar to "syn_preserve". This can prevent the synthesis tool from absorbing a signal inside a primitive where it is not accessible via JTAG, gate-level simulation, etc. It may work here to prevent the intermediate signals between the erstwhile registers from being absorbed into a single SRL16.
In short, you need to become familiar with synthesis constraints/directives for your synthesis tool. They can do lots of things to formally communicate your "intent" on top of the coded behavior. But the result, not counting non-clock-cycle delays, will always match the behavior of the design.
Most synthesis constraints can be specified in the VHDL code by means of custom attributes applied to specific objects. Or they can be specified in separate constraints files to be invoked by the synthesis tools. There are advantages and disadvantages to both methods. Again, refer to your tool's documentation to determine the specifics of either method.
Also, there seems to be some confusion about what happens during synthesis and what happens during placement and routing. The synthesis tool produces a netlist of a circuit that matches your RTL in behavior. It also produces a constraint file to relay the effects of any synthesis constraints/directives to the placement and routing tool.
Most P&R tools will "optimize" the synthesis netlist a little, but not much. Then they place and route the design to meet the specified timing constraints. Static Timing Analysis (STA) is used to verify that the timing constraints are met in the final results.
Some toolsets provide a "physical synthesis" option. This blurs the line between traditional synthesis and P&R tools, by bringing some of the placement awareness into the synthesis tool to allow better, more optimal solutions. Sometimes this is also associated with "design planning" tools.
I know this must seem like you are stumbling around in the dark, finding out about things as you bump into them. Be patient, and read the tool documentation. Or ask us!
@hamster: "'...the implemented behavior would not match the behavior of the input code'. That is exactly what I got - my input code included the timing constraints, and the way it decided to implement it didn't work."
I'll try to be precise. When I talk about "match the behavior of the input code" I mean functional correctness only (as you would verify in a 0-delay simulation).
Of course, you are completely right that a design that doesn't meet area, timing, or power constraints also "does not work".
However, the point I'm trying to convey is that a "design intent" driven tool wouldn't even guarantee functional correctness, and that such a tool would be a methodological nightmare. In spite of this, there are still people who think synthesis works or should work like this, which is why I keep pointing out that it's a bad idea.
The important point to understand is that, for synchronous design, one can rely on synthesis for functional correctness. In particular, any heuristic that it uses to select implementation primitives will (should) never compromise this hard constraint. That is a very good start :-) From there of course, we often have to work hard to meet the constraints - otherwise there is no solution.
When you include the different optimization, mapping, placement and routing options there are literally billions and billions of synthesis implementations that would match the RTL functionality.
So how do the tools select which of these billions of options to actually present to me as the final implemented design? Do they generate all the possible solutions and give me the best? No, they have to employ heuristics of some sort.
I suggest that it's a lot simpler. The results you get are based on two things: target architecture features and design constraints.
For the former: the synthesizer understands the target architecture. It understands, for example, that a Spartan 3AN FPGA has a multiplier, so when it sees the code inferring a multiplier, it chooses to use that primitive rather than build one (which it would do from its own library of primitives that implement various functions). The synthesizer "knows" about fast carry chains and as such can implement fast adders which use them.
Similarly with combinatorial logic. The synthesizer will do smart logic optimization (basically k-maps) based on best fits to the architecture -- 6-input LUTs vs 4-input LUTs, various extra muxes in the slice.
And then we have timing constraints. Without a period constraint, the tools are free to implement and route the logic in any way that they see fit, so they probably choose a lazy default and make no attempt to optimize routing. With a given period constraint, they attempt to pack related logic as close together as possible (probably using some standard geometry things). The tools know the loading on each line, so some registers may be replicated to reduce fanout (and also ease routing). The tools also know the amount of routing resources available.
And remember that the tools don't search for the "best" solution or the "optimal" solution. They do what is necessary to meet the given timing constraint. If you tell the tools that some logic must run at 100 MHz, they won't struggle to meet 200 MHz. If the tools come up with a design that uses 60% of your target device, there's no point in trying to get them to fit in 50% of the target device.
OK, not simple. But the tools really do know everything about the target device and that information is used to advantage.
@Jan "Well no, it didn't ignore the behavior at all. The synthesis implementation will match the RTL functionally in all cases - simulate it and you'll see."
When you include the different optimization, mapping, placement and routing options there are literally billions and billions of synthesis implementations that would match the RTL functionality.
So how do the tools select which of these billions of options to actually present to me as the final implemented design? Do they generate all the possible solutions and give me the best? No, they have to employ heuristics of some sort.
"...a heuristic is a technique designed for solving a problem more quickly when classic methods are too slow, or for finding an approximate solution when classic methods fail to find any exact solution. This is achieved by trading optimality, completeness, accuracy, and/or precision for speed."
The heuristic that goes something like "if you have a chain of registers, without a reset, and with only one input and output, then replace it with a LUT-based shift register" is generally a good one - it uses fewer registers, less routing, and so on. It is the obvious thing to do. In my little case it was not what I wanted as it lumped all the routing delays into the input and output paths, and the design will not meet timing if the routing delay is not spread out...
How in VHDL should I tell the synthesis tools to generate a chain of flip-flops and not a shift register? It is starting to look as if I can't, without resorting to primitives or "tricks" that disable the heuristic (like adding a reset).
"...the implemented behavior would not match the behavior of the input code". That is exactly what I got - my input code included the timing constraints, and the way it decided to implement it didn't work. I had to give it a hand to guide it to implement it the way it should have done it.
The tools could be programmed so that "oh, and should the timing of a path to/from a shift register fail, then split the shift register in half and try again".
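The "spread out the routing delay" argument can be sketched with a toy model. Assume, purely for illustration, that a signal has some total routing delay to cover and that registers can be spaced evenly along the route. A single SRL counts as one placeable point, because all of its stages live in one LUT, while a chain of flip-flops gives the placer several. The 6 ns figure below is hypothetical, the function names are made up, and logic delay, setup time, and clock skew are all ignored.

```python
def worst_hop_ns(total_route_ns, placeable_registers):
    """Longest register-to-register hop if the registers are evenly spaced."""
    return total_route_ns / (placeable_registers + 1)


def meets_timing(total_route_ns, placeable_registers, clock_mhz):
    """True if the longest hop fits within one clock period."""
    return worst_hop_ns(total_route_ns, placeable_registers) <= 1000.0 / clock_mhz


if __name__ == "__main__":
    TOTAL_ROUTE_NS = 6.0  # hypothetical total routing delay
    CLOCK_MHZ = 375       # clock rate mentioned in the thread
    print("one SRL:         ", meets_timing(TOTAL_ROUTE_NS, 1, CLOCK_MHZ))
    print("three flip-flops:", meets_timing(TOTAL_ROUTE_NS, 3, CLOCK_MHZ))
```

In this model, splitting a shift register in half as suggested above just adds one more placeable point, which shortens the worst hop.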
@neilla "After you left I had to look at this problem, and found out it was all due to an asynchronous reset going in to a state machine. It all worked fine after synchronising the reset."
Thanks for pointing that out. Newbies should be confident that synchronous design and synthesis are reliable. They should understand that anyone trying to tell them otherwise is a false prophet that should be ignored.