A number of issues can affect your FPGA designs. One very important concern is the core and/or input/output (I/O) voltages dipping below the design limits for the FPGA. One of the first effects is that the part slows down, which can lead to timing faults and other difficult-to-find logic errors.
This means that most FPGA circuits on which one must depend should employ some form of independent "brownout" monitoring of all voltages. This function should control a pin that will (a) drive the FPGA into reset and (b) drive the I/O into a safe state. Ideally, the safe state should be defined such that the system as a whole will still operate under brownout conditions. Further food for thought: individual cards on a backplane can unseat, or vibrate in their connectors, causing momentary power issues.
Many different types of voltage monitoring circuits are available. TI, Maxim, and many others make these parts for a variety of applicable voltages. For example:
The TPS3808-EP programmable delay supervisory circuit from TI
Another important issue that can affect FPGA designs is the loss of the clock oscillator, which can leave the I/O hung in an unsafe state. There are a couple of techniques that can be used for detecting and handling these events.
The first method is the "windowed watchdog." This uses an external integrated circuit (IC) that implements a windowed watchdog function completely independently of the FPGA -- except for the I/O to "pet the dog" to keep it happy. These ICs also typically monitor one or more voltages as well as perform a power-on reset (POR).
The "petting the dog" can be performed by your control state-machine or by your processor core, which are clocked by your PLL/Oscillator. An example of this type of component is the TPS3813J25 supervisor with programmable watchdog window from TI.
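To make the window idea concrete, here is a minimal behavioral sketch in Python. It is a model for reasoning only, not HDL and not the behavior of any particular supervisor IC. The class name `WindowedWatchdog`, the helper `run`, and the 10 ms / 50 ms window limits are all assumptions chosen for illustration. The point it shows is that a pet arriving too early or too late both trip the reset.

```python
class WindowedWatchdog:
    """Behavioral model: a pet is valid only if it arrives between
    t_min_ms and t_max_ms after the previous pet (hypothetical limits)."""

    def __init__(self, t_min_ms, t_max_ms):
        self.t_min_ms = t_min_ms
        self.t_max_ms = t_max_ms
        self.last_pet_ms = 0.0
        self.reset_asserted = False

    def pet(self, now_ms):
        """Record a pet at now_ms; assert reset if it is outside the window."""
        elapsed = now_ms - self.last_pet_ms
        if elapsed < self.t_min_ms or elapsed > self.t_max_ms:
            self.reset_asserted = True
        self.last_pet_ms = now_ms
        return self.reset_asserted

    def poll(self, now_ms):
        """Check for a timeout when no pet has arrived, e.g. a stopped clock."""
        if now_ms - self.last_pet_ms > self.t_max_ms:
            self.reset_asserted = True
        return self.reset_asserted


def run(pet_times_ms, end_ms, t_min_ms=10.0, t_max_ms=50.0):
    """Feed a list of pet timestamps to the model, then poll once at end_ms."""
    wd = WindowedWatchdog(t_min_ms, t_max_ms)
    for t in pet_times_ms:
        wd.pet(t)
    return wd.poll(end_ms)
```

With these assumed limits, pets every 20 ms keep the dog happy, two pets 2 ms apart trip the reset (too early), and a gap with no pets trips it as well (too late, as would happen if the clock driving the state machine stopped).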
Another method is to use two different oscillator time bases. Each is divided down to a nice, even, similar frequency, and the resulting divide counts are compared against each other as a ratio. If the ratio moves away from its expected value, one of the oscillators has drifted or stopped.
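The ratio check can be sketched in a few lines of Python. This is only a model of the comparison, assuming one clock is used as a gate (counting a fixed number of its cycles) while edges of the other clock are counted. The function names, the 5% tolerance, and the example frequencies are hypothetical.

```python
def expected_count(ref_cycles, f_ref_hz, f_mon_hz):
    """Monitored-clock edges expected during ref_cycles of the reference clock."""
    return ref_cycles * f_mon_hz / f_ref_hz


def clock_ok(measured_count, ref_cycles, f_ref_hz, f_mon_hz, tolerance=0.05):
    """True if the measured count is within +/- tolerance of the expected count."""
    expected = expected_count(ref_cycles, f_ref_hz, f_mon_hz)
    return abs(measured_count - expected) <= tolerance * expected


# Example: a 50 MHz reference gating a 25 MHz monitored clock for 1000 cycles
# should see about 500 edges. A count of 0 means the monitored clock is dead.
healthy = clock_ok(500, 1000, 50e6, 25e6)
stopped = clock_ok(0, 1000, 50e6, 25e6)
```

A check like this cannot complete if the gating clock itself stops, so in hardware one would presumably run it in both directions, with each time base watching the other.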
Lastly (for this column), the act of performing a reset can be an issue for many FPGA circuits. This can be especially true when bypassing the PLL and using a direct oscillator drive to the part. In this scenario, prior to releasing the FPGA to run wild and free, one must give the oscillator time to come up and get on frequency and duty cycle (see my earlier blog -- The Right FPGA Reset for the Right Purpose -- for more details on this.)
Which of the issues presented here do you find yourself facing? And how do you solve them?
Many of you are now probably wondering: if a top-level unit has an MTBF of 4,000 or even 40,000 hours, how are commercial aircraft so safe? The answer is redundancy: multiple independent ways of keeping a system safe, which together add up to a probability of about one failure per billion hours of flight time per system. Take the engines as an example. The solution is to have more than one, plus an auxiliary power unit, a ram air turbine, and batteries to operate the flight controls and instruments in the event of the engines failing. (A large jet at 30,000 ft can glide for about 160-200 miles with no fuel.)
Adam Taylor 2/20/2013 3:48:46 AM User Rank Blogger
Re: brownout, POR, and SEU?
"Other ways to reduce these effects is to leave a bit more thermal and timing margin in the design than one needs."
Derating of components is one of the best ways to achieve a reliable solution. We also do a worst-case analysis, taking into account temperature, aging, and radiation, to ensure that the device will function at the end of life as well as at the beginning.
For vibration, many companies now provide daisy-chain devices, which can be placed on realistic implementations of your boards, in the same positions as the production devices, and then subjected to shock, vibration, and thermal cycling to ensure you do not have any issues. This does add to the cost, though.
There are also some very interesting BGA prognostics available for FPGAs, which can detect failures in FPGA BGAs.
@ab6vu -- The very real issue is that one might put an FPGA design that, with TMR and other mitigation techniques, has a failure rate of once every 400 million hours into a unit, but, as you say, once you add all the EMI, power, and sensor interface circuits, the unit winds up with a failure rate of once every 5,000 hours, requiring replacement 8 times in the life of an aircraft flying 40,000 hours. Getting and screening all the other parts required to make a design function in the real world can be a huge challenge for a project team.
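The arithmetic behind those figures can be sketched as follows. This assumes the usual simplification of constant failure rates, so that the rates (1/MTBF) of parts in series simply add. The function names are hypothetical; the three numbers are the ones quoted in the comment.

```python
def series_mtbf_hours(mtbf_list_hours):
    """MTBF of a series system: failure rates (1/MTBF) add, assuming constant rates."""
    total_rate = sum(1.0 / m for m in mtbf_list_hours)
    return 1.0 / total_rate


def expected_replacements(service_life_hours, unit_mtbf_hours):
    """Average number of failures over the service life at a constant rate."""
    return service_life_hours / unit_mtbf_hours


fpga_mtbf = 400e6        # hours, mitigated FPGA design (from the comment)
unit_mtbf = 5_000        # hours, complete unit (from the comment)
aircraft_life = 40_000   # flight hours (from the comment)

replacements = expected_replacements(aircraft_life, unit_mtbf)

# Share of the unit's failure rate that the FPGA itself contributes.
fpga_share = (1.0 / fpga_mtbf) / (1.0 / unit_mtbf)
```

Under these assumptions the FPGA accounts for roughly one part in 80,000 of the unit's failures, which is why the surrounding circuits dominate the result.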
Xilinx WP395 on the topics of electromigration, hot electron effects, and Negative Bias Temperature Instability --
Today, soft error rates dominate relative to the traditional failure mechanisms, e.g., aging, electromigration, hot electron effects, and Negative-Bias Temperature Instability (NBTI). This is easily demonstrated when comparing the charge stored in a memory cell to the charge deposited by a neutron reaction product or an alpha particle. At 28 nm, the stored charge is less than 1 femto-coulomb (1e-15 coulomb). Neutron reaction products can deposit up to 150 femto-coulomb, so upsets in the cell can be common if nothing is done in the design to protect the cell from upsetting. Starting with Virtex-II FPGAs, which had a soft error rate of 405 FIT/Mb (Mb meaning 10^6 configuration bits), Xilinx embarked on a program to ensure that the device failure rate was kept low even in spite of the industry process moving toward higher failure rates. With Virtex-6 FPGAs, the failure rate is 160 FIT/Mb. The 7 series FPGAs have been tested and show a rate of about 100 FIT/Mb. See Figure 2. For the latest data, see UG116, Device Reliability Report [Ref 4].
For those unfamiliar with the term electromigration: with increasing heat and applied voltage, the doping materials and metal on a die begin to diffuse into adjoining locations on the die. This can alter both the performance and the life of the part. Some parts that have a 7-year service design life at 70C (the commercial maximum temperature) might have only a 2-year life at 125 or 150C due to electromigration, depending on how the part is made. One must ask each vendor this question when doing a high-reliability design.
Also, one has to factor in the flash memory used for the device's firmware storage. Flash parts lose charge at a faster rate at high temperature, resulting in data loss in the flash. Some new flash devices are only good for about a year without refresh at 70C. With Xilinx Automotive parts, it is often best to look for an automotive or other high-temperature flash that can endure the heat. CRC/ECC can also help mitigate this effect.
Parts from other vendors, such as the ProASIC Automotive FPGA, were designed for aerospace/automotive use and have a 10-year flash life at 150C, besting even the most stout automotive microcontrollers, which can have a life of 4-6 years at that temperature. (A car very rarely lasts for even one year's worth of engine running time at the hottest run temperature, for many other reasons.) Large trucks, however, must run for 10X this duration or more, with overhauls.
Basically, if one loses one or a few decoupling capacitors due to vibration, or has a ball come unsoldered or crack loose (if they are small), one can end up with a local power integrity issue that is isolated within the die to a small section of the FPGA: the section whose Vcore or Vio ball was serviced by the HF cap(s), balls, or traces that were lost.
By using TMR, one can have the circuit elements replicated in multiple sections of the die, thus reducing the likelihood of a brief, localized power integrity issue periodically disrupting circuit operation. ECC+TMR can have a similar effect on block RAM and other types of RAM within the device.
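For readers who have not met TMR before, the voting step at its heart is small. Here is a minimal Python sketch of a bitwise 2-of-3 majority voter. It models only the vote, not the placement of the three copies in separate regions of the die, and the function names are hypothetical.

```python
def majority(a, b, c):
    """Bitwise 2-of-3 majority vote across three redundant copies of a word."""
    return (a & b) | (b & c) | (a & c)


def copies_disagree(a, b, c):
    """True if any copy differs, which is worth logging even though the vote masks it."""
    return not (a == b == c)


# One copy is disturbed (for example by a local power glitch); the vote still
# returns the value held by the other two copies.
good = 0b1010
upset = 0b0110
voted = majority(good, good, upset)
```

The vote masks a fault in any single copy, which is why spreading the copies across the die helps against a disturbance confined to one region.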
Examples of locations that can have high vibration include aerospace, automotive, construction and mining equipment, agricultural vehicles, rail, and other types of industrial and transportation equipment.
A very real issue is that some types of equipment are built in small quantities, with even smaller board purchase quantities (to reduce WIP in the factory). As a result, many boards can come through production and test with sub-optimal soldering or board fabrication, especially for small lot sizes.
Also, on very large production runs there can be test escapes, for example when a path on a production board does not always meet timing or some other key parameter and the test vectors do not check that path. In practice, even when one builds 10 million units of something, there are always a few units with "weird defects" that slip through because of testing limitations introduced when test time is reduced after ramp-up.
Another way to reduce these effects is to leave a bit more thermal and timing margin in the design than one needs. In mass production, this has to be weighed against the cost and availability of better speed-grade parts.
Examples of this that I personally know of include a rear-view mirror on an SUV that has to be reset by cycling the ignition when a rare event of this nature causes its display to flash garbage (distracting the driver), and a pickup that periodically has to have the computer firmware reloaded to clear a fault that becomes an issue in the system. Because these faults appeared only as the vehicles aged, and do not exist on other vehicles of the same model, year, and equipment type, the only two causes I can think of are:
1) Vibration-related fault over time (cracking, loose parts)
2) Thermal-cycling-related fault over time (electromigration or cracking)
Yes, in the world of "what can go wrong" soft errors are a possibility.
But, how likely? If we look at Zynq, say the 7020, which is on the Zedboard, if you use all the programmable logic, all the processing (ARM A9 cores, floating point units, all the hard IP), a typical design, with no mitigation, no detection, no correction, is estimated to have a functional failure every ~ 100 years (~1,000 FIT).
Yes, this is for sea level, New York City, so if you are in Denver, it is ~4X worse (25 years, 4,000 FIT).
But, in a complex system, you have sensors, indicators, actuators, power supplies, and so on. What are all those (very real) failure rates? And those failures are not "soft" (they go away and leave no damage; recoverable) but hard, or permanent.
And while I am at it (accounting), if you enable parity interrupts on the processor, ECC on external memories, and SEU Monitor IP in the programmable logic, the 1,000 number descends to perhaps less than 10 FIT (>10,000 years) for an "undetected" failure (all other failures allow the system to restart, correct, and continue, or stop safely).
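As a quick check on the FIT figures in this comment: one FIT is one failure per billion (1e9) device-hours, so the mean time between failures in hours is 1e9 divided by the FIT rate. The sketch below does that conversion in Python; the function name and the 8,760-hour year (no leap years) are assumptions made for illustration.

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hours, ignoring leap years


def fit_to_years(fit):
    """Mean years between failures for a given FIT rate (failures per 1e9 hours)."""
    return 1e9 / fit / HOURS_PER_YEAR


unmitigated_sea_level = fit_to_years(1000)  # about 114 years
unmitigated_denver = fit_to_years(4000)     # about 28.5 years
with_detection = fit_to_years(10)           # about 11,400 years
```

These land close to the rounded figures quoted above: roughly 100 years at 1,000 FIT, about 25 years at 4,000 FIT, and more than 10,000 years at 10 FIT.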
For all the (bad) press SEUs get, it is comforting (to me) that not one customer has had a system fail in the last ten years due to an SEU.
Now, did one fail from a SEU and they didn't know why? Maybe...
Understand that most network equipment providers do log all SEU events in our devices, so they are well aware of upsets that happen (and don't cause any problems).
The point being that Xilinx has dedicated the effort to be sure that soft errors NEVER give rise to a (customer's) customer facing failure since we became aware of the issue in 2002.