William Murray

Brownout, Reset & Oscillator Issues With FPGAs

Adam Taylor
2/25/2013 5:01:02 AM
User Rank
Blogger
Re: Immunity Aware Programming
Interesting link. This area is one of the ones I will be talking about at Design West.

William Murray
2/25/2013 1:17:39 AM
User Rank
Blogger
Re: Immunity Aware Programming
There is a company in the US that has many patents in this area -- they sell boards and software to NASA.

hamster
2/24/2013 10:48:35 PM
User Rank
Blogger
Immunity Aware Programming
I stumbled across this while lost on Wikipedia.

Immunity Aware Programming - how to help address SEUs at the software layer.

http://en.wikipedia.org/wiki/Immunity-aware_programming
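
As a rough illustration of addressing upsets at the software layer, here is a minimal Python sketch of one generic idea: store a value together with its bitwise complement and check the pair on every read. It is not taken from the linked article. The `GuardedValue` class, the `upset_is_detected` helper, and the 16-bit width are all assumptions made for the example, and real firmware would do this in C on the target.

```python
MASK = 0xFFFF  # assumption: 16-bit values, chosen only for this illustration


class GuardedValue:
    """Keeps a value alongside its bitwise complement so corruption can be detected."""

    def __init__(self, value):
        self.write(value)

    def write(self, value):
        self.value = value & MASK
        self.check = ~value & MASK  # complement copy, stored separately

    def read(self):
        # A valid pair XORs to all ones; anything else means a bit flipped.
        if (self.value ^ self.check) != MASK:
            raise ValueError("value and complement disagree: possible upset")
        return self.value


def upset_is_detected(value, flipped_bit):
    """Flip one bit of the stored value and report whether read() notices."""
    guarded = GuardedValue(value)
    guarded.value ^= 1 << flipped_bit  # simulate a single-event upset
    try:
        guarded.read()
    except ValueError:
        return True
    return False
```

This only detects the corruption. What to do next (reload from a safe copy, restart, or stop safely) is a separate design decision.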

William Murray
2/21/2013 8:08:28 AM
User Rank
Blogger
Re: brownout, POR, and SEU?
Many of you are now probably wondering: if a top-level unit has an MTBF of 4,000 or even 40,000 hrs, how are commercial aircraft so safe? The answer is redundant methods of making a system safe, which together bring the probability down to 1 failure per billion hrs of flight time per system. Example: the engines. Solution: more than one engine, plus an Auxiliary Power Unit, a Ram Air Turbine, and batteries to operate the flight controls and instruments in the event of the engines failing. (A large jet at 30,000 ft can glide for about 160-200 miles with no fuel.)
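
To see roughly why redundancy gets from thousands of hours to one in a billion, here is a back-of-envelope Python sketch. It assumes the redundant units fail independently, that the chance of one unit failing in a given hour is about 1/MTBF, and that the system fails only if every unit fails in the same hour. Those are simplifications for illustration only (real safety analysis also covers common-cause failures), and the function names are made up for the example.

```python
def per_hour_failure_probability(mtbf_hours):
    """Approximate chance that one unit fails in a given hour."""
    return 1.0 / mtbf_hours


def units_needed(mtbf_hours, target_per_hour):
    """Smallest number of independent units whose joint failure
    probability per hour is at or below the target."""
    p = per_hour_failure_probability(mtbf_hours)
    count = 1
    combined = p
    while combined > target_per_hour:
        count += 1
        combined *= p  # independent failures multiply
    return count


# Target from the comment above: 1 failure per billion hours.
TARGET = 1e-9
units_for_4000_hr_mtbf = units_needed(4_000, TARGET)
units_for_40000_hr_mtbf = units_needed(40_000, TARGET)
```

Under these assumptions, units with an MTBF of 4,000 hrs need triple redundancy, while units at 40,000 hrs need only two.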

Adam Taylor
2/20/2013 3:48:46 AM
User Rank
Blogger
Re: brownout, POR, and SEU?
"Another way to reduce these effects is to leave a bit more thermal and timing margin in the design than one needs."

Derating of components is one of the best ways to achieve a reliable solution. We also do a worst-case analysis taking into account temperature, aging, and radiation to ensure that the device will function at the end of life as well as at the beginning.

For vibration many companies now provide daisy chain devices, which can be placed on realistic implementations of your boards located in the same positions as the production devices and then subjected to shock, vibration and thermal cycling to ensure you do not have any issues. This does add to the cost though.

There are also some very interesting BGA prognostics available for FPGAs, which can detect failures in an FPGA's BGA.

William Murray
2/16/2013 9:16:08 AM
User Rank
Blogger
Re: brownout, POR, and SEU?
@ab6vu -- The very real issue is that one might put an FPGA design with a failure rate of once every 400 million hrs (with TMR and other mitigation techniques) into a unit, but, as you say, once you add all the EMI, power, and sensor interface circuits, the unit winds up with a failure rate of once every 5,000 hrs, requiring replacement 8 times in the life of an aircraft flying 40,000 hrs. Getting and screening all the other parts required to make a design function in the real world can be a huge challenge for a project team.
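
The arithmetic behind those figures can be sketched in Python as follows. It assumes a simple series model in which any one part failing takes the unit down, so constant failure rates add. The 5,000 hr figure for the support circuitry is a stand-in chosen to match the unit-level number quoted above, not measured data, and the function names are made up for the example.

```python
def series_mtbf(part_mtbfs_hours):
    """MTBF of a unit in which any single part failing fails the unit.
    Failure rates (1/MTBF) add for parts in series."""
    total_rate = sum(1.0 / m for m in part_mtbfs_hours)
    return 1.0 / total_rate


def expected_failures(service_hours, mtbf_hours):
    """Average number of failures over a service life."""
    return service_hours / mtbf_hours


FPGA_MTBF = 400e6        # hrs, from the comment above
SUPPORT_MTBF = 5_000.0   # hrs, assumed for EMI, power, and sensor circuits

unit_mtbf = series_mtbf([FPGA_MTBF, SUPPORT_MTBF])
replacements = expected_failures(40_000, unit_mtbf)
```

The FPGA's contribution barely moves the result: the unit MTBF stays within an hour of 5,000 hrs, which gives about 8 replacements over 40,000 flight hours.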

William Murray
2/16/2013 8:57:54 AM
User Rank
Blogger
Re: brownout, POR, and SEU?
Xilinx WP395 on the topics of electromigration, hot electron effects, and Negative Bias Temperature Instability -- 

Today, relative to the traditional failure mechanisms, soft error rates dominate, e.g., aging, electromigration, hot electron effects, and Negative-Bias Temperature Instability (NBTI). This is easily demonstrated when comparing the charge stored in a memory cell to the charge deposited by a neutron reaction product or an alpha particle. At 28 nm, the stored charge is less than 1 femto-coulomb (1e-15 coulomb). Neutron reaction products can deposit up to 150 femto-coulomb, so upsets in the cell can be common if nothing is done in the design to protect the cell from upsetting. Starting with Virtex-II FPGAs, which had a soft error rate of 405 FIT/Mb (meaning 10^6 configuration bits), Xilinx embarked on a program to ensure that the device failure rate was kept low even in spite of the industry process moving toward higher failure rates. With Virtex-6 FPGAs, the failure rate is 160 FIT/Mb. The 7 series FPGAs have been tested and show a rate of about 100 FIT/Mb. See Figure 2. For the latest data, see UG116, Device Reliability Report [Ref 4].

http://www.xilinx.com/support/documentation/white_papers/wp395-Mitigating-SEUs.pdf
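
To put the FIT/Mb figures in the excerpt in perspective, here is a small Python sketch that scales a per-megabit rate up to a whole device. FIT means failures per billion (1e9) device-hours. The 50 Mb configuration memory size is a hypothetical round number, not a real part, and the result counts raw configuration upsets, which is generally higher than the rate of functional failures.

```python
FIT_HOURS = 1e9  # one FIT is one event per 1e9 device-hours

# Per-megabit soft error rates quoted in the WP395 excerpt above.
FIT_PER_MB = {"Virtex-II": 405, "Virtex-6": 160, "7 series": 100}


def device_fit(fit_per_mb, config_mbits):
    """Raw configuration upset rate of a whole device, in FIT."""
    return fit_per_mb * config_mbits


def hours_between_upsets(fit_per_mb, config_mbits):
    """Mean time between raw configuration upsets, in hours."""
    return FIT_HOURS / device_fit(fit_per_mb, config_mbits)


HYPOTHETICAL_MBITS = 50  # assumed device size, for illustration only
for family, rate in FIT_PER_MB.items():
    hours = hours_between_upsets(rate, HYPOTHETICAL_MBITS)
    print(f"{family}: {device_fit(rate, HYPOTHETICAL_MBITS)} FIT, "
          f"about {hours:,.0f} hrs between upsets")
```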

William Murray
2/15/2013 10:14:26 PM
User Rank
Blogger
Re: brownout, POR, and SEU?
For those unfamiliar with the term, electromigration is where, with increasing heat and applied voltage (and hence current), the doping materials and metal on a die begin to diffuse into adjoining locations on the die. This can alter both the performance and the life of the part. Some parts that have a 7-year service design life at 70C (the commercial maximum temperature) might only have a 2-year life due to electromigration at 125 or 150C, depending on how the part is made. One must ask each vendor this question when doing a high-reliability design.

Also, one has to factor in the flash memory used for the device's firmware storage. Flash parts lose charge at a faster rate at high temperature, resulting in data loss in the flash. Some new flash devices are only good for about a year without refresh at 70C. With Xilinx automotive parts, it is often best to look for an automotive or other high-temperature flash that can endure the heat. CRC/ECC can also help mitigate this effect.

Parts from other vendors, such as the ProASIC automotive FPGAs, were designed for aerospace/automotive use and have a 10-year flash life at 150C, besting even the stoutest automotive microcontrollers, which can have a life of 4-6 years at that temperature. (A car very rarely lasts for even one year's worth of engine running time at the hottest run temperature, for many other reasons, but large trucks must run for 10X this duration or more with overhauls.)

William Murray
2/15/2013 2:41:34 PM
User Rank
Blogger
Re: brownout, POR, and SEU?
Basically, if one loses one or a few decoupling capacitors due to vibration, or has a ball come unsoldered or crack loose (if they are small), one can end up with a local power integrity issue that is isolated within the die to a small section of the FPGA (the section fed by the Vcore or Vio ball served by the HF caps, balls, or traces that were lost).

By using TMR, one can have the circuit elements replicated in multiple sections of the die, thus reducing the likelihood of a brief, localized power integrity issue periodically disrupting circuit operation. ECC+TMR can have a similar effect on block RAM and other types of RAM within the device.
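
For anyone who has not seen TMR spelled out, the voting step can be modelled in a few lines of Python. This is only a behavioural sketch of a bitwise two-out-of-three majority voter. In a real FPGA the three copies and the voter are placed in logic by the tools or by hand, and the `tmr_vote` name is just for the example.

```python
def tmr_vote(a, b, c):
    """Bitwise two-out-of-three majority of three redundant copies.
    Each output bit takes the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)


good = 0b1010

# One copy disturbed (for example by a local power glitch): outvoted.
assert tmr_vote(good, good, good ^ 0b1000) == good

# Two copies disturbed in different bit positions: still corrected,
# because every individual bit still has two good votes.
assert tmr_vote(good ^ 0b0001, good ^ 0b0100, good) == good
```

The voter fails only when the same bit is wrong in two copies at once, which is why it helps to place the copies in different regions of the die.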

Examples of high-vibration environments include aerospace, automotive, construction and mining equipment, agricultural vehicles, rail, and other types of industrial and transportation equipment.

A very real issue is that some types of equipment are built in small quantities, with even smaller board purchase quantities (to reduce WIP in the factory). As a result, many boards can come through production and test with sub-optimal soldering or board fabrication, especially for small lot sizes.

Very large production runs can also have test escapes, for example when the production test vectors do not check timing or some other key parameter on a given path. In practice, even when one builds 10 million units of something, there are always a few units with "weird defects" that slip through because testing was limited by test-time reductions after ramp-up.

Another way to reduce these effects is to leave a bit more thermal and timing margin in the design than one needs. In mass production, this has to be weighed against the cost and availability of better speed-grade parts.

Examples of this that I personally know of include a rear-view mirror on an SUV that has to be reset by cycling the ignition when a rare event of this nature causes its display to flash garbage (distracting the driver), and a pickup that periodically has to have its computer firmware reloaded to clear a fault that becomes an issue in the system. Because these faults only appeared as the vehicles aged, and do not exist on other vehicles of the same model, year, and equipment type, the only two causes I can think of are:

1) A vibration-related fault that develops over time (cracking, parts working loose)

2) A thermal-cycling-related fault that develops over time (electromigration or cracking)

ab6vu
2/15/2013 2:22:54 PM
User Rank
Guru
Re: brownout, POR, and SEU?
William,


Yes, in the world of "what can go wrong" soft errors are a possibility.

But how likely? If we look at Zynq, say the 7020, which is on the ZedBoard, and you use all the programmable logic and all the processing (ARM A9 cores, floating point units, all the hard IP), a typical design with no mitigation, no detection, and no correction is estimated to have a functional failure every ~100 years (~1,000 FIT).

Yes, this is for sea level in New York City, so if you are in Denver, it is ~4X worse (25 years, 4,000 FIT).

But in a complex system, you have sensors, indicators, actuators, power supplies, and so on. What are all those (very real) failure rates? And those failures are not "soft" (go away and leave no damage, recoverable) but hard, or permanent.

And while I am at it (accounting), if you enable parity interrupts on the processor, ECC on external memories, and the SEU Monitor IP in the programmable logic, the 1,000 FIT number descends to perhaps less than 10 FIT (>10,000 years) for an "undetected" failure (all other failures allow the system to restart, correct, and continue, or stop safely).
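
For readers who want to check the FIT-to-years figures quoted above, the conversion is short enough to sketch in Python. FIT is failures per billion (1e9) device-hours, and the sketch assumes a constant failure rate and 8,760 hours per year. The `fit_to_years` name is made up for the example.

```python
HOURS_PER_YEAR = 8_760  # 365 days x 24 hrs


def fit_to_years(fit):
    """Mean time between failures, in years, for a given FIT rate."""
    hours = 1e9 / fit
    return hours / HOURS_PER_YEAR


for label, fit in [("sea level, no mitigation", 1_000),
                   ("Denver, no mitigation", 4_000),
                   ("with detection and correction", 10)]:
    print(f"{label}: {fit} FIT is about {fit_to_years(fit):,.0f} years")
```

The three results are about 114, 29, and 11,400 years, in line with the rounded figures of ~100 years, 25 years, and >10,000 years given in the comment.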

For all the (bad) press SEUs get, it is comforting (to me) that not one customer has had a system fail in the last ten years due to an SEU.


Now, did one fail from an SEU without them knowing why? Maybe...

Understand that most network equipment providers do log all SEU events in our devices, so they are well aware of upsets that happen (and don't cause any problems).

The point being that Xilinx has dedicated the effort to be sure that soft errors NEVER give rise to a (customer's) customer facing failure since we became aware of the issue in 2002.
