New tool for FPGA designers mitigates soft errors within synthesis
Implementing synthesis-based mitigation offers a reduction in the radiation upsets and soft errors that plague shrinking FPGA geometries.
New technology to protect against radiation-induced upsets and soft errors in FPGAs automatically adds triple modular redundancy or safe Finite State Machine (FSM) encoding at synthesis. The approach addresses an inexorable trend associated with Moore’s Law: As silicon geometries continue to shrink, the possibility of soft errors or radiation-induced upsets increases dramatically, not only for aerospace applications but also for FPGA designs destined for ground-level applications.
Radiation effects that can disrupt system operation and lead to circuit failure have long challenged engineers designing FPGAs for aerospace and high-reliability applications. Now, as silicon geometries continue to shrink, leading in turn to increased circuit density, these same radiation effects can potentially plague designs even at ground level. Yet even as the problem looms larger, a new technique to automatically add redundancy and other protection at synthesis stands to make things at least a little bit easier.
The alternative to having the tool add redundant circuitry is manually coding mitigation features in HDL, which does work despite being difficult, error-prone, and hard to verify. In addition, this approach often causes design performance to suffer quite a bit. For example, a simple counter can usually be coded in 13 lines of code. Manually triplicating all the logic takes more than 60 lines of code, including special coding structures that require specific, and most likely scarce, knowledge. Deep knowledge of the synthesis tool is also needed to prevent it from optimizing away the new logic a developer just sweated to add by hand.
By contrast, synthesis-based mitigation – which adds either multi-vendor, multi-mode Triple Modular Redundancy (TMR), or Single Event Upset (SEU)-detect safer Finite State Machines (FSMs) – requires less expertise in both HDL coding and the tool itself. The bottom line: Automatically inserting mitigation technology during RTL logic synthesis reduces the risk of functional errors and excessive impacts on performance and area utilization.
Multi-vendor, multi-mode Triple Modular Redundancy
Synthesis-based multi-vendor, multi-mode TMR comes in three forms. Deciding which one to use depends mostly on the application’s mitigation requirements. In general, it’s safe to declare TMR to be a commonly used and proven methodology for radiation mitigation of Single Event Effects (SEEs) in FPGA devices. The methodology can be very effective but is not perfect, and, as is true for all protection options, its limitations must be weighed against other design considerations, such as where the FPGA will be used. For example, many designers of space systems are at pains to guard against hard errors, including single-event latch-up, single-event gate rupture, single-event burnout, and total ionizing dose. Such errors are often best prevented with antifuse, flash, or hardened SRAM architectures.
In contrast, FPGAs are used in a host of environments that, though less severe, still pose the risk of radiation-induced soft errors. These don’t permanently damage a device, but rather flip a digital “1” or “0” and thus snarl sequential or combinational logic. Synthesis-based mitigation that automatically adds either TMR or safer FSMs is most suited to soft errors.
TMR reduces single points of failure by triplication. Determining which units to triplicate will depend on the location of radiation-related weaknesses in the device. The three different TMR schemes described below protect different structures and guard against varying effects.
The first and most basic method is referred to as local TMR. Protecting sequential elements from SEUs involves tripling and then voting on registers and other sequential elements such as embedded RAM blocks, sequential DSP blocks, and shift-register blocks. The majority voter (Figure 1) votes out a single bit error, which is essentially masked by the other two correct signals.
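The voting step can be illustrated with a small behavioral model. This is a minimal sketch in Python rather than HDL, and it is not the circuitry any synthesis tool emits: the `LocalTmrCounter` class, its 8-bit width, and the `flip_bit` helper are hypothetical names chosen for illustration. It assumes a single shared voter whose output feeds the next-state logic, as in local TMR.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote across three redundant copies."""
    return (a & b) | (b & c) | (a & c)


def flip_bit(value: int, bit: int) -> int:
    """Model an SEU by inverting one bit of a stored value (illustrative helper)."""
    return value ^ (1 << bit)


class LocalTmrCounter:
    """Hypothetical 8-bit counter whose register is triplicated and voted."""

    WIDTH_MASK = 0xFF

    def __init__(self) -> None:
        self.regs = [0, 0, 0]  # three copies of the same register

    def voted(self) -> int:
        return majority(*self.regs)

    def clock(self) -> None:
        # The next value is computed once from the voted output (shared logic)
        # and loaded into all three copies, which also overwrites an upset copy.
        nxt = (self.voted() + 1) & self.WIDTH_MASK
        self.regs = [nxt, nxt, nxt]


counter = LocalTmrCounter()
for _ in range(5):
    counter.clock()

counter.regs[1] = flip_bit(counter.regs[1], 3)  # SEU hits one register copy
value_after_upset = counter.voted()             # the two good copies outvote it
counter.clock()
value_after_clock = counter.voted()
```

In this model the voted value stays at 5 after the upset, and the next clock edge reloads all three copies with 6, so the corrupted copy does not persist.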
Local TMR (Figure 2) only protects the sequential elements from SEUs, and, therefore, leaves the global clock network vulnerable. Additionally, the combinatorial logic, including the voters, remains susceptible to Single Event Transients (SETs). An SET occurs when an SEE causes a glitch to propagate through the circuit.
The second of the three TMR methods, distributed TMR, further protects against SETs in the combinatorial logic and voters (Figure 3). In addition to triplicating the sequential logic, distributed TMR also triplicates the combinatorial logic, including the voters. This eliminates the single points of failure in the combinatorial paths and voters, but leaves the global resources, such as clocks and resets, still vulnerable. Distributed TMR protects against both SEUs and SETs in the combinational logic, and also against configuration SRAM upset for user logic and data routes. SET protection may be a concern at high enough Linear Energy Transfer (LET). An added benefit of distributed TMR is that an SEU in the configuration memory of any of the data cones will be masked by the other two cones and voters.
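The difference between a shared voter and triplicated voters can be sketched with a toy model. The functions `local_stage` and `distributed_stage` and the XOR-based glitch injection are assumptions made for illustration; they model a transient as a bit flip on a voter output that is captured by the downstream registers.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote."""
    return (a & b) | (b & c) | (a & c)


def local_stage(regs, voter_glitch=0):
    """Local TMR: one shared voter, so a transient on its output reaches all copies."""
    voted = majority(*regs) ^ voter_glitch
    return [voted, voted, voted]


def distributed_stage(regs, voter_glitches=(0, 0, 0)):
    """Distributed TMR: one voter per domain, so a transient corrupts one domain only."""
    return [majority(*regs) ^ glitch for glitch in voter_glitches]


regs = [0b0110, 0b0110, 0b0110]

# A transient flips bit 0 on the single shared voter output.
local_next = local_stage(regs, voter_glitch=0b0001)
local_result = majority(*local_next)   # all three copies captured the bad value

# The same transient hits only one of the three voters.
dist_next = distributed_stage(regs, voter_glitches=(0b0001, 0, 0))
dist_result = majority(*dist_next)     # the next stage's vote masks the bad domain
```

With the shared voter the glitch is captured by every copy and survives the next vote; with one voter per domain, only one of three downstream values is wrong and the following vote recovers the original `0b0110`.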
A third TMR method, global TMR (Figure 4), includes all the mitigation of distributed TMR, where registers, combinational logic, and voters are triplicated, plus triplication of global buffers such as clocks and resets. This scheme outright eliminates the single points of failure in the user logic.
Besides adding various forms of TMR, it’s now possible to automatically implement safer FSMs with synthesis. Safer FSMs may be preferred when protecting control logic is deemed to be the most critical mitigation goal. The basic principle behind a safe FSM is preventing the state machine from getting stuck in an unknown state due to an SEU. Consider a simple binary encoded FSM that uses only three states. Most synthesis tools run in default mode will optimize away all states that are unspecified or otherwise considered unreachable. Thus, when the device experiences an SEU and one of the register bits is inverted, the FSM can be put into an undefined, invalid state, which locks up the circuit.
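The lock-up scenario can be reproduced in a few lines. The sketch below is a Python model, not synthesized logic, and the state names are hypothetical. It assumes that the optimized FSM simply holds an undefined code; real don't-care logic could behave differently, but the circuit would still have no defined way back.

```python
# Binary encoding for a three-state FSM; code 0b11 is never assigned.
IDLE, RUN, DONE = 0b00, 0b01, 0b10
NEXT_STATE = {IDLE: RUN, RUN: DONE, DONE: IDLE}


def step_unsafe(state: int) -> int:
    # Assumption: with the unused code optimized away, the FSM holds its value.
    return NEXT_STATE.get(state, state)


def step_safe(state: int) -> int:
    # Safe FSM: every undefined code has a defined transition back to reset.
    return NEXT_STATE.get(state, IDLE)


def run(step, state: int, cycles: int) -> list:
    """Clock the FSM for a number of cycles and record each state reached."""
    trace = []
    for _ in range(cycles):
        state = step(state)
        trace.append(state)
    return trace


upset_state = RUN ^ 0b10                        # SEU flips the upper bit: 0b01 -> 0b11
unsafe_trace = run(step_unsafe, upset_state, 4)  # never leaves 0b11
safe_trace = run(step_safe, upset_state, 4)      # returns to IDLE, then resumes
```

The unsafe model stays at `0b11` indefinitely, while the safe model falls back to `IDLE` on the first clock and continues its normal sequence.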
The simple safe FSM available in most synthesis tools will implement all states, even those not explicitly specified, with defined behavior. However, one issue not accounted for is invalid transitions to otherwise legitimate FSM states, as shown in the simple binary FSM in Figure 5. Allowing these is unacceptable for many applications, hence the simple scheme really only works for one-hot encoding, an area-intensive alternative.
The new synthesis technology can implement “SEU-detect” safer FSMs, where full SEU detection can be done for all encoding schemes, including area-efficient binary and Gray. The scheme uses a Hamming-2 algorithm that can detect any illegal state transition. (Note that an SEU will still interrupt circuit operation, sending the safe FSM to a default/reset state.)
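One way to read “Hamming-2” is that any two valid state codes differ in at least two bits, so a single flipped bit can never turn one valid state into another. The sketch below illustrates that idea under this assumption by appending an even-parity bit to a binary encoding. It is not a description of the tool’s actual algorithm, and all names in it are hypothetical.

```python
from itertools import combinations


def with_parity(code: int) -> int:
    """Append an even-parity bit so all valid codes are at least 2 bits apart."""
    parity = bin(code).count("1") & 1
    return (code << 1) | parity


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two codes differ."""
    return bin(a ^ b).count("1")


STATES = {"IDLE": 0b00, "RUN": 0b01, "DONE": 0b10}   # plain binary codes
ENCODED = {name: with_parity(code) for name, code in STATES.items()}
VALID = set(ENCODED.values())
REGISTER_BITS = 3                                     # 2 state bits + 1 parity bit

min_distance = min(hamming(a, b) for a, b in combinations(VALID, 2))

# Every single-bit upset of a valid code lands on an invalid code.
all_detected = all(
    (code ^ (1 << bit)) not in VALID
    for code in VALID
    for bit in range(REGISTER_BITS)
)

# Plain binary, by contrast, lets some upsets land on another legitimate state.
plain_valid = set(STATES.values())
plain_has_silent_upset = any(
    (code ^ (1 << bit)) in plain_valid for code in plain_valid for bit in range(2)
)


def recover(register: int) -> int:
    """Keep a valid code; send any detected upset to the reset state."""
    return register if register in VALID else ENCODED["IDLE"]


recovered = recover(ENCODED["RUN"] ^ 0b100)           # single-bit upset in RUN
```

In this model every single-bit upset is detected and routed to `IDLE`, whereas the plain binary encoding allows an upset to move silently between legitimate states, which is the invalid-transition problem described above.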
A more automated, verifiable approach for NASA
Mentor Graphics Precision Hi-Rel, which launched in June 2010 as Precision Rad-Tolerant, incorporates mitigation circuitry during RTL logic synthesis. Precision’s Triple Modular Redundancy technology was developed with guidance from NASA, which has decades of experience in dealing with radiation effects on everything from electronics to astronauts. As confirmed by NASA’s own research, most off-the-shelf devices, rad-tolerant or not, need additional protection if the radiation environment is severe enough. For example, devices that protect against SEUs may not protect against SETs at high enough LET levels. And in order to use large-capacity, low-cost, and reprogrammable commercial devices, extra protection typically needs to be provided for both SEU and SET mitigation. While hand-coding TMR is viable, the method is laborious and time-consuming. NASA asked for a more automated, verifiable approach to TMR for SRAM-, flash-, and antifuse-based devices. As desired, the synthesis-based approach not only automates SEU- and SET-level protection for these device architectures, but also provides an equivalence checking flow to make sure the new circuitry preserves original design functionality.
More a flu shot than a once-in-a-lifetime vaccination
Considering risks from radiation is a tricky business, involving lots of guesswork about everything from the cost of device failure to the likelihood that a scarce energetic particle will hit a device in just such a way as to cause a problem. Two things, though, are certain: First, radiation effects will remain a persistent concern, and not just for aerospace and defense applications, but also for other ground-level high-reliability applications. Denser tangles of circuitry in effect produce bigger targets for those stray particles. Second, while inserting protection at the push of a button is a novel approach, it’s not a foolproof and/or final inoculation. Addressing the issue of radiation will almost always involve weighing a combination of technologies that together meet design, budget, and mitigation requirements.
Mentor Graphics, www.mentor.com