Achieving performance targets with multi-die FPGA-based prototyping hardware in the face of design changes
Combining a flexible approach to hardware prototypes with an advanced design planning flow enables design teams to implement large, multi-chip prototyping systems using the latest stacked-die FPGAs.
FPGA capacity has increased by more than 6 times in the past five years, and the latest generation of FPGAs offers design teams unprecedented capacity to integrate very large systems. For example, Xilinx’s latest FPGA devices, which use Stacked Silicon Interconnect (SSI) technology, provide up to an estimated 12 million equivalent ASIC gates and 16 x 12.5 Gbps serial transceivers on a single device. Design teams can boost capacity further by combining multiple FPGAs on a prototyping system, and beyond that by assembling multiple systems into a single aggregate system.
Design teams will need to adapt their methodologies to take advantage of this leap in FPGA capacity, while attaining their timing performance targets.
SSI technology enables vendors to create high-capacity FPGAs by combining multiple dies, sometimes called Super Logic Regions (SLRs), onto a single package substrate. Xilinx’s Virtex-7 devices, for example, incorporate four SLR components (SLR0 to SLR3). Each SLR contains the active circuitry common to most FPGA devices, including look-up tables, registers, I/Os, gigabit transceivers, memory, and other components. The SLRs are arranged in a “stacked” configuration, with SLR0 connected to SLR1, SLR1 connected to SLR2, and SLR2 connected to SLR3. A silicon interposer carries the high-bandwidth, low-latency connections between adjacent SLRs.
Stacked dies provide the building blocks to enable high-capacity FPGAs, but they also add an extra dimension of routing complexity for design teams using such devices to create high-performance ASIC prototypes.
FPGA-based prototyping boards typically use fixed interconnects or PCB traces to route between FPGA devices. The use of fixed interconnects constrains the way design teams can partition their logic when using SSI devices.
The left side of Figure 1 shows an SSI device with four SLRs and two fixed connections: one on bank 40 and a second on bank 12. In this case, the logic is partitioned between two SLRs (SLR1 and SLR2), but the fixed connections dictate that the assigned banks sit on the SLRs at opposite ends of the stack: SLR0 and SLR3. This configuration requires the critical path, shown by the red line, to cross three SLR boundaries. SLR crossings impact performance because they route signals off the die and across the interposer, which incurs extra delay and limits the maximum system frequency. A cross-SLR delay between two SLRs can be up to 13 times greater than the equivalent path delay when the path stays on the same SLR, with the exact timing penalty depending on many factors, including net fanout.
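The crossing count in this example can be reproduced with a minimal sketch. It assumes only what the text states: the SLRs form a linear stack, so a signal moving between two SLRs crosses one boundary per step in the stack. The function name and the list-of-indices path representation are illustrative, not part of any vendor tool.

```python
def slr_crossings(path_slrs):
    """Count SLR boundary crossings for a path that visits the given
    SLR indices in order. The stack is linear (SLR0-SLR1-SLR2-SLR3),
    so moving from SLR a to SLR b crosses abs(b - a) boundaries."""
    return sum(abs(b - a) for a, b in zip(path_slrs, path_slrs[1:]))


# Fixed connections: pin on SLR0, logic on SLR1 and SLR2, pin on SLR3.
fixed_path = [0, 1, 2, 3]

# Flexible connections: pins and logic all placed on the same SLR.
contained_path = [1, 1, 1]

print(slr_crossings(fixed_path))      # 3
print(slr_crossings(contained_path))  # 0
```

The sketch deliberately counts crossings rather than estimating delay, because the actual penalty per crossing depends on factors such as net fanout.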
If the design must run at a certain speed to accommodate a real-world interface and cannot achieve it, the design team has to change the system setup. However, fixed connections limit the available routing options because the design team cannot tailor the connections to the design partition.
The right side of Figure 1 shows what a design team can achieve with a flexible system architecture, facilitated using high-performance cable I/O connectors instead of fixed connections. Simply moving the input from bank 40 to bank 32 enables the design planning tools to contain the timing-critical path within a single SLR, and, as a result, reduce the amount of routing. Using high-speed cables enables the design team to choose whichever pins make most sense in order to minimize the delays through the critical path.
Architectures that offer flexible connectivity using high-speed cables to connect to real-world interfaces also provide better support for wide buses, which may require the use of more I/O banks. Using fixed interconnects leads to implementations that scatter the bus pins across different SLRs, which impairs the performance of the bus. By using cables at the system level, the design team can choose a bank assignment that keeps the bits in a bus together within a single SLR.
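As a rough illustration of keeping a wide bus within one SLR, the sketch below searches for an SLR whose free I/O banks can hold every bit of the bus. The bank numbers, free-pin counts, and the function name are hypothetical; a real bank-to-SLR map comes from the device documentation.

```python
def banks_for_bus(bus_width, free_pins_by_slr):
    """Pick I/O banks for a bus so that all bits land on one SLR.

    free_pins_by_slr maps SLR index -> {bank number: free pin count}.
    Returns (slr, [banks]) using the fewest banks, or None if no single
    SLR has enough free pins for the whole bus.
    """
    best = None
    for slr, banks in sorted(free_pins_by_slr.items()):
        chosen, remaining = [], bus_width
        # Fill the emptiest banks first to keep the bank count low.
        for bank, free in sorted(banks.items(), key=lambda kv: -kv[1]):
            if remaining <= 0:
                break
            if free > 0:
                chosen.append(bank)
                remaining -= free
        if remaining <= 0 and (best is None or len(chosen) < len(best[1])):
            best = (slr, chosen)
    return best


# Hypothetical free-pin inventory for a four-SLR device.
free = {0: {12: 20, 13: 10}, 1: {32: 50, 33: 50}, 2: {36: 40}, 3: {40: 15}}
print(banks_for_bus(64, free))   # (1, [32, 33])
print(banks_for_bus(200, free))  # None
```

With fixed interconnects the bank list is dictated by the board, so a search like this has nothing to choose from; with cables, the result tells the team where to plug the connector in.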
Raw speed: Cable vs. PCB traces
Cables offer much more flexibility than PCB traces on an FPGA-based prototyping system. However, if the aim of using cables is to avoid performance issues caused by crossings between SLRs, it’s fair to ask how the cables themselves affect performance compared with traditional PCB traces.
The good news is that high-quality cables carry no appreciable penalty in either transmission speed or latency compared with dedicated PCB traces.
The Synopsys HAPS-70 interconnect architecture is entirely cable-based. It uses high-performance ribbon coaxial cables, which offer excellent crosstalk and signal-velocity characteristics (Table 1).
Facing up to design changes
The methodology shown in Figure 2 is common to many prototyping projects, whether using a board manufactured by a vendor or a “Build-Your-Own” (BYO) FPGA board-level solution. Design teams need to be able to accommodate design changes, especially when working with very large designs, where managing incremental changes is essential for a productive flow.
Design changes that impact adjacent SLRs can have implications for pin assignment and therefore risk incurring unwanted delays. Designers want to avoid a ripple effect, where a small change to one part of the design has a big impact on the rest of the design. For example, making space for some new logic within an SLR may require the design team to reassign the I/Os when using a fixed architecture, which will have implications for logic in adjacent SLRs. To accommodate the inevitable changes on large designs, design teams need control over how the design is partitioned between SLRs.
When making incremental changes, design teams aim to shorten the iteration loop in Figure 2 to the fewest possible number of steps, in order to reduce the schedule impact. It will be much faster to accommodate changes by tweaking the partition, rather than having to go back to change the RTL.
Methodology and architecture working together
Using a methodology that brings together synthesis and design planning, designers can constrain a critical path within a particular SLR. Once the path is constrained, a flexible architecture accommodates that constraint: the design team simply moves the cable to a bank within that SLR.
Such a methodology also allows designers to “lock” a particular part of a design into a specific SLR, which minimizes the effects of any design changes by containing them within the SLR. This is essential to support productive incremental design. When a design team has invested a lot of time in finalizing the majority of the design, they can keep “work-in-progress” changes to a single SLR, ensuring that further changes don’t disrupt the rest of the FPGA.
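The idea of containing a change within one SLR can be expressed as a simple check. The sketch below compares two instance-to-SLR partitions and reports whether a change disturbs any SLR that the team has locked. The instance names and both function names are hypothetical, and this is not how any particular design planning tool stores its partitions.

```python
def disturbed_slrs(before, after):
    """Return the set of SLRs whose instance membership differs
    between two partitions (dicts of instance name -> SLR index)."""
    touched = set()
    for inst in before.keys() | after.keys():
        old, new = before.get(inst), after.get(inst)
        if old != new:
            # An added, removed, or moved instance touches every SLR involved.
            touched.update(s for s in (old, new) if s is not None)
    return touched


def respects_locks(before, after, locked):
    """True if the change leaves every locked SLR untouched."""
    return disturbed_slrs(before, after).isdisjoint(locked)


before = {"cpu": 0, "dsp": 1, "ddr_ctrl": 2, "usb": 3}
locked = {0, 1, 2}  # finalized SLRs; SLR3 holds the work in progress

# Adding new logic to SLR3 stays inside the work-in-progress region.
print(respects_locks(before, {**before, "dma": 3}, locked))  # True

# Moving "dsp" from SLR1 to SLR2 ripples into two locked SLRs.
print(respects_locks(before, {**before, "dsp": 2}, locked))  # False
```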
Synthesis and floorplanning
For a prototyping environment that supports a flexible architecture with incremental changes, the Synopsys FPGA solution, Synplify Premier Design Planner (DP), integrates with the place-and-route technology in Xilinx’s Vivado Design Suite (Figure 3).
The FPGA place-and-route environment will automatically partition a design across the FPGA dies for optimal results. Designers can, if they wish, pass partition constraints to place and route using the XDC equivalent of “area_groups”.
Design teams that wish to use an incremental flow in order to lock down logic or a critical path to one or more dies can use the Synplify Premier DP synthesis partitioner within the synthesis tools.
Using synthesis with design planning enables design teams to assign critical paths to a specific region or die. By assigning the instances in the path to a specific die using a design-planning tool, design teams can avoid the timing failures that occur when extra inter-die delay is introduced. Design planning will automatically generate physical constraints that instruct the place-and-route software to keep the path on the same die.
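To give a feel for what such physical constraints might look like, the sketch below emits Vivado pblock commands that group the instances of a critical path and tie them to one SLR. It is an illustration only, not the output of Synplify Premier DP: the helper name, pblock name, and cell paths are hypothetical, and whether an SLR name is accepted as a pblock range depends on the device family and tool version (some flows specify clock-region ranges instead).

```python
def slr_pblock_xdc(pblock, slr, cells):
    """Return XDC text that places the given cells in a pblock
    covering one SLR. Uses the Vivado commands create_pblock,
    resize_pblock, and add_cells_to_pblock; the SLR-name range
    is an assumption about the target device and tool version."""
    cell_list = " ".join(cells)
    lines = [
        f"create_pblock {pblock}",
        f"resize_pblock [get_pblocks {pblock}] -add {{SLR{slr}}}",
        f"add_cells_to_pblock [get_pblocks {pblock}] [get_cells {{{cell_list}}}]",
    ]
    return "\n".join(lines)


# Hypothetical critical-path instances pinned to SLR1.
print(slr_pblock_xdc("pb_crit", 1, ["u_core/u_mac", "u_core/u_fifo"]))
```

Generating the constraints from the partition, rather than writing them by hand, keeps the constraint file in step with the design plan as instances move between dies.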
The synthesis tool passes constraints to the FPGA place-and-route environment to preserve placement, ensure relative placement or groupings to preserve quality of results, and assign logic to a specific die for better quality of results or design preservation. Design teams can leave partitioning to the tools or create user-defined regions, each of which must be fully contained within a single SLR and cannot cross an SLR boundary. Designers can easily assign instances (modules or components) to an SLR by selecting the target instance in an RTL view.
Today’s multi-die FPGAs give designers access to unprecedented numbers of equivalent ASIC gates. Taking advantage of such large FPGA capacity requires an incremental design approach that accommodates design changes while still meeting timing and without undue impact on the project schedule.
Use of a flexible FPGA-based prototyping architecture allows design teams to assign I/Os to suit their partitioning. Having a flexible architecture enables designers to apply physical constraints to force timing-critical paths onto one die or a specific region of a die. The design planning tools can forward annotate constraints to the place-and-route software. Whether performed automatically or manually, being able to control the assignment of logic and paths to SLRs shortens the iteration loop, which allows design teams to adopt incremental design while continuing to meet timing constraints.