New developments in DSP design
Modern FPGAs feature advanced DSP blocks and resources that support very high data-processing throughput. The challenge is to develop a high-level design flow that allows designers to
unleash the DSP capabilities of FPGAs. Several recent developments in DSP tools have enabled the FPGA industry to excel in design productivity, reusability, fixed- and floating-point circuitry, and sheer computational throughput.
Many applications, such as radar, wireless, and broadcast video systems, demand more DSP performance than DSP processors can deliver. To serve these, an alternative technology has become prevalent: the FPGA. FPGAs grew out of their simpler CPLD cousins and quickly expanded in size and I/O capabilities. Once multipliers were introduced into the FPGA logic fabric, FPGAs became powerful DSP platforms, soon containing hundreds and then thousands of multipliers. The multipliers morphed into DSP blocks containing special circuitry to support FIR filters and FFTs. But the key advantage was the sheer degree of parallelism offered by FPGAs, which DSP processors could not replicate; a high-end FPGA can deliver billions of multiply-accumulates per second, far beyond what a DSP processor can sustain.
FPGAs are programmable hardware and, with some exceptions, tend to have fairly static hardware configurations within any given design. The data processing resembles a manufacturing assembly line, where each programmable circuit is configured to perform a specific function. This leverages the massive parallelism of the FPGA. It also leads to a different sort of design flow using hardware description languages, Verilog and VHDL, which describe the operation of the hardware at a fairly detailed level and accommodate the much higher degree of freedom offered by FPGA designs. In contrast to processors, the FPGA designer can use custom word widths tailored to the needs of the algorithm at each stage of processing. FPGAs also offer thousands of memory blocks, allowing easy storage of interim data without the bottleneck of the single (or few) memory buses and interfaces used in processor architectures.
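The per-stage custom word widths described above can be illustrated with a small fixed-point model. This is a hypothetical sketch (the Q-formats and values chosen are arbitrary, not from any real design): each datapath stage quantizes its result to only the bits the algorithm needs, unlike a processor's fixed 16/32-bit registers.

```python
def quantize(value, int_bits, frac_bits):
    """Round a real value to a signed fixed-point Q(int_bits, frac_bits)
    format and saturate at the representable range."""
    scale = 1 << frac_bits
    q = round(value * scale)
    lo = -(1 << (int_bits + frac_bits - 1))
    hi = (1 << (int_bits + frac_bits - 1)) - 1
    return max(lo, min(hi, q)) / scale

# A toy two-stage datapath: a product kept at 18 bits (matching a typical
# DSP-block input width), then an accumulate widened to 32 bits to absorb
# bit growth. The widths are illustrative assumptions only.
x, coeff = 0.7071, 0.25
product = quantize(x * coeff, 2, 16)  # stage 1: narrow 18-bit product
acc = quantize(product + 1.5, 8, 24)  # stage 2: wider 32-bit accumulator
```

In an HDL or Simulink flow, each of these widths would be a per-signal design decision rather than a function argument.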
However, this degree of freedom also tends to make FPGAs more difficult to program, akin to programming DSPs in the assembly language era. Efforts have been made to develop C compilers for FPGAs, but they have had limited adoption to date. The developers of these “C-to-gates” tools face the challenge of describing a parallel hardware architecture using a software language aimed at serial processing. When the language is modified to let the designer describe this parallelism, the changes tend to work against the inherent ease of use and familiarity of the programming language. Another challenge, similar to that experienced with DSP processor C compilers in the past, is achieving the same level of FPGA efficiency and performance that designers reach using Verilog or VHDL. Achieving that performance with a high-level design flow is much harder than with DSP processors, again due to the many degrees of freedom offered by the FPGA architecture. Several recent developments in DSP tools address these challenges of FPGA design, including model-based design and matrix processing.
An alternative FPGA design flow, called model-based design, utilizes MathWorks’ Simulink environment. The appeal of Simulink is that it understands the concept of clock cycles and allows the description of hardware architectures, yet provides the fully integrated MATLAB and Simulink environment for modeling and simulating both the test bench and the system under development. Product offerings from Altera, MathWorks, and Xilinx allow FPGA development in an integrated tool flow where the designer operates largely within the MathWorks environment. The Altera and Xilinx Simulink-based tool flows naturally target only their respective FPGA products, while Simulink HDL Coder allows the designer to target Altera or Xilinx FPGAs, or generic hardware for ASIC designs.
The various Simulink design flows tend to mimic the same techniques used in Verilog and VHDL. The designer specifies fixed-point word widths for each operation, determines the number of register delays or pipeline stages at each point in the design, and utilizes local memory structures of various widths and depths. The tool output then passes to the FPGA vendor’s synthesis and place-and-route tools, after which the circuit’s Fmax and logic resource utilization can be determined.
Another alternative introduces additional capability: rather than describing the design at the same level as an HDL, the designer describes what they want at an algorithmic level, and the tool automates the many levels of optimization required to achieve high performance in an FPGA. Take a simple adder circuit. Normally, the designer must determine how far to break apart the adder chain and how many levels of pipeline registers to insert. Within the Simulink environment, the designer instead describes an adder of any length (number of bits), the FPGA family used, and the Fmax at which the design is to run. Because the tool has access to all device timing details, it can decide how far to break the adder chain and where to insert pipeline register stages to meet the designer’s Fmax requirement. This process is repeated throughout the entire design, freeing the designer from the tedious work of optimizing for the latest device features or pipelining to meet the required clock rate (known as timing closure). It works well both for small designs, such as individual FIR filters that can run at over 450 MHz, and for large designs, which typically operate in excess of 350 MHz, on par with skilled HDL designer results. The approach also brings a high degree of design portability and even “future proofing”: because the design itself is not optimized for a particular FPGA, the tools can target it at low-cost, mid-range, or high-end FPGAs (with allowances for reasonable Fmax), and even at future device families and features that do not exist today.
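The adder-chain example can be sketched in software. The segmentation below is illustrative only, not any vendor’s actual algorithm: a 64-bit addition is resolved 16 bits at a time, with the carry passed between segments the way a tool would register it between pipeline stages to close timing.

```python
def segmented_add(a, b, width=64, seg=16):
    """Add two unsigned integers by short segments, mimicking an adder
    chain broken into pipeline stages: each loop iteration stands for one
    clock cycle that resolves only a seg-bit carry chain."""
    result, carry = 0, 0
    mask = (1 << seg) - 1
    for stage in range(width // seg):
        sa = (a >> (stage * seg)) & mask
        sb = (b >> (stage * seg)) & mask
        s = sa + sb + carry
        result |= (s & mask) << (stage * seg)  # registered partial sum
        carry = s >> seg                       # registered carry out
    return result & ((1 << width) - 1)
```

The software loop is sequential, but in hardware the stages overlap: after the pipeline fills, one full-width sum completes every clock, at the cost of a few cycles of latency.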
Multiple channels, folding, and polyphasing
Another development is the ability to process multiple channels through any algorithmic datapath in a parameterizable way. The tool automatically schedules the different channels to share resources and generates the necessary control logic. In fact, multi-channel operation often helps the tool: the interleaved register delays needed to accommodate the multiple channels provide further opportunities for pipeline registering, particularly in recursive designs. Folding, the reuse of the same resource for multiple operations within a single data stream, can also be performed as needed to reduce logic and multiplier resources. The contrasting technique, known as polyphasing, can be applied automatically, allowing the FPGA to interface to data converters operating at multiple gigasamples per second (GSps).
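The polyphase idea rests on a standard DSP identity, independent of any particular tool: filtering at the full rate and then keeping every Mth output equals summing M sub-filters that each run at the lower rate, which is how an FPGA clocked well below a converter’s sample rate can keep pace with it. A plain-Python sketch with arbitrary coefficients and test data:

```python
def fir(x, h):
    """Direct-form FIR; samples outside x are treated as zero."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if 0 <= n - k < len(x))
            for n in range(len(x))]

M = 2
h = [1.0, 0.5, 0.25, 0.125]            # arbitrary prototype filter
x = [float(i % 5) for i in range(16)]  # arbitrary test signal

direct = fir(x, h)[::M]                # full-rate filter, then decimate

ylen = len(direct)
poly = [0.0] * ylen
for p in range(M):                     # one branch per phase
    hp = h[p::M]                       # sub-filter: every Mth coefficient
    up = [x[n * M - p] if 0 <= n * M - p < len(x) else 0.0
          for n in range(ylen)]        # sub-signal at the low rate
    branch = fir(up, hp)
    poly = [a + b for a, b in zip(poly, branch)]
# poly now matches direct sample-for-sample
```

For interpolation the same trick runs in reverse: M low-rate branches produce the M output phases of a high-rate stream in parallel.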
Floating point circuits and matrix processing
The ability to generate floating point circuits is one of the most significant advances now available to FPGA designers. Traditionally, FPGAs have not been used for computing algorithms and linear algebra applications, as the available tools and IP were inefficient and offered too little performance; the dynamic range and numerical fidelity of floating point were simply not available to FPGA designers. This advance in FPGA tools changes that paradigm by offering high-performance floating-point processing across a design in the largest FPGAs.
One of the biggest applications for floating point in FPGAs is matrix processing. Support for vector data types is available, allowing the designer to manipulate and perform operations upon entire vectors rather than describing each element in the vector. For example, a large floating point vector dot product, which is the fundamental building block of many matrix algorithms, can be described simply in a couple of multiply and add blocks.
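The dot product mentioned above reduces to exactly the structure the text describes: one multiply feeding one accumulator. A pure-Python illustration (a hardware implementation would of course pipeline or tree-reduce the accumulation rather than iterate):

```python
def dot(a, b):
    """Floating point vector dot product: the multiply block and the add
    (accumulate) block written out element by element."""
    acc = 0.0
    for ai, bi in zip(a, b):  # multiply block feeding the accumulator
        acc += ai * bi
    return acc
```

Every matrix multiply, Cholesky factorization, or QR decomposition spends most of its arithmetic inside this one primitive, which is why an efficient vector dot product block is the key to FPGA matrix performance.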
DSP Builder from Altera Corporation allows designers to build their own matrix processing cores, which can support various matrix sizes at runtime. BDTI, a DSP benchmarking company, recently evaluated this tool for high-performance floating point designs. The Cholesky decomposition, including forward and backward substitution processing, was chosen to demonstrate both design methodology and performance. It is commonly used for matrix inversion, particularly for the large class of problems involving covariance matrices.
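For readers unfamiliar with the benchmark’s math, the algorithm itself (not DSP Builder output) can be sketched in plain Python: factor a symmetric positive-definite A as A = L·Lᵀ, then solve A·x = b by forward substitution (L·y = b) and backward substitution (Lᵀ·x = y), avoiding an explicit inverse.

```python
import math

def cholesky(A):
    """Lower-triangular L with A = L * L^T, for symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)   # diagonal element
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]  # off-diagonal element
    return L

def solve(A, b):
    """Solve A x = b via Cholesky plus forward/backward substitution."""
    L = cholesky(A)
    n = len(b)
    y = [0.0] * n
    for i in range(n):            # forward substitution: L y = b
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):  # backward substitution: L^T x = y
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x
```

Note that the inner loops are again dot products, which is what makes the algorithm map well onto the vector multiply-and-add blocks described earlier.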
Further benchmarks have been developed using 28 nm FPGAs, showing that six matrix inversion cores, each performing 3,000 matrix inversions (technically Cholesky decompositions) per second upon a [240 x 240] matrix, can be implemented within a single large FPGA. This is an aggregate of 18,000 [240 x 240] matrices per second per FPGA. For smaller matrices, such as [20 x 20], a single FPGA can process up to 18,000,000 [20 x 20] matrices per second.
A designer might not need that level of performance, but, for instance, a radar application might need a throughput of 6,000 [240 x 240] matrix inversions per second, which would consume about one-third of a large FPGA’s resources. The remainder of the FPGA could be used for other functions, such as pulse compression, Doppler processing, and covariance matrix estimation. All of these can be implemented in the same FPGA, using a mixture of fixed- and floating-point processing as required. This alleviates many of the current data flow bottlenecks in complex systems such as modern radar.
On the path to a better DSP tool future
FPGA designs now support both floating-point and fixed-point circuits, so designers no longer have to choose between high-performance fixed-point DSP chip families and lower-performance floating-point DSP families as they did in the past. DSP tools have also come a long way and will continue to evolve. Eventually, many expect that C or other software-based descriptions of hardware architectures such as FPGAs will become competitive and ubiquitous. But today, recent design tool developments utilizing Simulink have set the high water mark for design productivity, reusability, fixed- and floating-point circuitry, and sheer computational throughput.
Altera Corporation firstname.lastname@example.org www.altera.com