How to accelerate a simple, 16-bit, 12-tap DSP FIR filter by compiling it into FPGA hardware

Software developers sometimes need to move C code to parallel FPGA hardware.

Mei and Brian begin this tutorial by describing the need for multiple parallel processes and what is involved in an FPGA offering many-more-than-dual-core capacity. Next, the authors delineate the steps needed to move a simple Finite Impulse Response (FIR) filter from C to FPGA hardware.

High-performance DSP applications, including video and audio processing, are not so much clock constrained as stymied by a lack of parallel processing. Using multiple parallel processes in hardware at a reduced clock rate offers two major advantages: one, reduced clock speeds cut power consumption, and two, parallelizing the application can dramatically increase throughput. However, realizing these benefits poses a challenge for DSP application developers, who need to rewrite existing DSP algorithms for parallel implementation.

FPGAs take well to parallel implementation. Traditionally, these devices have been programmed using lower-level methods, including hardware description languages (HDLs), most notably VHDL and Verilog. More recently, C-to-HDL compilation has become a core capability for many developers. HDLs are the input format for the synthesis and place-and-route tools that create the configuration that runs on the FPGA. Effectively programming hardware requires software developers to understand how their design maps to hardware. Software developers should also know how to use the optimizing compiler, as well as the underlying hardware resources, to cut time to market. This is where C-to-HDL tools can help. This article is intended for software developers who need to move C code to parallel FPGA hardware.

Parallelism

Microprocessors achieve performance through ever-increasing clock speeds and, more recently, a limited amount of instruction pipelining along with dual or quad cores. FPGAs have long offered capacity for far more parallel processing units than that. It’s the ability to use optimized parallel processing logic for multiple streaming processes that enables this capacity. The downside is that the path from microprocessor-oriented C to optimized RTL placed and routed for an FPGA involves multiple steps through multiple tools. For the C programmer, this means C must be compiled to HDL, the HDL must be synthesized to lower-level logic, and the low-level logic must be placed and routed in the FPGA. Optimizing an application therefore requires some understanding of FPGA architecture and machine-level optimization, along with some level of intervention on the part of the programmer. To explain, we will walk through the steps of moving a simple FIR filter from C to FPGA hardware (Figure 1).

Figure 1: The producer, FIR, and consumer processes are connected together.

Project description – 16-bit, 12-tap FIR filter

The specific process being compiled to hardware is represented by the following function:

void fir(co_stream filter_in, co_stream filter_out)

This C-language subroutine represents a single process, defined as a module of code, expressed as a void subroutine, that describes a hardware or software component. If you are an experienced hardware designer, you can simply think of a process as being analogous to a VHDL entity, or to a Verilog module.

If you are a software programmer, you can think of the process as being a subroutine that will loop forever, in a separate thread of execution from other processes. Our fir function has no return value and has two interfaces that have been defined using co_stream data types. These two streams are used to:

  • Read in a set of 12 filter coefficients, and then a stream of sample data, on the filter_in stream.
  • Write out the filter values on the filter_out stream.

If you are a hardware designer, you can think of a co_stream as being a representation of a first-in, first-out (FIFO) buffer. If you are a software programmer, you can think of a co_stream as being roughly analogous to a FILE type in C. Rather than reading and writing files on a disk, however, we will use the co_stream type to transfer data between multiple parallel processes.
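To make the FIFO analogy concrete, here is a plain-C model of a small stream buffer. This is our own illustrative sketch, not the Impulse C API: the type and function names (`fifo_t`, `fifo_write`, `fifo_read`) are hypothetical, but the block-on-full/block-on-empty behavior mirrors how a hardware FIFO mediates between two processes.

```c
#include <stdint.h>

/* Hypothetical FIFO model of a stream; these names are ours,
   not part of the Impulse C API. */
#define FIFO_DEPTH 16

typedef struct {
    int16_t data[FIFO_DEPTH];
    int head, tail, count;
} fifo_t;

static int fifo_write(fifo_t *f, int16_t v) {
    if (f->count == FIFO_DEPTH) return 0;  /* full: a writer process would stall here */
    f->data[f->tail] = v;
    f->tail = (f->tail + 1) % FIFO_DEPTH;
    f->count++;
    return 1;
}

static int fifo_read(fifo_t *f, int16_t *v) {
    if (f->count == 0) return 0;           /* empty: a reader process would stall here */
    *v = f->data[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return 1;
}
```

Values emerge in the order they were written, which is exactly the guarantee a co_stream provides between a producer and a consumer process.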

Figure 2: The source code showing the algorithm and its nested loops

Notice in Figure 2 that the subroutine includes an outer do-while(1) loop, indicating that the subroutine will execute endlessly. This subroutine describes a persistent, always-running process.

Within this loop, observe how the co_stream_open, co_stream_read, co_stream_write, and co_stream_close functions are used to manage the movement of data through the filter. These functions provide you, the C programmer, with a concise and platform-portable way to express streaming data. Impulse C supports a number of similar functions that can be used to describe the movement and management of process-to-process data.

The fir function begins by reading 12 coefficients from the filter_in stream and storing the resulting data into a local array (coef). The function then reads and begins processing the data inputs, one sample at a time. Results of filtering are written to the output stream filter_out.

In the algorithm description (Figure 3), you will find a while loop that describes the filtering operation, an iterative multiply-accumulate operation.

Figure 3: The filtering process is an iterative multiply-accumulate operation.

This loop includes two inner loops and a simple set of calculations to iterate over every 12-sample segment of the incoming data to perform the filtering operation. In each iteration of the while loop, filtered data is written to the output stream using co_stream_write.

The above loop illustrates a very common pattern for describing filters using Impulse C: a C-language loop iterates on the incoming data, some processing occurs on that data, and results are written to the outputs using streaming (as shown here) or other methods.
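Stripped of the streaming calls, the per-sample computation inside that loop is a standard multiply-accumulate. The sketch below is our own plain-C illustration of one output sample (the function name is hypothetical, not taken from the source code):

```c
#include <stdint.h>

#define TAP_COUNT 12

/* One output sample of a 12-tap FIR filter: a multiply-accumulate
   over the current 12-sample window of input data. */
static int32_t fir_sample(const int16_t coef[TAP_COUNT],
                          const int16_t window[TAP_COUNT]) {
    int32_t acc = 0;
    for (int i = 0; i < TAP_COUNT; i++)
        acc += (int32_t)coef[i] * (int32_t)window[i];  /* multiply-accumulate */
    return acc;
}
```

Note the 32-bit accumulator: the product of two 16-bit values, summed twelve times, needs the extra headroom.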

You probably noticed the use of three pragmas in the code (PIPELINE, UNROLL and SET StageDelay). These pragmas are the subject of a more detailed tutorial on optimization techniques, but to summarize (in the order these pragmas are used in the code):

  • The CO PIPELINE pragma indicates that we want the while loop to be implemented as a hardware pipeline for high throughput. If the hardware compiler is able to generate a perfect pipeline with a rate of 1, then we can expect this loop to iterate in hardware as fast as one sample per clock cycle, even if the computations within the loop require more than one cycle.
  • The CO SET pragma allows us to specify certain characteristics for the generated hardware. In this case we are setting a StageDelay constraint that instructs the optimizer to limit the combinational logic depth of any pipeline stage. If any generated pipeline stage exceeds this constraint, the optimizer will add additional pipeline stages to better balance the pipeline and allow the hardware to operate at a high clock rate.
  • The UNROLL pragma instructs the optimizer to remove (by unrolling) a loop so that all iterations of that loop operate in parallel. Unrolling requires that the loop obey certain rules (such as having a fixed number of loop iterations) but can have dramatic impacts on performance, at the expense of additional FPGA logic being generated.
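To see what UNROLL changes, compare a rolled tap loop with its fully unrolled equivalent. The unrolled form, written out by hand below purely as an illustration (not generated tool output), mirrors the twelve parallel multipliers the optimizer can build in hardware; both forms compute the same result.

```c
#include <stdint.h>

#define TAP_COUNT 12

/* Rolled form: one multiply-accumulate per loop iteration. */
static int32_t fir_rolled(const int16_t c[TAP_COUNT],
                          const int16_t x[TAP_COUNT]) {
    int32_t acc = 0;
    for (int i = 0; i < TAP_COUNT; i++)
        acc += (int32_t)c[i] * x[i];
    return acc;
}

/* Unrolled form: all twelve products written out explicitly,
   modeling the parallel multipliers UNROLL asks the compiler to build. */
static int32_t fir_unrolled(const int16_t c[TAP_COUNT],
                            const int16_t x[TAP_COUNT]) {
    return (int32_t)c[0]*x[0]  + (int32_t)c[1]*x[1]  + (int32_t)c[2]*x[2]
         + (int32_t)c[3]*x[3]  + (int32_t)c[4]*x[4]  + (int32_t)c[5]*x[5]
         + (int32_t)c[6]*x[6]  + (int32_t)c[7]*x[7]  + (int32_t)c[8]*x[8]
         + (int32_t)c[9]*x[9]  + (int32_t)c[10]*x[10] + (int32_t)c[11]*x[11];
}
```

In software the unrolled form merely trades code size for fewer branches; in hardware it lets all twelve multiplies occur in the same clock cycle, at the cost of twelve physical multipliers.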

The FIR filter configuration subroutine

The fir subroutine described above represents the algorithm to be implemented as hardware in the FPGA. To complete the application, however, we need to include one additional routine that describes the I/O connections and other compile-time characteristics for this application. This configuration routine serves three important purposes, allowing us to:

  1. Define I/O characteristics such as FIFO depths and the sizes of shared memories.
  2. Instantiate and interconnect one or more copies of our Impulse C process.
  3. Optionally assign physical, chip-level names and/or locations to specific I/O ports.

This example includes one hardware process (the FIR filter) and also includes the two testing routines described earlier, producer and consumer. Our configuration routine (Figure 4) therefore includes statements that describe how the producer, fir, and consumer processes connect together.

Figure 4: A complete configuration routine

To summarize, the fir subroutine describes the algorithm to be generated as FPGA hardware, while the producer and consumer subroutines (described elsewhere, in fir_sw.c) are used for testing purposes. The configuration routine is used to describe how these three processes communicate, and to describe other characteristics of the process I/O.
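As a purely software model of that communication (plain C arrays and function calls standing in for the streams the configuration routine would create, with names of our own invention), the data path through the fir process looks like this:

```c
#include <stdint.h>
#include <stddef.h>

#define TAP_COUNT 12

/* Plain-C stand-in for the fir process: the producer's coefficients and
   samples arrive as arrays, and the consumer's results land in `out`.
   In the real design, co_streams carry these values between processes. */
static void fir_process(const int16_t coef[TAP_COUNT],
                        const int16_t *samples, size_t n_samples,
                        int32_t *out) {
    /* For each output, multiply-accumulate over a 12-sample window. */
    for (size_t k = 0; k + TAP_COUNT <= n_samples; k++) {
        int32_t acc = 0;
        for (int i = 0; i < TAP_COUNT; i++)
            acc += (int32_t)coef[i] * samples[k + i];
        out[k] = acc;
    }
}
```

In the real application the producer first streams the 12 coefficients and then the samples over a single filter_in stream; here the two phases are modeled as separate arguments for clarity.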

Compiling the C code to create HDL

The steps in creating FPGA hardware and related files from the C code begin with selecting a target platform. The platform may be an individual FPGA or an FPGA development board. Device selection is typically via a pull-down menu that lists the devices supported by the C-to-FPGA compiler. Alternatively, you can typically select “generic VHDL” to compile to VHDL without conforming to a specific device architecture. Most HDL generators also offer Verilog. Generation is automatic, creating a project file that includes estimates of loop latencies, pipeline rates, and required hardware operations, as shown in Figure 5. These data help the engineer iteratively refactor and improve algorithms before going through a potentially long process of FPGA synthesis and place and route.

Figure 5: The project file includes estimates of loop latencies, pipeline rates, and required hardware operations.

The C-to-FPGA tool will optimize and process your work, generating HDL files for the FIR filter. Those results can be examined using code tools and graphical tools as part of the iterative code improvement process. This HDL includes the state machines and other logic that implement the parallelized and pipelined operations described in C. The example in Figure 5 includes a pipelined inner code loop with an unrolled loop, which results in a substantial amount of HDL code being generated.

When you examine this generated HDL code, keep in mind that the number of lines of HDL code is not directly related to the size of the FPGA resources. In this case, because of the loop unrolling and pipelining, a large number of intermediate signals are generated by the compiler. These intermediate signals are optimized away by the FPGA synthesis tool, resulting in far less logic than the lines of HDL code might indicate. Note: the amount of FPGA resources and final performance for such a filter will depend on the selected FPGA platform, on the synthesis settings, and on what other hardware elements are being combined with this filter in the complete system. In the case of this algorithm (a 16-bit, fully pipelined and parallelized 12-tap filter), you can expect to use approximately 12 DSP slices in a typical FPGA device.

Graphical tools make it possible to view an expanded form (Figure 6) of the original source code and to view a graph of the unrolled and pipelined inner loop.

Figure 6: Expanded form of the original source code.

From this point you now have a reusable, tested FIR filter that can be shared with other teams and reused in larger-scale system-on-chip integrations. Interestingly, the usability of the generated HDL gives you the option of segregating portions of the overall design for further hand refinement. We are living at a time when extensive hand coding can still beat machine optimization. Often teams will use C to create the entire design, then break out portions to return to HDL. Accordingly, another option is to use the flow to generate HDL that exports to IEEE-compliant HDL simulators such as ModelSim. This option results in an automatically generated test bench created as part of the design process. We firmly believe that “HDL ain’t broke” for glue logic, address control, state machines, and other Boolean operations for which HDLs work well. C is just another item in the tool kit that enables software and hardware developers to better collaborate on co-optimized designs.

Brian Durwood founded Impulse Accelerated Technologies with David Pellerin in 2002 to provide C-to-FPGA-based tools, training, and IP. He was a VP at Tektronix’s high-frequency MCM division in the ’90s, an original member of the ABEL team in the ’80s, and is a business graduate of Brown University. He also received an MBA from the University of Pennsylvania’s Wharton School of Finance. He can be contacted at brian.durwood@ImpulseC.com.

Mei Xu is a senior application engineer at Impulse Accelerated Technologies, where she works on embedded system design and FPGA programming. She received her master’s degree in Electrical Engineering and Computer Sciences from the University of California, Berkeley. She can be contacted at mei.xu@ImpulseC.com.

Additional reading and links

Pellerin, David, and Scott Thibault. Practical FPGA Programming in C. Upper Saddle River, NJ: Prentice Hall, 2005.

Gokhale, Maya B., and Paul S. Graham. Reconfigurable Computing: Accelerating Computation with Field-Programmable Gate Arrays. Dordrecht, The Netherlands: Springer, 2005.

Readers may also contact info@ImpulseC.com to get a link to this and other tutorials.

Acknowledgments

The authors wish to thank David Pellerin cofounder, Impulse Accelerated Technologies, and Ralph Bodenner, VP Engineering, Impulse Accelerated Technologies, for their contributions to this article.

Impulse Accelerated Technologies www.impulsec.com mei.xu@impulsec.com brian.durwood@impulsec.com