"Projecting" images in radar and medical applications
Examples of computationally intensive systems being shrunk by the use of FPGAs are of keen interest to us. Filtered backprojection is finding its way into both radar and medical imaging applications, and FPGAs are well suited to handling a portion of the algorithm. The results are outstanding.
Searching for an enemy vehicle from a high-flying Unmanned Aerial Vehicle (UAV). Taking a child to the hospital after a tumble from a bike, for a CT scan of a broken arm. The imaging techniques in these two seemingly unrelated scenarios rely on the same technology to produce accurate results: the Filtered Backprojection (FBP) algorithm.
The FBP algorithm reconstructs an image of an object by calculating an exact solution for each pixel from a series of projections, that is, measurements of the object taken from many different angles. Both radar and medical imaging require a high degree of imaging accuracy. The FBP algorithm provides excellent imaging accuracy, but at a large computational cost.
Heterogeneous reconfigurable systems provide the computational power that image reconstruction requires for both Synthetic Aperture Radar (SAR) backprojection imaging and Computed Tomography (CT) imaging, while delivering significant benefits such as smaller form factors and reduced power consumption.
Partitioning the FBP
The general implementation of the FBP algorithm is quite straightforward. A two-dimensional set of floating-point data representing the energy detected from illuminating an object (radar energy for SAR, X-rays for CT) is first filtered, which sharpens the data to counter the blurring that simple, unfiltered backprojection would produce (the filter is usually windowed to limit noise). Then each projection's contribution to the reconstruction is summed into each pixel of the image.
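For the parallel-beam geometry used in textbook CT, the filter and sum steps can be written compactly. This is a sketch under stated assumptions, not the exact formulation used in the systems described here: it assumes P projections p_θ evenly spaced over [0, π) and the ideal ramp filter |ω|. SAR backprojection uses its own mapping from each pixel to a data sample (based on the radar's position at each pulse), but it shares the same filter-then-sum structure.

```latex
q_{\theta_k}(t) = \mathcal{F}^{-1}\!\left\{ |\omega| \cdot \mathcal{F}\{p_{\theta_k}\}(\omega) \right\}(t)
\qquad
f(x,y) \approx \Delta\theta \sum_{k=0}^{P-1} q_{\theta_k}\!\left(x\cos\theta_k + y\sin\theta_k\right),
\quad \Delta\theta = \frac{\pi}{P}
```

The first expression is the filtering step; the second is the summation into each pixel (x, y), which is where the computational cost concentrates.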
For images of a useful size, the summation over all pixels is the computational bottleneck. Summing every projection's contribution into a single pixel is fast; summing the effect of every projection into each of the more than one million pixels of a 1024 x 1024 image, or of an even larger image, is intensely time-consuming.
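A rough count makes the bottleneck concrete. Assuming, purely for illustration, an N x N image with N = 1024 and P = 1024 projections (the article does not state a projection count), every pixel receives one interpolate-and-accumulate update per projection:

```latex
\text{updates} = N^2 P = 1024^2 \times 1024 = 1{,}073{,}741{,}824 \approx 1.07 \times 10^{9}
\\
N = P = 2048: \quad 2048^3 = 8{,}589{,}934{,}592 \approx 8.6 \times 10^{9} \quad (8\times\ \text{more})
```

By contrast, filtering P projections of length M with FFTs grows only as P · M log M, which is why the per-pixel summation dominates as images grow.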
Programmers can achieve high performance for FBP algorithms by spreading the application across an FPGA and a traditional CPU combined in a heterogeneous reconfigurable system, as shown in Figure 1. An FPGA contributes to a reconfigurable system's performance by letting a programmer dedicate the device explicitly and completely to the regular, uninterrupted streaming portions of a program. Beyond this dedicated computational capacity, an FPGA also offers parallelism that can be exploited. The CPU, with its high clock rate, is an excellent workhorse for executing the irregular and conditional portions of any program.
Equally sized sets of projection data stream into the FPGA for filtering, which is easily implemented in the frequency domain as a multiplication (multiplying in the frequency domain is equivalent to convolving in the spatial domain). Spatial-domain projection data sets are streamed through a fast Fourier transform (FFT), multiplied by a matrix of filter coefficients, and then converted back to the spatial domain with an inverse FFT. These three operations are pipelined (overlapped in time) for maximum data throughput through the FPGA.
After filtering, the backprojection (summation) step of the FBP algorithm is independent for each pixel, so it can be implemented as a set of parallel summation units executing simultaneously in the FPGA for maximum application performance.
In a typical imaging application, the microprocessor generates the appropriate filter coefficients and acquires the sensor input data. After FBP processing by the MAP processor (SRC's FPGA-based processor), the microprocessor may display or store the final image data.
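As an example of the host's role, the sketch below fills a coefficient array for an n-point FFT with the ideal ramp filter |k|, stored in the usual FFT bin order. The function name `make_ramp_filter` and the normalization are illustrative assumptions; production systems commonly apply a window (for example Shepp-Logan or Hann) to the ramp to limit noise amplification.

```c
#include <stddef.h>

/* Fill H[0..n-1] with ramp coefficients |k| in FFT bin order:
   bins 0..n/2 hold the non-negative frequencies and bins n/2+1..n-1
   the negative ones. Normalized so the highest frequency (bin n/2)
   is 1. n must be even. */
void make_ramp_filter(double *H, size_t n)
{
    for (size_t k = 0; k < n; k++) {
        size_t f = (k <= n / 2) ? k : n - k;   /* |frequency index| */
        H[k] = (double)f / (double)(n / 2);
    }
}
```

The resulting array is real and symmetric (H[k] equals H[n - k]), so multiplying it into the spectrum of a real projection keeps the filtered projection real.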
Shrinking an SAR system
Given the high computational cost of the backprojection algorithm, it can be an engineering challenge to deploy such radar imaging in portable computing systems and small UAVs while keeping overall vehicle size, weight, and power (SWaP) down to acceptable levels.
During early requirements studies for various UAV programs, the United States Air Force was concerned about the feasibility of deploying SAR backprojection on mid-sized UAVs. Could a reconfigurable SAR system meet both the SWaP requirements and the processing requirements?
To address these concerns, engineers from the Air Force Research Laboratory (AFRL) and SRC Computers jointly benchmarked the Spotlight SAR algorithm in 2005. The results, obtained on the SRC-6 Portable MAPstation, which contains both a 1.6 GHz Intel Pentium M CPU and SRC's FPGA-based Series F MAP processor, showed a 75x performance increase using both the CPU and the MAP over using the CPU alone. The absolute wall-clock time for the benchmark execution was well within the Air Force requirements.
Based upon these published results, Lockheed Martin recently selected SRC to provide the Signal Data Processor (SDP) for the United States Army's Tactical Reconnaissance and Counter-Concealment Enabled Radar (TRACER) program. This system, scheduled to fly on the Warrior UAV in 2008, contains multiple MAP processors for even greater throughput.
Extending to medical imaging
Last year, discussions with customers in the medical imaging field indicated that CT scan image reconstruction used the same FBP algorithm as SAR imaging. The application engineers at SRC Computers found an open source CT scanner simulation program, CTsim, and ported the FBP portion of this medical imaging application to the SRC-6 MAPstation.
The results of this implementation, reconstructing a 1024 x 1024 image, showed a 22x performance improvement for the combined CPU and Series E MAP over the MAPstation's 2.8 GHz 32-bit Xeon CPU alone. These results indicate that manufacturers of CT scan equipment could increase the resolution and clarity of their equipment's output, draw significantly less power than traditional CPU-based systems, and still enjoy faster image reconstruction.
SRC-7 MAPstation: Early results
When the new SRC-7 MAPstation started shipping in 2007, application engineers at SRC began work to obtain updated application performance data on several programs, including CTsim and SAR backprojection.
For CTsim, a direct recompilation of the existing Series E MAP processor code, without using any of the Series H MAP enhancements, yields a 29x performance improvement compared with a 3.0 GHz 64-bit Xeon CPU. Proposed modifications to the FBP implementation in CTsim to take advantage of the Series H MAP (shown in Figure 2) suggest that a 60x performance improvement may be realized with minimal effort.
For SAR backprojection, work is underway to port and optimize this code for the Series H MAP. Analysis indicates that a 104x performance improvement (relative to a 3.0 GHz 64-bit Xeon CPU) is a reasonable expectation when this work is complete. It is worth noting that the high-performance FFTW library was used to obtain the CPU-only execution measurements. The final performance results for CTsim and SAR backprojection executing on the SRC-7 will be published as soon as they are verified.
Some of the SRC-7 system improvements that enable this performance include a 50 percent faster FPGA clock rate, a five-fold increase in system interconnect payload bandwidth (from 2.8 GBps to 14.4 GBps), and the adoption of Altera FPGAs, resulting in a more than three-fold increase in the MAP processor's floating-point performance.
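The interconnect figures can be put in concrete terms. Assuming 1 GBps = 10^9 bytes per second, payload only with no protocol overhead, and a 1024 x 1024 image of 4-byte single-precision values (the article does not specify the data format), the arithmetic is:

```latex
\frac{14.4\ \text{GBps}}{2.8\ \text{GBps}} \approx 5.1
\\
1024^2 \times 4\ \text{B} = 4{,}194{,}304\ \text{B}
\\
t_{\text{before}} = \frac{4{,}194{,}304\ \text{B}}{2.8 \times 10^{9}\ \text{B/s}} \approx 1.50\ \text{ms},
\qquad
t_{\text{after}} = \frac{4{,}194{,}304\ \text{B}}{14.4 \times 10^{9}\ \text{B/s}} \approx 0.29\ \text{ms}
```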
Reconfiguring system performance
Actual, realizable application performance on reconfigurable systems is important, but other aspects of application development also deserve consideration. FPGAs were once the domain of hardware engineers and Hardware Description Languages (HDLs) such as Verilog and VHDL, and programming FPGAs was notoriously difficult.
Recent advances, however, now make reconfigurable systems accessible to traditional software programmers. SRC Computers provides a unified programming environment, called Carte, which allows the programmer to use ANSI C or FORTRAN languages for combined CPU and MAP coding.
Software engineers, take note. Reconfigurable computing is spanning more industries and applications than ever before, and is no longer just a playground for hardware engineers.