Tightly coupling FPGAs with x86 processors

This article highlights the benefits of the Accelerator Abstraction Layer and discusses how military systems designers can now leverage a new class of COTS board and accelerator modules to solve some of their most demanding tasks.

Radar, cryptography, and Software Defined Radio (SDR) applications can often benefit from being run on FPGAs tightly coupled with multi-core, general-purpose processors. But until now, makers of FPGA devices have had to develop proprietary accelerator middleware so that platform-level services would be accessible from their products. This adds cost to In-Socket Accelerator products and may lock application software into specific accelerators, or even specific accelerator generations. Intel QuickAssist Technology includes both third-party In-Socket Accelerator (ISA) FPGA modules and an Accelerator Abstraction Layer (AAL) developed by Intel. AAL provides a consistent interface for application software so that underlying accelerator and general-purpose processor hardware can evolve independently and application software can more easily scale.

Uncovering the Issues of both GPP and FPGA

Radar, cryptography, SDR, and real-time surveillance are examples of military applications whose workloads are highly parallel. Over the years many solutions have been deployed that run on COTS general-purpose processor (GPP) blades. Using general-purpose processors rather than specialized hardware accelerators is cost-effective, since software is typically easier to develop, can scale from small to large systems, and can be easily ported to new processor generations. However, using GPPs to do all the processing may not be the most effective approach, especially when a platform's Size, Weight, and Power (SWaP) are constrained, or when extremely high performance at high efficiency is required.

FPGAs and GPPs: a good couple?

FPGAs require less power than GPPs of similar performance. Power savings are obviously important for UAVs, for example, where lightweight power and cooling systems can translate directly into larger fuel tanks that enable the craft to fly further and faster. Real-time performance is another key parameter, as when a system is trying to track a moving target. FPGA accelerators can process some classes of signals and algorithms faster than GPPs, which decreases the time to compute and can help save lives.

FPGAs, however, don't excel at every type of processing that modern integrated systems must handle. Many military applications have been designed to run on hybrid systems with tightly coupled GPPs and FPGAs.

But developing such customized GPP plus FPGA boards is time-consuming and expensive. It can take six months or longer to design and build a complex custom PCB containing processor(s), FPGA(s), memory, interfaces, and other components. Issues uncovered during hardware validation may require "blue wire" fixes, and boards may then need to be re-spun, which adds cost and delay. Even after a new board boots correctly, application code developers must wait until communication mechanisms are designed, built, and debugged so that their application software can interface to the board hardware. This communication layer adds to the cost of developing new accelerator technology. It also locks application software in to the specific communication mechanism used on a specific board design. Software may need to be substantially changed when the board design changes or even when new versions of accelerators are released.

This leaves such non-COTS-based solutions in a vulnerable state. EOL issues can happen at any time, even with major components, as the recent, unexpected removal of PA Semi and their future products demonstrates. Just as bad is that processor and FPGA technology is evolving so quickly that a custom PCB can be obsolete the day it is released to production.

Lastly, feature creep can get in the way as new requirements emerge from the battlefield that call for more processing, less power, or even different mechanicals. If more sockets, more cores, or a higher ratio of FPGAs to CPUs are now required, the designer is forced into a new spin of the board. These new challenges mean higher cost, longer project delays, and more room for the competition to get to market first.

Until recently, the lack of flexible, COTS-based solutions that tightly couple FPGAs to CPUs has pretty much dictated a custom design route. But ultimately, custom designs can be the reason your project misses requirements, budgets, or future product life targets.

In-Socket Accelerators and the Accelerator Abstraction Layer to the rescue

Intel has been working with third parties such as XtremeData, an expert in FPGA acceleration technologies, to develop a comprehensive approach to hardware-based acceleration. These In-Socket Accelerators are small plug-in modules that contain FPGAs, memory, and all the necessary support circuitry. These modules work on COTS systems by plugging directly into processor sockets on rack-mounted servers or bladed server boards. The FPGA modules are thus tightly coupled to the system's GPPs through the low-latency, high-bandwidth Front Side Bus (FSB). (Future modules are also planned for Intel's next-generation QuickPath Interconnect [QPI] technology.) Sophisticated FSB protocols ensure FPGAs and GPPs have a coherent view of memory and caches.

Application software scalability and extensibility are being addressed with high-performance abstraction middleware known as the Accelerator Abstraction Layer. AAL provides a uniform set of platform services for both GPPs and In-Socket Accelerators. It decouples applications from system implementation details so that the exact location, number, and taxonomy of GPP and Accelerator resources are transparent to the application.
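The decoupling idea can be illustrated with a toy model. The sketch below is not the AAL API: `ServiceRegistry`, `register`, `bind`, and `resource_for` are hypothetical names invented for illustration, and the "accelerator" is just another Python function. It only shows how an application might bind to a service by name while the layer decides which resource serves it.

```python
import math
from typing import Callable, Dict, List, Tuple

Backend = Callable[[List[float]], float]


class ServiceRegistry:
    """Toy abstraction layer: maps a service name to its available backends."""

    def __init__(self) -> None:
        # service name -> list of (priority, resource label, implementation)
        self._backends: Dict[str, List[Tuple[int, str, Backend]]] = {}

    def register(self, service: str, label: str, impl: Backend, priority: int) -> None:
        """Advertise an implementation of a service on some resource."""
        self._backends.setdefault(service, []).append((priority, label, impl))

    def _best(self, service: str) -> Tuple[int, str, Backend]:
        candidates = self._backends.get(service)
        if not candidates:
            raise LookupError(f"no resource offers service {service!r}")
        return max(candidates, key=lambda c: c[0])

    def bind(self, service: str) -> Backend:
        """Return the highest-priority implementation; the caller never sees where it runs."""
        return self._best(service)[2]

    def resource_for(self, service: str) -> str:
        """Diagnostic only: report which resource currently backs the service."""
        return self._best(service)[1]


def energy_on_gpp(samples: List[float]) -> float:
    """Plain software implementation."""
    return sum(s * s for s in samples)


def energy_on_accelerator(samples: List[float]) -> float:
    """Stand-in for an offloaded implementation (still ordinary Python here)."""
    return math.fsum(s * s for s in samples)


def signal_energy(registry: ServiceRegistry, samples: List[float]) -> float:
    """Application code: identical whether a GPP or an accelerator does the work."""
    compute = registry.bind("signal_energy")
    return compute(samples)
```

In this model, registering an accelerator backend with a higher priority changes which resource does the work, while `signal_energy` itself stays untouched.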

As Figure 1 shows, the AAL does not define domain-specific libraries or functions. Instead, it provides consistent interfaces that existing libraries and frameworks can use to interface to hardware accelerator modules. Developers can use the programming languages, libraries, and software development environments they already know.

Figure 1

AAL provides platform-level services for basic system operations such as discovery, binding, transport, and exception handling. AAL also provides services for using shared system memory as an efficient means of passing data between the host and accelerator. Unnecessary memory-to-memory copies are eliminated, and memory is efficiently mapped between the virtual address space used by the host CPU and the physical address space used by the FPGA accelerator.
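The zero-copy idea can be shown in isolation with a small sketch. It uses Python's `bytearray` and `memoryview` as stand-ins for a shared memory region; `allocate_shared` and `accelerator_scale` are hypothetical helpers, not AAL calls, and the virtual-to-physical address mapping that AAL performs is not modeled here.

```python
import struct

INT_SIZE = struct.calcsize("i")  # bytes per native int on this platform


def allocate_shared(count: int) -> bytearray:
    """Stand-in for allocating a shared region that holds `count` ints."""
    return bytearray(count * INT_SIZE)


def accelerator_scale(workspace: bytearray, factor: int) -> None:
    """Pretend accelerator: scales every int in place, never copying the buffer."""
    view = memoryview(workspace).cast("i")  # a typed window onto the same bytes
    for i in range(len(view)):
        view[i] = view[i] * factor
    view.release()


def run_demo() -> list:
    """Host writes inputs, the 'accelerator' works in place, host reads results."""
    workspace = allocate_shared(4)
    host_view = memoryview(workspace).cast("i")
    for i, value in enumerate([1, 2, 3, 4]):
        host_view[i] = value
    accelerator_scale(workspace, 10)
    # No data was copied back: the host view already sees the results.
    return host_view.tolist()
```

Both sides hold views onto one buffer, so the only data movement is the accelerator's own reads and writes.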

The AAL has two major components: a common Unified Accelerator Interface (UAI) and Accelerator Abstraction Services (AAS). These provide access to the algorithms that enable accelerated performance, such as encryption, very large Fast Fourier Transforms (FFTs), and complex FIR filtering, without requiring application programmers to delve into the minutiae of a specific accelerator's architecture.

There's an Accelerator Function Unit at the door

Any accelerator design block can be plugged into the FPGA to become an Accelerator Function Unit (AFU). A double-precision 64K FFT is one example of an AFU that runs on an In-Socket FPGA module. After an AFU instance has been created, it can be used in conjunction with the domain-specific accelerator library to implement an accelerated function. It can also be stitched together with other AFUs to create an array of accelerators, such as the XtremeFFT SeaOfFFT() socket solution.

AAL provides a zero-copy interface to minimize latency and maximize throughput. It does this by allowing an application to access a shared memory block that is mapped (and locked) into user space and accessible to both the accelerator module and the general-purpose processor.

The application first allocates a block of shared memory for the source and destination matrices. It then initializes the source input buffer in shared memory and calls SeaOfFFT() in the domain-specific accelerator library. The library creates a message, includes the virtual pointers to the input and output buffers, and calls the Accelerator::ProcessMessage() function. The AFU proxy forwards the message to the appropriate Accelerator Interface Adapter (AIA), a software layer that provides a uniform device transport interface (for example, a device driver). Inside the AIA, a transaction descriptor is created and queued to the appropriate AFU for processing.

When the accelerator receives the doorbell signal from the host CPU, it processes the transaction, reading the source buffer and calculating the FFT's result, which it writes back into the shared memory. When it is finished, it issues another doorbell signal back to the host CPU, and the AIA device driver reads the transaction result and triggers a callback to the accelerator library indicating that the very large FFT function has completed. Finally, the application is able to read the results directly from shared memory (Figure 2).
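The sequence just described (message, transaction descriptor, doorbell, in-place result, completion callback) can be traced in a toy, single-threaded simulation. Every name below (`TransactionDescriptor`, `ToyAccelerator`, `ToyInterfaceAdapter`, `fft_via_accelerator`) is hypothetical and does not reflect the real SeaOfFFT() or Accelerator::ProcessMessage() signatures. Python lists stand in for shared memory, and a naive DFT stands in for the FPGA's FFT core.

```python
import cmath
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List


@dataclass
class TransactionDescriptor:
    src: List[complex]               # source buffer in "shared memory"
    dst: List[complex]               # destination buffer in "shared memory"
    on_complete: Callable[[], None]  # callback fired when the result is ready


def naive_dft(x: List[complex]) -> List[complex]:
    """O(n^2) discrete Fourier transform, standing in for the hardware FFT."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n))
        for m in range(n)
    ]


class ToyAccelerator:
    """Pretend AFU: holds a queue of descriptors and reacts to a doorbell."""

    def __init__(self) -> None:
        self.pending: Deque[TransactionDescriptor] = deque()

    def doorbell(self) -> None:
        """Host-to-accelerator signal: process every queued transaction."""
        while self.pending:
            txn = self.pending.popleft()
            txn.dst[:] = naive_dft(txn.src)  # result is written into the shared buffer
            txn.on_complete()                # models the doorbell back to the host


class ToyInterfaceAdapter:
    """Stand-in for the AIA: turns a message into a queued descriptor."""

    def __init__(self, accelerator: ToyAccelerator) -> None:
        self.accelerator = accelerator

    def process_message(self, src: List[complex], dst: List[complex],
                        on_complete: Callable[[], None]) -> None:
        self.accelerator.pending.append(TransactionDescriptor(src, dst, on_complete))
        self.accelerator.doorbell()


def fft_via_accelerator(adapter: ToyInterfaceAdapter,
                        src: List[complex], dst: List[complex]) -> bool:
    """Library-level call: submit the buffers, report whether the callback fired."""
    completed: List[bool] = []
    adapter.process_message(src, dst, lambda: completed.append(True))
    return bool(completed)
```

After `fft_via_accelerator` returns, the caller reads the transform directly from the `dst` list it passed in, mirroring how the application reads results from shared memory. A real system would run the accelerator asynchronously; the synchronous loop here is a simplification.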

Figure 2

Reaping the benefits

The benefits of using AAL and In-Socket Accelerators for military system designers are:

  • Decreased development time. Proprietary acceleration layers no longer have to be developed for each new device. In-Socket Accelerators will work with a variety of standard rack-mount or bladed server boards.
  • Increased flexibility. Designers have more options for balancing power consumption, processing speed, cost, and feature sets. End-users can choose devices and solutions that fit changing requirements without being tied to a particular accelerator generation or even a specific accelerator technology.
  • Evolution ready. AAL will allow safe and easy migration as processor designs change and as next-generation In-Socket Accelerators and future bus technologies come to market.

True "accelerated COTS solutions" are now available. When shrinking development time is of the essence, FPGA-based In-Socket Accelerators working with general-purpose processors via the AAL on COTS platforms are up to the challenge. Solutions like these can help designers gain significant competitive advantages. You can now design and deploy lower cost, lower power systems that solve problems faster. And best of all, your software investment is now accelerated and portable at the same time.

Geno Valente is VP of Sales and Marketing for XtremeData, Inc., maker of very high-performance database Decision Support Systems (DSS) and other accelerated appliances. Geno has spent the last 13-plus years helping support, sell, and market FPGA technology into markets such as Financial Services, Bioinformatics, and High-Performance Computing, while working for Altera Corporation and now XtremeData.

Peter Carlston is a Platform Architect in Intel Corporation's Embedded Computing Division. He has held a wide variety of systems and software engineering positions at Intel and Unisys.

XtremeData, Inc.